pantherdb / fullgo_paint_update

Update of Panther and PAINT DBs with monthly GO release data

Integrate IBA GAF generation into GO pipeline #33

Open dustine32 opened 5 years ago

dustine32 commented 5 years ago

This can be done a number of different ways:

  1. Package dumps of the PAINT DB into a Zenodo release (50 GB max). The GO pipeline would be configured to pull the Zenodo release and stand up a Docker instance to load the DB and generate GAFs.
  2. Develop an API service to allow the GO pipeline to query the PAINT DB.
  3. The PAINT update pipeline dumps the GAF-generation intermediate files (the files produced by these lines) for the GO pipeline to pick up, and GAFs are generated from these.
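For option 1, the "load the DB in Docker" step could look roughly like the compose fragment below. This is purely a sketch: the DB engine (PostgreSQL), image tag, database name, and dump directory are all assumptions, not the actual PAINT setup.

```yaml
# Hypothetical sketch only: engine, image, names, and paths are assumptions.
services:
  paint-db:
    image: postgres:11
    environment:
      POSTGRES_DB: paint
      POSTGRES_PASSWORD: changeme
    volumes:
      # SQL dump pulled from the Zenodo release; anything in this
      # directory is loaded automatically on first container start
      - ./zenodo-dump:/docker-entrypoint-initdb.d:ro
```

Once the container is up, the existing GAF-generation queries could run against it unchanged.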

Tagging @kltm @huaiyumi

kltm commented 5 years ago

@dustine32 It would be interesting to know what size the full load compresses to. It may be that transport and reuse is not that large a problem.

dustine32 commented 5 years ago

@kltm Oh hey, I was just about to include this anyway, how convenient! How big is a compressed dump of our PAINT DB? The 2019-06-12 compressed tar is 5.3 GB.

huaiyumi commented 5 years ago

We can bring Anushya @mugitty into the conversation. She worked with the TAIR group on the database dump for the phylogene project. She is also maintaining the current PANTHER web service.

mugitty commented 5 years ago

We should be able to add new services to the PAINT server to generate IBA GAFs.

kltm commented 5 years ago

For the record, our Zenodo dumps for GO are around 40 GB, so that would not be an issue either way: https://doi.org/10.5281/zenodo.3267438

dustine32 commented 5 years ago

For consideration, this is roughly what is currently done to generate IBA GAFs:

  1. A set of six queries is run against the PAINT DB to generate input data files for the createGAF.pl script.
  2. A few other files that are Panther version-specific (e.g. 13.1, 14.1) are also used. These are currently pulled from our HPC:
    • gene.dat (227 MB)
    • node.dat (137 MB)
    • Two TAIR/Araport ID lookup files (1 MB and 25 KB) - could be checked into repo
    • The entire library of tree_node files (282 MB) - used to load tree structure to propagate ancestor annotations to descendant species nodes.
  3. The createGAF.pl script is run (~10 min), loading all of these files into memory, to generate an IBD file and the MOD/species-specific IBA GAF files (plus the catch-all _other GAF).
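The propagation step in item 2 (pushing ancestor annotations down to descendant species nodes) can be sketched roughly as below. The tree representation, node names, and GO terms are invented for illustration; the real data comes from the tree_node files and the PAINT DB.

```python
# Minimal sketch of ancestor-to-descendant propagation: annotations on an
# internal tree node (e.g. an IBD) are inherited by every leaf beneath it
# (becoming IBA lines). Tree is a simple {parent: [children]} dict here.

def propagate(tree, annotations):
    """Return {leaf: set(GO terms)} by walking each subtree, accumulating
    the annotations seen on the path from the root down to each leaf."""
    result = {}

    def descend(node, inherited):
        inherited = inherited | annotations.get(node, set())
        children = tree.get(node, [])
        if not children:  # leaf (species) node
            if inherited:
                result[node] = inherited
        for child in children:
            descend(child, inherited)

    # Roots are nodes that never appear as anyone's child
    roots = set(tree) - {c for kids in tree.values() for c in kids}
    for root in roots:
        descend(root, set())
    return result

# Toy family: AN0 is the root, AN1 an internal node, leaves are species genes
toy_tree = {"AN0": ["AN1", "HUMAN_gene"], "AN1": ["MOUSE_gene", "TAIR_gene"]}
toy_annots = {"AN0": {"GO:0005634"}, "AN1": {"GO:0003677"}}
print(propagate(toy_tree, toy_annots))
```

Here HUMAN_gene inherits only the root's term, while the two genes under AN1 inherit both.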

The HPC-stored input files hold data that could just as well be queried from the DB, except for the TAIR/Araport files, which are small enough to be committed to a repo (the Araport file already is).

dustine32 commented 5 years ago

If we go the API route, the current compute time (~15-20 min) would be a concern unless we precomputed the GAFs (like we do now each month) and just returned them in the response. Or we could try rewriting the createGAF.pl functionality to make it faster.

I guess this is assuming the API would be responsible for the whole shebang. It could instead just be used to query the input data (one query per request), with the createGAF.pl functionality run by the requester. With the current queries, we're looking at up to 3 min.
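The precomputed variant could be as thin as the sketch below: the service just maps a requested MOD to an already-generated GAF on disk, so response time is file I/O rather than the 15-20 min compute. The directory layout and filenames here are hypothetical.

```python
import os

# Hypothetical layout: one precomputed IBA GAF per MOD from the monthly run.
GAF_DIR = "/data/paint/gaf/2019-06-12"  # placeholder release path

def gaf_path(mod: str) -> str:
    """Resolve a MOD name (e.g. 'mgi', 'tair') to its precomputed GAF file."""
    safe = mod.lower()
    if not safe.isalnum():  # reject path tricks like '../etc'
        raise ValueError(f"bad MOD name: {mod}")
    return os.path.join(GAF_DIR, f"paint_{safe}.gaf.gz")

print(gaf_path("MGI"))
```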

dustine32 commented 5 years ago

From the 2019-09-05 GO software call, it sounds like the main problem is not having the ability to test different versions of IBA GAFs in the GO pipeline. I think this can pretty much be solved by just creating a folder naming convention to organize our GAF releases.
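One possible shape for such a naming convention, keyed on Panther version plus release date, is sketched below. This is purely illustrative; the actual layout is a separate decision.

```python
from datetime import date

def release_dir(panther_version: str, release_date: date) -> str:
    """Build a versioned folder path for one IBA GAF release,
    e.g. ('14.1', 2019-09-05) -> 'IBA_GAFs/panther-14.1/2019-09-05'.
    The 'IBA_GAFs/panther-' prefix is a made-up convention."""
    return f"IBA_GAFs/panther-{panther_version}/{release_date.isoformat()}"

print(release_dir("14.1", date(2019, 9, 5)))
```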

@kltm I created a separate issue #34 for the versioned folder structure.