dustine32 opened 5 years ago
@dustine32 It would be interesting to know what size the full load compresses to. It may be that transport and reuse is not that large a problem.
@kltm Oh hey, I was just about to include this anyway, how convenient! How big is a compressed dump of our PAINT DB? The 2019-06-12 compressed tar is 5.3 GB.
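For anyone who wants to reproduce the size check, a minimal sketch (the dump directory name here is a stand-in, not the real path):

```shell
# Measure what a compressed PAINT DB dump costs to ship.
# "paint_db_dump/" is a hypothetical stand-in for the real dump directory.
mkdir -p paint_db_dump                          # stand-in data for this sketch
echo "example table data" > paint_db_dump/genes.sql
tar -czf paint-db-dump.tar.gz paint_db_dump/    # compress the whole dump
du -h paint-db-dump.tar.gz                      # compressed size on disk
```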
We can bring Anushya @mugitty into the conversation. She dealt with the TAIR group about the database dump for the phylogene project. She is also maintaining the current PANTHER web service.
We should be able to add new services to the PAINT server to generate IBA GAFs.
For the record, our Zenodo dumps for GO are around 40GB, so that would not be an issue either way: https://doi.org/10.5281/zenodo.3267438
For consideration, this is roughly what is currently done to generate IBA GAFs:
The pipeline produces the main and _other IBA GAF files. The HPC-stored input files are used to grab data that could just as well be queried from the DB, except for the TAIR/Araport files, which are small enough to be committed to a repo (the Araport file already is).
If we go the API route, the current compute time (~15-20 min) would be a concern unless we precomputed the GAFs (as we do now each month) and simply returned them in the response. Alternatively, we could try rewriting the createGAF.pl functionality to make it faster.
I guess this is assuming the API would be responsible for the whole shebang. It could instead just be used to query the input data (one query per request), with the createGAF.pl functionality run by the requester. With the current queries, we're looking at up to 3 min.
From the 2019-09-05 GO software call, it sounds like the main problem is not having the ability to test different versions of IBA GAFs in the GO pipeline. I think this can pretty much be solved by just creating a folder naming convention to organize our GAF releases.
@kltm I created a separate issue #34 for the versioned folder structure.
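One possible shape for that convention, sketched below; the directory and file names are hypothetical and the real layout is what issue #34 decides:

```shell
# Sketch of a dated-folder convention for versioned IBA GAF releases.
touch paint_zfin.gaf paint_zfin_other.gaf   # stand-ins for pipeline output
RELEASE=releases/2019-09-05
mkdir -p "$RELEASE"
mv paint_zfin.gaf paint_zfin_other.gaf "$RELEASE"/
# A "current" symlink lets the GO pipeline test any version by pinning a
# dated path, while default runs follow the pointer.
ln -sfn 2019-09-05 releases/current
```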
This can be done a number of different ways:
Tagging @kltm @huaiyumi