khughitt / EuPathDB

EuPathDB recipe for the Bioconductor AnnotationHub

How to get current release (41) data of FungiDB through AnnotationHub #5

Closed cparsania closed 5 years ago

cparsania commented 5 years ago

Hi,

I am using AnnotationHub to access FungiDB data. All the FungiDB data there (object classes: OrgDb and GRanges) come from FungiDB release 39. However, the current release of FungiDB is 41. How can I use FungiDB release 41 data through AnnotationHub?

Thanks.

khughitt commented 5 years ago

Greetings!

The EuPathDB AnnotationHub resources will be updated with each new Bioconductor release (which in turn, follows the R release cycle), so the updated versions should be available sometime around April - May.

If you want to access the resources sooner, you can also use this package to generate the resources locally by modifying the scripts to build resources only for your species of interest.

Cheers,

abelew commented 5 years ago

Hi, If you wish to generate an orgdb/etc directly from eupathdb/fungidb, invoke a variant of one or more of the following (organismdbi in turn invokes the txdb and orgdb functions):

make_eupath_orgdb(species="substring_of_species_name", webservice="fungidb")
make_eupath_txdb(species="substring_of_species_name", webservice="fungidb")
make_eupath_organismdbi(species="substring_of_species_name", webservice="fungidb")
make_eupath_bsgenome(species="substring_of_species_name", webservice="fungidb")

Hopefully there are useful examples in the vignette. If you have any troubles, please drop a line and I can poke at it with you.

cparsania commented 5 years ago

Thanks @khughitt and @abelew. Very useful. As @abelew suggested, I ran the command. The command and its log are below. However, it seems the download failed.

> tt <- make_eupath_orgdb(species="Aspergillus nidulans", webservice="fungidb")
trying URL 'https://fungidb.org/fungidb/webservices/OrganismQuestions/GenomeDataTypes.json?o-fields=all'
trying URL 'http://fungidb.org/fungidb/webservices/OrganismQuestions/GenomeDataTypes.json?o-fields=all'
downloaded 924 KB

Downloaded: http://fungidb.org/fungidb/webservices/OrganismQuestions/GenomeDataTypes.json?o-fields=all
Found the following hits: Aspergillus nidulans FGSC A4, choosing the first.

Warning message:
In download.file(url = request_url, destfile = metadata_json) :
  cannot open URL 'https://fungidb.org/fungidb/webservices/OrganismQuestions/GenomeDataTypes.json?o-fields=all': HTTP status was '404 Not Found'
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Operation was aborted by an application callback

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Operation was aborted by an application callback
Getting the set of possible genes.
  |===================================================================================================================================================| 100%
Downloading orthologs one gene at a time. Checkpointing if it fails.
  |                                                                                                                                      |   0%Downloading:  1/0, and checkpointing to ortho_checkpoint_FungiDB-41_AnidulansFGSCA4.rda
Downloading:  0/0, and checkpointing to ortho_checkpoint_FungiDB-41_AnidulansFGSCA4.rda

Saving annotations to eupathdb/FungiDB-41_AnidulansFGSCA4ortholog_table.rda

Warning messages:
1: In get_orthologs_one_gene(species = species, gene = gene, entry = entry) :
  There is a missing parameter.
2: In get_orthologs_one_gene(species = species, gene = gene, entry = entry) :
  There is a missing parameter.
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Operation was aborted by an application callback

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Operation was aborted by an application callback
Error in loadNamespace(name) : there is no package called ‘KEGGREST’
Error in as.data.frame.default(vector, stringsAsFactors = FALSE) : 
  cannot coerce class ‘"try-error"’ to a data.frame
Warning message:
In make_eupath_orgdb(species = "Aspergillus nidulans", webservice = "fungidb") :
  Unable to create an orgdb for this species.

abelew commented 5 years ago

Hi, I am checking it out now by adding a local/test_001_anidulans.R. My initial guess comes from the fact that not all eupathdb services have yet moved to https. I have in place some checks for that; but I do not think I have every query wrapped yet to fall back to http in case of failure. Ideally I should have a commit pushed momentarily which takes care of this.
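The https-to-http fallback described here could look roughly like the sketch below. It is not the package's actual code: `fetch_with_fallback` and its `fetch` argument are hypothetical names chosen for illustration, and the default fetcher simply reads the URL with base R.

```r
# Try the URL as given, then the same URL over plain http, and return the
# first attempt that succeeds. `fetch` is any function that takes a URL and
# either returns content or signals an error (hypothetical interface).
fetch_with_fallback <- function(url,
                                fetch = function(u) readLines(u, warn = FALSE)) {
  candidates <- unique(c(url, sub("^https://", "http://", url)))
  last_error <- NULL
  for (candidate in candidates) {
    result <- tryCatch(fetch(candidate), error = function(e) e)
    if (!inherits(result, "error")) {
      return(list(url = candidate, content = result))
    }
    last_error <- result
  }
  stop("All candidate URLs failed; last error: ", conditionMessage(last_error))
}
```

Wrapping every webservice query in a helper like this keeps the fallback logic in one place instead of repeating it per query.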

abelew commented 5 years ago

Ah it turns out that either the eupathdb folks changed how they return a gene with no orthologs, or I did not account for it properly. But either way it seems to be running happily now.

I think it will finish generating the orgdb in the next 10-20 minutes. I will upload it to: http://www.umiacs.umd.edu/~abelew/eupathdb/ when it is finished; though it might be interesting to see if you get a different result than I did in my testing.

abelew commented 5 years ago

Sorry I took so long. I uploaded a tarball of the orgdb package. On my machine at least, it installed via R CMD INSTALL without shenanigans.

After installing it, I used load_eupath_annotations() and got back a data frame with 20,367 rows and 78 columns. Full details are in test_001_anidulans.R.

cparsania commented 5 years ago

I downloaded the tarball from the link you gave, but I couldn't find the script test_001_anidulans.R. Also, I want version 41 of all the FungiDB species (OrgDb and GRanges objects), not just A. nidulans. Can I do that using this package (khughitt/EuPathDB)?

abelew commented 5 years ago

Ah yeah, that is in my fork of the repository. I need to PR it. With respect to the set of all species, I have been meaning to do that in preparation for the next release but haven't. I will start that process now while I am thinking about it. But to answer your question, yes. The scripts/ directory in the package contains the scripts Keith and I wrote for that purpose.

With that in mind, I am queuing up a generator for the rest of the fungidb now on our cluster. I think I will try parallelizing it and see if I make the eupathdb webservers sad.

abelew commented 5 years ago

I have the set of all fungidb packages generating now. I used it as an opportunity to rip out some of the redundant code which was left behind by my original implementation of this. As it stands, my test file is creating a list with the result of each creation of a txdb, orgdb, organismdbi, bsgenome, and granges. Upon completion, it will save this list as an rda file; if you wish, I can upload it. It will therefore have each granges object as an element of the list at position: retlist[[species]][["granges"]]. I will happily also upload the set of orgdbs, but that might require some shenanigans, as I think they are larger than my quota. Though at that point I suppose the data will be ready to send to annotationhub; so you may wish to simply take it from them.
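As a rough illustration of the list layout described above, the sketch below pulls one element (for example "granges") out of every species entry. `extract_element` is a hypothetical helper, and the example assumes only the `retlist[[species]][["granges"]]` structure mentioned in the comment.

```r
# Collect one named element from each species entry of a nested list,
# dropping species for which that element is missing.
extract_element <- function(retlist, element = "granges") {
  found <- lapply(retlist, function(entry) entry[[element]])
  Filter(Negate(is.null), found)
}

# Placeholder data standing in for the real per-species results.
example_retlist <- list(
  "Aspergillus nidulans FGSC A4" = list(orgdb = "orgdb placeholder",
                                        granges = "granges placeholder"),
  "Some other species" = list(orgdb = "orgdb placeholder")
)
granges_by_species <- extract_element(example_retlist, "granges")
```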

abelew commented 5 years ago

An update: I have the fungidb species running. It got interrupted due to some difficulties in extracting the ortholog tables. I have therefore been writing with the very kind and patient person at the EuPathDB who seems to get stuck with all my questions. They are making some changes to speed up some of these queries, so I am hopeful that all the fungidb species will be ready relatively soon (before it was finishing ~ 5 species / day on one node, once these changes get propagated I think it will be able to do 40-50 / day / node).
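Long-running, interruptible downloads like these are usually made resumable with a checkpoint file. The sketch below shows the general idea under stated assumptions: `run_with_checkpoint` and `work` are hypothetical names, and it stores progress with `saveRDS()` rather than the `.rda` files seen in the log earlier in this thread.

```r
# Process character `items` one at a time, saving progress after each success
# so that an interrupted run resumes where it stopped. `work` is a hypothetical
# per-item function (for example, one ortholog query) that should return a
# non-NULL value; an item with no hits should return an empty result, not NULL.
run_with_checkpoint <- function(items, work, checkpoint_file) {
  results <- list()
  if (file.exists(checkpoint_file)) {
    results <- readRDS(checkpoint_file)
  }
  remaining <- setdiff(items, names(results))
  for (item in remaining) {
    outcome <- tryCatch(work(item), error = function(e) NULL)
    if (is.null(outcome)) next  # failed items stay unrecorded and are retried
    results[[item]] <- outcome
    saveRDS(results, checkpoint_file)
  }
  results
}
```

Calling the function again with the same checkpoint file only repeats the items that failed or were never reached.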

abelew commented 5 years ago

The 140 species from FungiDB have finished. I am now generating the rest of the eupathdb (227 more to go!). The full set of packages are 14G. I am copying them now to:

http://128.8.184.22/~trey/

If you get the itch to grab them, please give me a heads up when you are finished so that I can clear it out and kill the web server.

cparsania commented 5 years ago

Hi @abelew Big thanks !!

I have one question. Can you tell me which FungiDB release you used to generate all the data? The latest release is 41.

Also, can I download them through AnnotationHub, or do I have to download each of them individually?

cparsania commented 5 years ago

I compared org.Anidulans.FGSC.A4.v42.eg.db against the v39 version obtained from AnnotationHub. v42 seems much more up to date. However, I have one concern that I hope you can fix.

The v42 orgdb has several new columns compared to v39. Some of them differ only in column name while the content is the same, as shown in the table below. If the names were kept consistent, dependent packages would not break.

| v39 column name | v42 column name |
| --- | --- |
| GO_ID | GO_GO_ID |
| GO_TERM_NAME | GO_GO_TERM_NAME |
| EVIDENCE_CODE | GO_EVIDENCE_CODE |
| ONTOLOGY | GO_ONTOLOGY |

One more thing is the gene description column. In v39 it was named GENEDESCRIPTION, which I cannot find in v42.
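Until the names are harmonised upstream, downstream code can map the v42 names back to the v39 ones itself. The sketch below is one way to do that on a data frame of query results; `restore_v39_names` is a hypothetical helper, and the mapping contains only the four pairs listed in this comment.

```r
# v42 column names (vector names) and their v39 equivalents (vector values).
v42_to_v39 <- c(
  GO_GO_ID         = "GO_ID",
  GO_GO_TERM_NAME  = "GO_TERM_NAME",
  GO_EVIDENCE_CODE = "EVIDENCE_CODE",
  GO_ONTOLOGY      = "ONTOLOGY"
)

# Rename any v42-style columns found in `df`; other columns are left untouched.
restore_v39_names <- function(df, mapping = v42_to_v39) {
  hits <- names(df) %in% names(mapping)
  names(df)[hits] <- unname(mapping[names(df)[hits]])
  df
}
```

Applying the helper to a v39 result is a no-op, so the same downstream code can accept either release.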

cparsania commented 5 years ago

I tried to download the packages from the link you provided. 15 of them failed to download.

Warning messages:
 cannot open URL 'http://128.8.184.22/~trey/org.Cneoformans.var.grubii.H99.v42.eg.db_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/org.Cneoformans.var.grubii.KN99.v42.eg.db_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/org.Cneoformans.var.neoformans.JEC21.v42.eg.db_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/org.Pcinnamomi.var.cinnamomi.CBS.144.22.v42.eg.db_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/org.Pultimum.var.sporangiiferum.BR650.v42.eg.db_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

Warning messages:

  cannot open URL 'http://128.8.184.22/~trey/BSGenome.Cryptococcus.neoformans.var.grubii.H99.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/BSGenome.Cryptococcus.neoformans.var.grubii.KN99.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/BSGenome.Cryptococcus.neoformans.var.neoformans.B.3501A.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/BSGenome.Cryptococcus.neoformans.var.neoformans.JEC21.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/BSGenome.Phytophthora.cinnamomi.var.cinnamomi.CBS.144.22.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/TxDb.Cryptococcus.neoformans.var.grubii.H99.FungiDB.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/TxDb.Cryptococcus.neoformans.var.grubii.KN99.FungiDB.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/TxDb.Cryptococcus.neoformans.var.neoformans.JEC21.FungiDB.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/TxDb.Phytophthora.cinnamomi.var.cinnamomi.CBS.144.22.FungiDB.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

  cannot open URL 'http://128.8.184.22/~trey/TxDb.Pythium.ultimum.var.sporangiiferum.BR650.FungiDB.v42_2019.03.tar.gz': HTTP status was '500 Internal Server Error'

cparsania commented 5 years ago

Why is there no GRanges object at the link you provided?

abelew commented 5 years ago

I will respond in reverse order because I get confused. I forgot to copy the GRanges rda files. This is because I keep forgetting they exist: they come directly from calling rtracklayer::import.gff3(gff_file), so in my own work I just make sure to have the gff files available. I will copy them over once I remember where they are (probably 5-10 minutes).
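For readers who only have the GFF files, a GRanges rda file can be produced along these lines. This is a minimal sketch, not the package's own code: `gff_to_granges_rda` is a hypothetical wrapper, and it assumes the Bioconductor package rtracklayer is installed.

```r
# Import a GFF3 file as a GRanges object and save it to an rda file.
gff_to_granges_rda <- function(gff_file, rda_file) {
  if (!requireNamespace("rtracklayer", quietly = TRUE)) {
    stop("The rtracklayer package is required for this sketch.")
  }
  granges <- rtracklayer::import.gff3(gff_file)
  save(granges, file = rda_file)
  invisible(granges)
}

# Example call (file names are placeholders):
# gff_to_granges_rda("FungiDB-42_AnidulansFGSCA4.gff", "AnidulansFGSCA4_granges.rda")
```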

With respect to the weird download errors, Apache's mod_negotiation was confused by those filenames and thought the '.var' in the filenames was telling it that they were another language. I just disabled mod_negotiation and they should work fine now.
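When fetching many tarballs at once, it helps to record which URLs failed so that only those are retried after a server-side fix like this one. The sketch below is a hedged example using base R; `download_all` and its `downloader` argument are hypothetical names, with `utils::download.file` as the default.

```r
# Download each URL into `dest_dir` and return the URLs that failed.
# `downloader` must accept (url, destfile, ...) and return 0 on success,
# as utils::download.file does.
download_all <- function(urls, dest_dir, downloader = utils::download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  failed <- character()
  for (u in urls) {
    dest <- file.path(dest_dir, basename(u))
    status <- tryCatch(
      downloader(u, dest, mode = "wb", quiet = TRUE),
      error = function(e) 1L,
      warning = function(w) 1L
    )
    if (!identical(as.integer(status), 0L)) {
      failed <- c(failed, u)
    }
  }
  failed
}
```

The returned vector can be passed straight back into `download_all()` for a second attempt.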

Finally, your first queries: These are eupathdb release 42 packages.

The column names are an interesting concern. I changed them because there are duplicate columns in different tables with the same name. In order to avoid the resulting collision, I prefixed the column names with their home table. Thus you will find columns with the following prefixes (and their source):

  1. ANNOT_: These columns are all acquired by querying the eupathdb annotation table and provide a majority of the likely interesting material. For the interested, these are acquired by querying the eupathdb webservice 'GenesByMolecularWeight.xml' with a very loose set of criteria (e.g., 100 daltons to 100 billion daltons), querying for the set of available columns, and then asking for all of them.
  2. GO_: These columns come from the eupathdb GO table. For a time I had them set up to avoid names like 'GO_GO_ID' but I stopped it for some reason which I do not remember. It would be trivial for me to once again make sure this is 'GO_ID'. When using the organismdbi package, these are cross referenced to the GO.db package.
  3. INTERPRO_: The interpro eupathdb data is dumped into these.
  4. KEGGREST_: When possible, I perform a separate query to the KEGGREST service to extract the relevant KEGG data.
  5. PATHWAY_: The reactome table. When one uses the organismdbi package, these are cross referenced to the reactome.db package as a result.
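Given the prefix scheme listed above, the columns of an orgdb can be grouped by their source table with a few lines of base R. This is only an illustrative sketch; `group_columns_by_source` is a hypothetical helper, and anything without one of the five prefixes is placed under "OTHER".

```r
# Group column names by their table prefix (the text before the first "_").
group_columns_by_source <- function(columns) {
  known <- c("ANNOT", "GO", "INTERPRO", "KEGGREST", "PATHWAY")
  prefix <- sub("_.*$", "", columns)
  prefix[!prefix %in% known] <- "OTHER"
  split(columns, prefix)
}

# Example with column names mentioned in this thread:
groups <- group_columns_by_source(
  c("ANNOT_GENE_PRODUCT", "ANNOT_PFAM_DESCRIPTION", "GO_GO_ID", "GID")
)
```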

Finally, I think the eupathdb folks renamed the gene description column to 'gene_product', and as such you will find it under 'ANNOT_GENE_PRODUCT'; in addition, the 'ANNOT_PFAM_DESCRIPTION' column might be of interest.

Oh, I skipped one other question: These packages are not yet available in AnnotationHub. I am hoping to learn how to upload them shortly from the AnnotationHub folks; though I have not yet finished generating all of the other eupathdb packages, so it will need to wait until those are complete (probably later today or tomorrow, last I looked it had finished generating 280 of 346).

I hope this helps. atb

abelew commented 5 years ago

The rda GRanges files are copied. I ended up regenerating a few that I accidentally deleted.

abelew commented 5 years ago

Do you mind if I close the issue? I think that if you are ok with the changed column names, then everything is complete.

cparsania commented 5 years ago

Sure. Thanks a lot for your support. I really appreciate it.

cparsania commented 5 years ago

Hi, sorry for troubling you again. As you mentioned previously, the latest FungiDB data will be added to AnnotationHub with the new Bioconductor release. The new Bioconductor release (3.9) is out now, so I updated both Bioconductor and AnnotationHub. The FungiDB data available in the latest release is still v39.

Another surprising thing is that there is no OrgDb object available in AnnotationHub now. As you can see below, rdataclass contains only GRanges and no other classes. Can you shed some light on this discrepancy?

>   hub <- AnnotationHub()
snapshotDate(): 2019-05-01
>   hub_subset <- query(hub , "fungidb")
> hub_subset
AnnotationHub with 114 records
# snapshotDate(): 2019-05-01 
# $dataprovider: FungiDB
# $species: Fusarium oxysporum, Histoplasma capsulatum, Cryptococcus gattii, Cryptococcus neoformans, Candida albicans, Coccidioid...
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH65267"]]' 

            title                                                                  
  AH65267 | Aspergillus versicolor CBS 583.65 transcript information               
  AH65268 | Aspergillus sydowii CBS 593.65 transcript information                  
  AH65269 | Phytophthora cinnamomi var. cinnamomi CBS 144.22 transcript information
  AH65272 | Aspergillus wentii DTO 134E9 transcript information                    
  AH65273 | Aspergillus zonatus CBS 506.65 transcript information                  
  ...       ...                                                                    
  AH65459 | Candida albicans SC5314_B transcript information                       
  AH65468 | Cryptococcus neoformans var. grubii KN99 transcript information        
  AH65496 | Magnaporthe oryzae BR32 transcript information                         
  AH65512 | Phytophthora ramorum strain Pr102 transcript information               
  AH65516 | Phytophthora sojae strain P6497 transcript information   
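A quick way to see which object classes a hub subset contains is to tabulate its `rdataclass` column. The helper below is a hypothetical sketch that works on any data frame with that column; the commented lines show how it might be fed from AnnotationHub, assuming the package is installed and a network connection is available.

```r
# Count records per rdataclass in a metadata table.
count_classes <- function(meta) {
  counts <- table(meta$rdataclass)
  setNames(as.integer(counts), names(counts))
}

# Possible use with AnnotationHub (not run here):
# library(AnnotationHub)
# hub <- AnnotationHub()
# fungidb <- query(hub, "FungiDB")
# count_classes(data.frame(rdataclass = fungidb$rdataclass))
# orgdbs <- query(hub, c("FungiDB", "OrgDb"))  # narrow the query to OrgDb records
```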

Thanks a lot .

abelew commented 5 years ago

Greetings, I do not know for certain, but I can say that I uploaded the set of eupathdb orgdb/granges/etc data to the AnnotationHub S3 instance last week. I was not working to the Bioconductor schedule, which I must admit I had completely forgotten. Therefore I would guess that the v42 versions are in the process of being populated by AnnotationHub. The person with whom I have been conversing at Bioconductor/AnnotationHub stated that she would check out the submissions this week (probably Monday or Tuesday). She has not written, thus I assumed that everything went smoothly.

My inclination would be to suggest waiting a bit longer to see if the new versions appear. Alternatively, I still have the set of generated packages and would happily provide them to you.

Finally, the eupathdb just released v43, weirdly just 1 month after v42. This new version may have introduced a couple of minor oddities, which I noticed when I went to check on some annotations (notably what seemed like a name change for Trypanosoma cruzi Esmeraldo, though that might have only been a typo on my part).

I suspect this is not a helpful response, but hopefully it does shed some light. If you are in a rush, I can put my web server back up and provide the v42 packages individually to you, or generate v43, which might be of interest specifically to you as it contains 7 new species/strains! I hope this finds you doing well, atb

cparsania commented 5 years ago

Hi, thanks for the prompt reply. I already have the v42 data which you uploaded previously. I can wait a few days for it to become available through AnnotationHub, as that will help me maintain my downstream code.

If needed I will ask you.

Thanks a lot, ~C.

cparsania commented 5 years ago

Hi,

Now I can see OrgDb and GRanges objects from FungiDB (v42) in AnnotationHub, which is fantastic. However, I cannot download the OrgDb objects, although the GRanges objects download fine.

I submitted an issue on AnnotationHub. I would be grateful if you could look into it and resolve it.

Many thanks, Chirag.