ropensci / biomartr

Genomic Data Retrieval with R
https://docs.ropensci.org/biomartr

skip_bacteria = FALSE resulting in unexpected behavior in is.genome.available() #105

Closed. JPReceveur closed this issue 9 months ago

JPReceveur commented 9 months ago

Hello, very helpful package!

Just wanted to bring attention to a potential bug that I noticed with the recent release. In the function is.genome.available(), specifying the argument skip_bacteria = FALSE still results in the bacterial download being skipped. From a quick look, the skip_bacteria argument of is.genome.available() does not appear to be passed to the underlying function.

For example, is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", skip_bacteria = FALSE) results in bacteria being skipped.

When I run getKingdomAssemblySummary(db = "refseq", skip_bacteria = FALSE) directly, it runs as expected and the bacteria data are downloaded. I worked around my issue by running getKingdomAssemblySummary() before is.genome.available().
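The workaround described above can be sketched roughly as follows. It uses only the two calls quoted in this report; the wrapper name check_with_bacteria() is hypothetical and exists only for illustration, and the sketch assumes biomartr 1.0.5 is installed and that network access to RefSeq is available when the function is called.

```r
# Hypothetical helper (not part of biomartr): applies the workaround
# from this report by fetching the assembly summaries, including
# bacteria, before asking whether a genome is available.
check_with_bacteria <- function(organism, db = "refseq") {
  # Step 1: retrieve the kingdom summaries with bacteria included.
  # Per the report, this call honours skip_bacteria = FALSE.
  biomartr::getKingdomAssemblySummary(db = db, skip_bacteria = FALSE)

  # Step 2: run the availability check afterwards, in the same R session.
  biomartr::is.genome.available(
    organism      = organism,
    db            = db,
    skip_bacteria = FALSE
  )
}

# Usage (downloads the large bacterial summary file when run):
# check_with_bacteria("Mycobacterium tuberculosis")
```

Defining the helper does not trigger any download; the retrieval only happens when it is called.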

I'm on version 1.0.5 off CRAN

HajkD commented 9 months ago

Dear @JPReceveur

Thank you so much for pointing this bug out to us. You were absolutely correct: the argument skip_bacteria was not passed on internally to the new internal function is.genome.available.refseq.genbank(). This is fixed now. Please let me know whether it works for you.
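For readers unfamiliar with this class of bug, here is a minimal, self-contained sketch of a wrapper that accepts an argument but never forwards it. All function names below are hypothetical stand-ins, not biomartr's actual code; the only thing taken from this thread is that bacteria are skipped by default.

```r
# Hypothetical inner function: returns the kingdoms that would be retrieved.
# Bacteria are skipped by default, mirroring the behaviour described here.
fetch_summary <- function(skip_bacteria = TRUE) {
  kingdoms <- c("archaea", "bacteria", "fungi")
  if (skip_bacteria) setdiff(kingdoms, "bacteria") else kingdoms
}

# Buggy wrapper: accepts skip_bacteria but never forwards it,
# so the inner default (TRUE) always wins.
available_buggy <- function(organism, skip_bacteria = TRUE) {
  fetch_summary()
}

# Fixed wrapper: forwards the caller's choice to the inner function.
available_fixed <- function(organism, skip_bacteria = TRUE) {
  fetch_summary(skip_bacteria = skip_bacteria)
}

buggy <- available_buggy("Mycobacterium tuberculosis", skip_bacteria = FALSE)
fixed <- available_fixed("Mycobacterium tuberculosis", skip_bacteria = FALSE)

"bacteria" %in% buggy  # FALSE: the argument was silently ignored
"bacteria" %in% fixed  # TRUE: the argument reached the inner function
```

Because R evaluates arguments lazily and does not warn about unused ones, the buggy wrapper runs without any error, which is why this kind of omission is easy to miss.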

With many thanks and very best wishes, Hajk

JPReceveur commented 9 months ago

Works great, thanks!

epartan commented 2 weeks ago

I believe the same issue is occurring for getCollection(): the bacterial reference is skipped even when skip_bacteria = FALSE is specified.

> biomartr::getCollection(organism = "Acinetobacter baumannii",
+                         skip_bacteria = FALSE)
-> Starting collection retrieval (genome, proteome, cds, rna, gff, repeat_masker, assembly_stats) for Acinetobacter_baumannii ...
It seems that this is the first time you run this command for refseq .
Thus, 'assembly_summary.txt' files for all kingdoms will be retrieved from refseq. 
Don't worry this has to be done only once if you don't restart your R session.

Due to its extended dataset size (GenBank: >700 MB, RefSeq: >150 MB) Kingdom 'bacteria' will not be downloaded by default anymore. To also include 'bacteria' please specify the argument 'skip_bacteria = FALSE'

-> Starting download for: archaea
--------> Skipping bacteria download .....                                                     

-> Starting download for: fungi
-> Starting download for: invertebrate                                                         
-> Starting download for: plant                                                                
-> Starting download for: protozoa                                                             
-> Starting download for: vertebrate_mammalian                                                 
-> Starting download for: vertebrate_other                                                     
-> Starting download for: viral