ropensci / biomartr

Genomic Data Retrieval with R
https://docs.ropensci.org/biomartr
212 stars 29 forks source link

getGenome function not working for viruses? #29

Closed pedur1 closed 6 years ago

pedur1 commented 6 years ago

Hi Hajk-Georg

I am trying to retrieve various reference genomes of viruses held in Genbank, or more specifically NCBI's RefSeq database.

whilst biomartr 0.7.0 seems to work well for most taxons - it does not seem to perform for viruses

here is code in which I compared a bacterial pathogen (Mycobacterium tuberculosis) with a virus (rabies virus)

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq")

is.genome.available(organism = "Rabies lyssavirus", db = "refseq")

both of these came back with the response "TRUE"

However, when I try fetching these two reference sequences using the following code:

Mtb.genome.refseq <- getGenome(db = "refseq", organism = "Mycobacterium tuberculosis",  path   = file.path("_ncbi_downloads","genomes"))

RABV.genome.refseq <- getGenome(db = "refseq", organism = "Rabies lyssavirus", path   = file.path("_ncbi_downloads","genomes"))

The M. tunberculosis fetch is successful, but the rabies virus fetch fails with the following message:

----------> No reference genome or representative genome was found for 'Rabies lyssavirus'. Thus, download for this species has been omitted. Have you tried to specify 'reference = FALSE' ?

adding the argument reference = FALSE does not solve the problem.

Thanks

PS. I am running the program in RStudio 1.1.447, R 3.5.0, Windows 10.

HajkD commented 6 years ago

Hi @pedur1,

Thank you very much for making me aware of this problem.

The reason why it didn't work could be due to the naming convention for viruses. Unfortunately, the scientific naming for viruses is still a wild west, because they include strain numbers, strain names, dots, (, [, etc without any standard or convention (this is also true for bacteria). This is very hard to standardize and so whenever I build a data retrieval query using the scientific name, sometimes the API doesn't recognize it and thus the retrieval stops.

I tried to somewhat overcome this limitation now buy revising my retrieval backend which should now work for your command:

Mtb.genome.refseq <- getGenome(db = "refseq", organism = "Mycobacterium tuberculosis",  path   = file.path("_ncbi_downloads","genomes"))

RABV.genome.refseq <- getGenome(db = "refseq", organism = "Rabies lyssavirus", path   = file.path("_ncbi_downloads","genomes"), reference = FALSE)

Please let me know if it works for you now using the developer version of biomartr:

source("http://bioconductor.org/biocLite.R")
biocLite("ropensci/biomartr")

In the new version of biomartr you will now also see that when you type

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq")
A reference or representative genome assembly is available for 'Mycobacterium tuberculosis'.
More than one entry was found for 'Mycobacterium tuberculosis'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Mycobacterium tuberculosis, db = refseq, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.

a new message system will make you aware that there are different strains of Mycobacterium tuberculosis available at RefSeq. By specifying the argument details = TRUE you can retrieve the names, accession ids and species_taxid of these strains and specify them in the getGenome() function.

E.g.

is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)
# A tibble: 5,377 x 21
   assembly_accessi… bioproject  biosample  wgs_master   refseq_category
   <chr>             <chr>       <chr>      <chr>        <chr>          
 1 GCF_000008585.1   PRJNA224116 SAMN02603… NA           na             
 2 GCF_000016145.1   PRJNA224116 SAMN02603… NA           na             
 3 GCF_000016925.1   PRJNA224116 SAMN00103… NA           na             
 4 GCF_000023625.1   PRJNA224116 SAMN00763… NA           na             
 5 GCF_000152105.1   PRJNA224116 SAMN02595… ADHQ0000000… na             
 6 GCF_000153685.2   PRJNA224116 SAMN03081… NA           na             
 7 GCF_000154585.2   PRJNA224116 SAMN03081… NA           na             
 8 GCF_000154605.2   PRJNA224116 SAMN00103… NA           na             
 9 GCF_000155795.1   PRJNA224116 SAMN02595… ABVM0000000… na             
10 GCF_000159755.1   PRJNA224116 SAMN02595… ACHP0000000… na             
# ... with 5,367 more rows, and 16 more variables: taxid <int>,
#   species_taxid <int>, organism_name <chr>, infraspecific_name <chr>,
#   isolate <chr>, version_status <chr>, assembly_level <chr>,
#   release_type <chr>, genome_rep <chr>, seq_rel_date <date>,
#   asm_name <chr>, submitter <chr>, gbrs_paired_asm <chr>,
#   paired_asm_comp <chr>, ftp_path <chr>, excluded_from_refseq <chr>
Mtb.genome.refseq <- getGenome(db = "refseq", organism = "1773",  path   = file.path("_ncbi_downloads","genomes"))

I hope this helps.

Many thanks and best wishes, Hajk