ropensci / biomartr

Genomic Data Retrieval with R
https://docs.ropensci.org/biomartr

How to download all genomes that belong to a specific taxon? #68

Open johanneswerner opened 3 years ago

johanneswerner commented 3 years ago

Hello,

cool package, thank you. :-)

Is there a way to download all genomes that belong to a certain genus or family? As far as I can see, I cannot use the taxid of the genus or family, because that taxid itself does not have a genome and carries no information about its descendant taxa.

I first tried the genus name. However, in the case of Enterococcus, this also returns lots of Enterococcus phage genomes, which I do not want to download here.

Thank you for your help.

HajkD commented 3 years ago

Hi Johannes,

Since this is a taxonomic classification issue, did you by any chance have a look at the taxize package to see what happens there with your particular case?

Since biomartr relies solely on internal data provided by NCBI, could you also check there what kind of records they store for your example? This will help me find strategies to capture such cases.

Many thanks, Hajk

johanneswerner commented 3 years ago

I have not worked with taxize yet. I also got the idea from the ncbi-genome-download package.

They have a Python script (which uses ete3) that gets the descendant taxa for a given taxid, see here.

The script is available here: https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py
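To make the difference between matching by name and matching by taxonomy concrete, here is a minimal Python sketch. It does not use ete3 or the linked script; it walks a small hand-written child-to-parent map instead. The map, the genome list, and the phage taxid (9000001) are illustrative assumptions, not real NCBI records.

```python
from collections import defaultdict, deque

# Toy child -> parent map; a real one would be derived from NCBI's taxonomy dump.
# The phage entry and its taxid are made-up placeholders.
PARENT = {
    1350: 81852,     # Enterococcus (genus) -> Enterococcaceae (family)
    1351: 1350,      # Enterococcus faecalis -> Enterococcus
    1352: 1350,      # Enterococcus faecium -> Enterococcus
    9000001: 10239,  # "Enterococcus phage X" (placeholder) -> Viruses
}


def descendants(root, parent_map):
    """Return the set of all taxids below `root` (breadth-first traversal)."""
    children = defaultdict(list)
    for child, parent in parent_map.items():
        children[parent].append(child)
    found, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Hypothetical genome records as (organism name, taxid) pairs.
GENOMES = [
    ("Enterococcus faecalis", 1351),
    ("Enterococcus faecium", 1352),
    ("Enterococcus phage X", 9000001),
]

# Matching on the name also catches the phage ...
by_name = [name for name, _ in GENOMES if "Enterococcus" in name]
# ... while matching on the taxonomic subtree of the genus does not.
wanted = descendants(1350, PARENT)
by_taxonomy = [name for name, taxid in GENOMES if taxid in wanted]

print(by_name)
print(by_taxonomy)
```

The phage is excluded in the second case because it sits under the virus branch of the taxonomy, not under the bacterial genus, even though its name contains the genus name.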

In any case, I will look into taxize; maybe I will get an idea from there as well. Thank you for the suggestion.

johanneswerner commented 3 years ago

Possibly linked with https://github.com/ropensci/biomartr/issues/6#issuecomment-788771908

johanneswerner commented 3 years ago

https://github.com/shenwei356/taxonkit/issues/41

johanneswerner commented 3 years ago

From the referenced issue

I wrote Python bindings for TaxonKit recently, if you’d like to use that as a reference. I primarily used the Popen construct in Python’s subprocess library to call TaxonKit, and then loaded the output into dataframes (pandas) for handling in Python.

https://github.com/bioforensics/pytaxonkit/blob/9746225b1c0a9eff708790037e3b53e5d45ac235/pytaxonkit.py#L203-L211

It’s been a long time since I did any serious work in R, so I’m not sure what the best tools are for system calls. But I imagine R’s native dataframes would be suitable for storing most results.

I hope this helps.

I fear this is not going to be pretty. It is not that hard to use system() in R to invoke shell commands, but I severely doubt that this will work platform-independently (especially on Windows). @HajkD I would certainly be open to suggestions (unless it is okay not to support Windows ...). If the conda recipe is still going to be built, we can add the functionality and add taxonkit as a requirement.
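For reference, the external-call approach from the quoted comment might look roughly like this in Python. This is a sketch, not the pytaxonkit code: it uses subprocess.run rather than Popen, returns a plain list rather than a pandas dataframe, and assumes that a taxonkit binary is on the PATH with its taxonomy data installed. The function names are hypothetical, and the exact output layout of `taxonkit list` is an assumption (one taxid at the start of each line, possibly indented).

```python
import subprocess


def parse_taxonkit_list(output):
    """Parse `taxonkit list` output into a list of integer taxids.

    Assumes each non-empty line starts with a taxid, optionally indented.
    """
    taxids = []
    for line in output.splitlines():
        line = line.strip()
        if line:
            taxids.append(int(line.split()[0]))
    return taxids


def list_subtree(taxid, taxonkit="taxonkit"):
    """Call TaxonKit and return the taxid together with its subtree.

    Assumption: `taxonkit list --ids <taxid> --indent ""` prints the
    requested taxid followed by all taxids below it.
    """
    result = subprocess.run(
        [taxonkit, "list", "--ids", str(taxid), "--indent", ""],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if taxonkit exits with an error
    )
    return parse_taxonkit_list(result.stdout)


# Example call (needs taxonkit and its data files to be installed):
# enterococcus_taxids = list_subtree(1350)
```

Passing the command as a list avoids going through a shell, which sidesteps some quoting differences between platforms, but the binary itself still has to be available on each of them.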

mr-eyes commented 2 years ago

This might help! I am getting accessions for the provided organism name/taxon or the closest one in the lineage.

https://gist.github.com/mr-eyes/92d6172c7a5c7d5bd35fcff6f765d48d

zachary-foster commented 1 year ago

Hello All,

Saw this issue on Slack. Not sure if this is helpful or not, but I have done this with the Entrez E-utilities command line tools. I imagine the same functionality exists in the rentrez package for R? Here is the script we used. For our purpose, it was in two steps: 1) get a table (TSV) of info for all genomes available for each taxon, 2) download a subset of them. This is from a nextflow module, so the bash might look a bit odd, but it should give you the idea.

1) Make a TSV with info for all genomes for a given taxon:

esearch -db taxonomy -query "${taxon} OR ${taxon}[subtree]" | \\
        elink -target assembly | \\
        efilter -query "latest[PROP] AND full-genome-representation[PROP] AND has-annotation[PROP] NOT excluded-from-refseq[PROP]" | \\
        efetch -format docsum | \\
        xtract -pattern DocumentSummary -def 'NA' -element \$COLS >> \\
        ${prefix}.tsv

2) Download the genome for each row in the TSV:

datasets download genome accession $id --include gff3,rna,cds,protein,genome,seq-report --filename ${prefix}.zip
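
Outside of nextflow, step 2 could be driven by a small script that turns the table from step 1 into one download command per row. The sketch below is in Python and only builds the command strings. It assumes the assembly accession is in the first column (the actual columns depend on \$COLS, which is not shown here), and the function name, the per-accession file naming, and the sample accession are made up for illustration.

```python
import csv
import io
import shlex

# File types requested in the download command above.
INCLUDE = "gff3,rna,cds,protein,genome,seq-report"


def download_commands(tsv_text, prefix="genomes", accession_column=0, limit=None):
    """Build one `datasets download genome accession` command per TSV row.

    Assumption: the accession is in column `accession_column`; rows whose
    accession is empty or 'NA' (the xtract default used above) are skipped.
    """
    commands = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) <= accession_column:
            continue
        accession = row[accession_column].strip()
        if not accession or accession == "NA":
            continue
        commands.append(shlex.join([
            "datasets", "download", "genome", "accession", accession,
            "--include", INCLUDE,
            "--filename", f"{prefix}_{accession}.zip",
        ]))
        if limit is not None and len(commands) >= limit:
            break
    return commands


# Made-up two-row table: one usable row and one row without an accession.
sample = "GCF_000001.1\tEnterococcus faecalis\nNA\tunknown\n"
for command in download_commands(sample, prefix="ent"):
    print(command)
```

The `limit` argument is one simple way to download only a subset; any other filter on the rows would work the same way.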

salix-d commented 1 year ago

The taxize package should be able to get the children, but since you mentioned:

Since biomartr is solely relying on internal data provided by NCBI

Do you get the taxdump? If so, with names.dmp and nodes.dmp you can build an SQLite database that makes it easy to get the parents/children of an accession/taxid. The taxonomizr package does it. It might also be how taxize does it, I don't remember.
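
As a rough illustration of the SQLite idea, here is a minimal Python sketch that loads nodes.dmp-style lines into a table and uses a recursive query to collect everything below a taxid. It is not how taxonomizr or taxize are implemented. The table layout and function names are assumptions, only the first three fields of each line are used, and the sample lines (including the placeholder taxid 9000001) are a hand-written stand-in for the real file.

```python
import sqlite3

# Tiny stand-in for nodes.dmp: fields are separated by "\t|\t" and each
# line ends with "\t|". Only tax_id, parent tax_id and rank are kept here.
SAMPLE_NODES = (
    "1350\t|\t81852\t|\tgenus\t|\n"
    "1351\t|\t1350\t|\tspecies\t|\n"
    "1352\t|\t1350\t|\tspecies\t|\n"
    "9000001\t|\t10239\t|\tspecies\t|\n"  # placeholder phage under Viruses
)


def load_nodes(conn, lines):
    """Create a `nodes` table and fill it from nodes.dmp-style lines."""
    conn.execute(
        "CREATE TABLE nodes ("
        "tax_id INTEGER PRIMARY KEY, parent_id INTEGER, node_rank TEXT)"
    )
    rows = []
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split("|")]
        if len(fields) >= 3 and fields[0]:
            rows.append((int(fields[0]), int(fields[1]), fields[2]))
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_nodes_parent ON nodes(parent_id)")


def descendants(conn, taxid):
    """Return all taxids below `taxid`, sorted, via a recursive CTE."""
    query = """
        WITH RECURSIVE subtree(tax_id) AS (
            SELECT tax_id FROM nodes WHERE parent_id = ?
            UNION  -- UNION (not UNION ALL) drops repeats, so a self-parented root cannot loop
            SELECT n.tax_id FROM nodes AS n
            JOIN subtree AS s ON n.parent_id = s.tax_id
        )
        SELECT tax_id FROM subtree ORDER BY tax_id
    """
    return [row[0] for row in conn.execute(query, (taxid,))]


conn = sqlite3.connect(":memory:")
load_nodes(conn, SAMPLE_NODES.splitlines())
print(descendants(conn, 1350))
```

With the real file, `load_nodes(conn, open("nodes.dmp"))` should work the same way, and an on-disk database path instead of ":memory:" would avoid reloading the dump on every run.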