ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Is there a way to get both genome and taxonomy information? #315

Closed: paulimer closed this issue 7 months ago

paulimer commented 7 months ago

Hi all, thanks for developing this great tool! I have a question about getting both taxonomic information and genomes. Using the CLI, I managed to download all genomes for my taxon of interest (family level). I looked at the information available in assembly_data_report.jsonl, but I couldn't find what I was looking for: is it possible to get the classification alongside the genomes? For instance, I downloaded all the genomes for a specific family, but I am also interested in the genus classification.

Best, Paul

fgvieira commented 7 months ago

Yes, this would definitely be quite useful!

ericcox1 commented 7 months ago

Hi @paulimer,

Thanks for opening this issue.

> Is it possible to get the classification alongside the genomes?

We are currently working on providing exactly this information through the datasets command-line tool and we hope to release this to the public near the end of March.

In the meantime, you can get this information from the rankedlineage.dmp file, which is part of the new_taxdump.zip archive on the NCBI FTP site. For more details about the contents of this file, see the README in the same FTP directory.

Here's what I would recommend:

  1. Get a list of TaxIDs for the set of assembled genomes for the family Gruidae

    datasets summary genome taxon gruidae --as-json-lines | dataformat tsv genome --fields organism-tax-id --elide-header | sort -u > tax-ids.list

  2. For each TaxID in the list, get the genus from rankedlineage.dmp and generate a table of two columns, where the first column is the TaxID and the second column is the genus

    while read taxid; do awk -v taxid="$taxid" -F '|' '($1==taxid){print $1,$4}' rankedlineage.dmp; done < tax-ids.list
    100784      Balearica
    2717088     Antigone
    30415       Grus
    40817       Grus
    40818       Grus
    9117        Grus
    925459      Balearica
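
The loop above rescans rankedlineage.dmp once per TaxID, which may get slow for a large family. A possible single-pass alternative is sketched below. It assumes the fields in rankedlineage.dmp are separated by tab, pipe, tab (as described in the taxdump README) and that genus is the fourth field, as in the command above. The function name `genus_for_taxids` is a hypothetical helper for illustration, not part of the datasets CLI.

```shell
# Hypothetical helper (not part of the datasets CLI).
# Prints "TaxID<TAB>genus" for every TaxID listed in the first file.
#   $1 = file with one TaxID per line (e.g. tax-ids.list)
#   $2 = path to rankedlineage.dmp
genus_for_taxids() {
    # Assumption: fields are separated by <TAB>|<TAB>, so splitting on that
    # whole sequence leaves no stray tabs around the values.
    awk -F '\t[|]\t' '
        NR == FNR    { wanted[$1] = 1; next }  # first file: remember each TaxID
        $1 in wanted { print $1 "\t" $4 }      # second file: TaxID and genus
    ' "$1" "$2"
}

# Example call, once both files exist:
#   genus_for_taxids tax-ids.list rankedlineage.dmp > taxid-genus.tsv
```

Rows come out in the order they appear in rankedlineage.dmp rather than in the order of tax-ids.list, and a TaxID with no genus assigned would produce an empty second column.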

paulimer commented 7 months ago

Thanks a lot, Eric! I wonder how I missed rankedlineage.dmp; thanks for pointing it out. Great to see this will make it into the tool, and in the meantime your solution looks perfect.