Closed paulimer closed 7 months ago
Yes, this would definitely be quite useful!
Hi @paulimer,
Thanks for opening this issue.
> Is it possible to get the classification alongside the genomes?
We are currently working on providing exactly this information through the datasets command-line tool and we hope to release this to the public near the end of March.
In the meantime, you can get this information from the rankedlineage.dmp file, which is part of the new_taxdump.zip archive available on FTP. For more details about the contents of this file, see the README in the same FTP directory.
Here's what I would recommend:
Get a list of TaxIDs for the set of assembled genomes for the family Gruidae:
datasets summary genome taxon gruidae --as-json-lines | dataformat tsv genome --fields organism-tax-id --elide-header | sort -u > tax-ids.list
For each TaxID in the list, get the genus from rankedlineage.dmp and generate a two-column table, where the first column is the TaxID and the second column is the genus:
while read taxid; do awk -v taxid="$taxid" -F '|' '($1==taxid){print $1,$4}' rankedlineage.dmp; done < tax-ids.list
100784 Balearica
2717088 Antigone
30415 Grus
40817 Grus
40818 Grus
9117 Grus
925459 Balearica
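The loop above rescans rankedlineage.dmp once for every TaxID, which may get slow for longer lists. A possible single-pass variant is sketched below; it is not part of the original recommendation. It assumes the new_taxdump layout in which fields are separated by a tab, a pipe, and a tab, with the TaxID in column 1 and the genus in column 4 (the same columns the command above prints). The function name genus_table is made up for illustration.

```shell
#!/bin/sh
# Sketch: look up the genus for many TaxIDs in one pass over rankedlineage.dmp.
# Assumption: fields are separated by "<tab>|<tab>", TaxID is column 1, genus is column 4.
# genus_table is a hypothetical helper name, not part of any NCBI tool.
genus_table() {
    # $1: file with one TaxID per line
    # $2: path to rankedlineage.dmp
    awk -F '\t[|]\t' '
        NR == FNR { wanted[$1]; next }        # first file: remember each TaxID
        ($1 in wanted) { print $1 "\t" $4 }   # second file: emit TaxID and genus
    ' "$1" "$2"
}
```

It could be called as `genus_table tax-ids.list rankedlineage.dmp`. The output is tab-separated and follows the row order of rankedlineage.dmp rather than the order of the TaxID list.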
Thanks a lot, Eric! I wonder how I missed rankedlineage.dmp; thanks for pointing it out. Great to see this will make it into the tool, and in the meantime your solution looks perfect.
Hi all,
Thanks for the development of this great tool! I have a question about getting both taxonomic information and genomes. Using the CLI, I managed to download all genomes for my taxon of interest (family level). I looked at the information available in assembly_data_report.jsonl, but I wasn't able to find what I am looking for: is it possible to get the classification alongside the genomes? For instance, I downloaded all the genomes for a specific family, but I am also interested in the genus classification.
The files I found (nodes.dmp) contain taxonomic IDs and their respective parents, meaning that in my case I would need a rather impractical recursive hierarchical join (because the taxonomic IDs in the data report are not all at the same level: some are strains, others subspecies, others species...).
Best,
Paul