shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Feature request: accession2taxid #74

Closed by stas-malavin 1 year ago

stas-malavin commented 1 year ago

Hi, I'd love to be able to assign taxids to accession numbers locally using NCBI's nucl_gb.accession2taxid.gz, nucl_wgs.accession2taxid.gz, and prot.accession2taxid.gz files. The R package taxonomizr already makes this possible.

shenwei356 commented 1 year ago

csvtk is enough for mapping accessions to taxids.

Retrieving the accession2taxid records for the accessions of interest (listed in acc.txt).

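# Keep only rows whose accession.version is listed in acc.txt,
# then write accession.version and taxid as a headerless two-column TSV.
cat prot.accession2taxid.gz \
    | csvtk grep -t -f accession.version -P acc.txt \
    | csvtk cut -t -f accession.version,taxid \
    | csvtk del-header -t \
    > prot.accession2taxid.tsv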
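
For readers without csvtk at hand, the same filtering step can be approximated with standard tools. This is only a sketch and not the approach suggested in this thread. It assumes the four-column layout of NCBI's accession2taxid files (accession, accession.version, taxid, gi), and it uses a tiny made-up mapping file with invented accessions in place of the real download.

```shell
# Toy stand-in for prot.accession2taxid.gz (made-up accessions, illustration only).
workdir=$(mktemp -d)
printf 'accession\taccession.version\ttaxid\tgi\n'  > "$workdir/toy.accession2taxid"
printf 'AAA00001\tAAA00001.1\t9606\t1\n'           >> "$workdir/toy.accession2taxid"
printf 'AAA00002\tAAA00002.1\t562\t2\n'            >> "$workdir/toy.accession2taxid"
printf 'AAA00003\tAAA00003.2\t10090\t3\n'          >> "$workdir/toy.accession2taxid"
gzip "$workdir/toy.accession2taxid"

# Accessions of interest, one accession.version per line (plays the role of acc.txt).
printf 'AAA00003.2\nAAA00001.1\n' > "$workdir/acc.txt"

# Pass 1 (acc.txt): remember the wanted accessions.
# Pass 2 (stdin): skip the header, then print accession.version and taxid
# for every row whose second column is one of the wanted accessions.
filtered=$(gzip -dc "$workdir/toy.accession2taxid.gz" \
    | awk -F'\t' -v OFS='\t' '
        NR == FNR { want[$1]; next }
        FNR > 1 && ($2 in want) { print $2, $3 }
    ' "$workdir/acc.txt" -)
printf '%s\n' "$filtered"
```

The result follows the order of the mapping file, not of acc.txt, and matching is exact on the accession.version column.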

Querying taxid for each accession.

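# Copy column 1 (the accession) into a new last column, then replace that copy
# with the taxid looked up in prot.accession2taxid.tsv.
# As written, -f 2 assumes data.tsv has a single, headerless accession column.
cat data.tsv \
    | csvtk mutate -H -t -f 1 \
    | csvtk replace -H -t -f 2 -k prot.accession2taxid.tsv -p '(.+)' -r '{kv}'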
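
The lookup step amounts to a key-value join on the accession. As a rough illustration of that idea, again with standard tools and not csvtk itself, the sketch below appends a taxid column to a headerless table. All file contents are made up, and writing "NA" for accessions missing from the lookup table is a choice made for this example only.

```shell
workdir=$(mktemp -d)

# Two-column lookup table: accession.version <TAB> taxid (made-up values).
printf 'AAA00001.1\t9606\nAAA00003.2\t10090\n' > "$workdir/map.tsv"

# Headerless data table with the accession in column 1 (made-up values).
printf 'AAA00003.2\thit_a\nAAA00009.1\thit_b\nAAA00001.1\thit_c\n' > "$workdir/data.tsv"

# Pass 1 (map.tsv): load accession -> taxid into an array.
# Pass 2 (data.tsv): append the taxid as a new last column,
# or "NA" when the accession has no entry in the lookup table.
annotated=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { taxid[$1] = $2; next }
    { print $0, (($1 in taxid) ? taxid[$1] : "NA") }
' "$workdir/map.tsv" "$workdir/data.tsv")
printf '%s\n' "$annotated"
```

Row order of the data table is preserved, so the output can be pasted back alongside other per-row results.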

stas-malavin commented 1 year ago

That's beautiful, thank you! I really need to master csvtk!