shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Can I filter taxids using a set of taxids? #44

Closed by Lix1993 3 years ago

Lix1993 commented 3 years ago

I have a gene2taxid file, like this:

qseqid  staxids
TRINITY_DN18_c0_g2      6973
TRINITY_DN18_c0_g3      6973
TRINITY_DN18_c0_g1      6973
TRINITY_DN18_c1_g1      189913
TRINITY_DN59_c0_g1      4577
TRINITY_DN79_c0_g1      4577
TRINITY_DN17_c0_g1      4577
TRINITY_DN46_c0_g1      81932
TRINITY_DN46_c2_g1      2020875

and a file of microbe TaxIds, like this:

2       Bacteria
2157    Archaea
10239   Viruses
33630   Alveolata
554915  Amoebozoa
5794    Apicomplexa
554296  Apusozoa
1401294 Breviatea
193537  Centroheliozoa
3041    Chlorophyta
28009   Choanoflagellida
190322  Collodictyonidae
3027    Cryptophyta
5758    Entamoeba
33682   Euglenozoa
207245  Fornicata
4751    Fungi

Can I get the genes that belong to microbes using taxonkit?

shenwei356 commented 3 years ago

Yes, it's very easy.

  1. Get the TaxIds of all descendants of the given microbe TaxIds:

    $ taxonkit list --ids $(cut -f 1 microbe-taxids.tsv | paste -sd ,) \
        --indent "" \
        -o microbe-taxids.extended.tsv
    
    $ wc -l microbe-taxids.extended.tsv
    955966 microbe-taxids.extended.tsv
    
    $ head -n 5 microbe-taxids.extended.tsv
    2
    1224
    1236
    33811
    2707

  2. Filter the query sequences by TaxId, with the help of csvtk:

    $ csvtk grep -t -f staxids -P microbe-taxids.extended.tsv gene2taxid.tsv
    qseqid  staxids
    TRINITY_DN46_c0_g1      81932
    TRINITY_DN46_c2_g1      2020875
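
If csvtk is not at hand, the filtering in step 2 can probably be approximated with plain awk. This is only a minimal sketch, not the method used above: the two `printf` lines create tiny stand-in files built from rows shown in this thread (the real `microbe-taxids.extended.tsv` comes from `taxonkit list`), and the output name `microbe-genes.tsv` is hypothetical. It assumes tab-separated input with exactly one TaxId per row in the second column.

```shell
# Stand-in inputs for illustration only (assumption: one TaxId per row).
printf 'qseqid\tstaxids\nTRINITY_DN18_c0_g2\t6973\nTRINITY_DN46_c0_g1\t81932\nTRINITY_DN46_c2_g1\t2020875\n' > gene2taxid.tsv
printf '2\n1224\n81932\n2020875\n' > microbe-taxids.extended.tsv

# While reading the first file, store every TaxId as a key in "keep".
# While reading the second file, print the header line and any row
# whose second column is one of the stored TaxIds.
awk -F '\t' 'NR == FNR { keep[$1]; next } FNR == 1 || ($2 in keep)' \
    microbe-taxids.extended.tsv gene2taxid.tsv > microbe-genes.tsv

cat microbe-genes.tsv
```

With these stand-in files, the result keeps the header and the two rows with TaxIds 81932 and 2020875, and drops the row with TaxId 6973. The `NR == FNR` test relies on the TaxId list being non-empty.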

Lix1993 commented 3 years ago

thanks