shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License
381 stars 30 forks source link

taxonkit equivalent of diamond blastx plus megan blast2lca #86

Open avilella opened 1 year ago

avilella commented 1 year ago

Hi,

I've followed the methods section in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1299-7

To generate diamond blastx hits from short-read WGS data, and I wanted an equivalent to megan/tools/blast2lca which it seems taxonkit could be that equivalent:

diamond blastx -d/path/to/NCBI_nr/nr -q sample_name.fasta -a sample_name -p 16

MEGAN (v5.10.6) (obtained as described above) was used for read-level taxonomic classification in non-interactive mode:

megan/tools/blast2lca --input sample_name.daa --format BlastTAB --topPercent 10 --gi2taxa megan/GI_Tax_mapping/gi_taxid-March2015X.bin --output sample_name.read_assignments.txt

I have a sample_name.daa, and I wonder if there is a "preferred way" to end up with a list of lca's and the percentage of reads for it.

At the moment, I am doing a table join of the protein hits from diamond view with prot.accession2taxid.gz, and then I guess I can run taxonkit lca on that list of taxids? thx in advance

diamond view --daa AB2_1_A03_002_S5.daa | head -n 100000 > tmp
csvtk join -t -f "2;1" tmp /bfx_nfs/ncbi/nr/prot.accession2taxid.gz > tmp.tax 
taxonkit lca tmp.tax # ?
shenwei356 commented 1 year ago

Joining with prot.accession2taxid.gz is not a good idea, cause it's so big. Searching it with the 2nd column of tmp first could be a better way.

taxonkit lca can take input of taxids in each line (example).

avilella commented 1 year ago

I am running this on a machine with 376G of RAM, if there was a guarantee that the join wasn't going to go over it, given the usual size of prot.accession2taxid.gz, that would be fine by me. It seems to be relatively stable at 226G RAM usage on the table join above...

I couldn't make the diamond makedb below work for the nr database as a way of adding the taxids in the diamond blastx output, so I am stuck with finding an efficient way of getting tax ids from the output which look like the entries below:

diamond --version
diamond version 2.1.8

diamond makedb --in nr.gz -d diamondDB/ --taxonmap prot.accession2taxid.gz --taxonnodes taxdmp.zip
VH01324:45:AAC73CJHV:1:1101:20238:1000  KRX29278        79.3    29      6       0       92      6       1       29      1.28e-06        50.4
VH01324:45:AAC73CJHV:1:1101:20655:1000  SFW06984        87.8    49      6       0       149     3       13      61      2.08e-20        90.5
VH01324:45:AAC73CJHV:1:1101:20655:1000  EAA20390        83.7    49      8       0       150     4       41      89      1.13e-16        81.3
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL29934        85.7    42      6       0       151     26      21      62      1.46e-17        79.0
VH01324:45:AAC73CJHV:1:1101:25503:1000  KRY62511        77.8    45      10      0       151     17      5       49      3.49e-17        77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL25189        88.4    43      3       1       151     29      34      76      7.52e-17        77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000  XP_006517017    78.7    47      10      0       148     8       178     224     3.12e-16        79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000  XP_006517014    78.7    47      10      0       148     8       286     332     1.04e-15        79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL34859        83.7    43      7       0       151     23      64      106     3.22e-15        75.1
VH01324:45:AAC73CJHV:1:1101:25503:1000  EEQ62534        83.7    43      7       0       151     23      30      72      4.56e-15        72.8

any ideas would be welcomed.

shenwei356 commented 1 year ago
# extract taxids of hits
zcat prot.accession2taxid.gz \
    | csvtk grep -Ht -f 1 -P <(cut -f 2 data.tsv) \
    | cut -f 1,3 \
    > hit2taxid.tsv

# join and collect taxids for each query, and then  compute LCA
csvtk join -Ht -f '2;1' <(cut -f 1,2 data.tsv) hit2taxid.tsv \
    | csvtk fold -Ht -f 1 -v 3 -s ';' \
    | taxonkit lca -i 2 -s ';' \
    > query2lca.tsv