Open avilella opened 1 year ago
Joining with prot.accession2taxid.gz
is not a good idea, cause it's so big. Searching it with the 2nd column of tmp
first could be a better way.
taxonkit lca
can take input of taxids in each line (example).
I am running this on a machine with 376G of RAM, if there was a guarantee that the join wasn't going to go over it, given the usual size of prot.accession2taxid.gz
, that would be fine by me. It seems to be relatively stable at 226G RAM usage on the table join above...
I couldn't make the diamond makedb
below work for the nr database as a way of adding the taxids in the diamond blastx
output, so I am stuck with finding an efficient way of getting tax ids from the output which look like the entries below:
diamond --version
diamond version 2.1.8
diamond makedb --in nr.gz -d diamondDB/ --taxonmap prot.accession2taxid.gz --taxonnodes taxdmp.zip
VH01324:45:AAC73CJHV:1:1101:20238:1000 KRX29278 79.3 29 6 0 92 6 1 29 1.28e-06 50.4
VH01324:45:AAC73CJHV:1:1101:20655:1000 SFW06984 87.8 49 6 0 149 3 13 61 2.08e-20 90.5
VH01324:45:AAC73CJHV:1:1101:20655:1000 EAA20390 83.7 49 8 0 150 4 41 89 1.13e-16 81.3
VH01324:45:AAC73CJHV:1:1101:25503:1000 EDL29934 85.7 42 6 0 151 26 21 62 1.46e-17 79.0
VH01324:45:AAC73CJHV:1:1101:25503:1000 KRY62511 77.8 45 10 0 151 17 5 49 3.49e-17 77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000 EDL25189 88.4 43 3 1 151 29 34 76 7.52e-17 77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000 XP_006517017 78.7 47 10 0 148 8 178 224 3.12e-16 79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000 XP_006517014 78.7 47 10 0 148 8 286 332 1.04e-15 79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000 EDL34859 83.7 43 7 0 151 23 64 106 3.22e-15 75.1
VH01324:45:AAC73CJHV:1:1101:25503:1000 EEQ62534 83.7 43 7 0 151 23 30 72 4.56e-15 72.8
any ideas would be welcomed.
# extract taxids of hits
zcat prot.accession2taxid.gz \
| csvtk grep -Ht -f 1 -P <(cut -f 2 data.tsv) \
| cut -f 1,3 \
> hit2taxid.tsv
# join and collect taxids for each query, and then compute LCA
csvtk join -Ht -f '2;1' <(cut -f 1,2 data.tsv) hit2taxid.tsv \
| csvtk fold -Ht -f 1 -v 3 -s ';' \
| taxonkit lca -i 2 -s ';' \
> query2lca.tsv
Hi,
I've followed the methods section in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1299-7
To generate
diamond blastx
hits from short-read WGS data, and I wanted an equivalent to megan/tools/blast2lca which it seems taxonkit could be that equivalent:I have a
sample_name.daa
, and I wonder if there is a "preferred way" to end up with a list of lca's and the percentage of reads for it.At the moment, I am doing a table join of the protein hits from
diamond view
with prot.accession2taxid.gz, and then I guess I can run taxonkit lca on that list of taxids? thx in advance