Open AstrobioMike opened 1 year ago
Hi Mike, thanks for your interest in KMCP.
You can try splitting CDS/contigs when using kmcp search
.
in=e.coli.cds.fasta
# search against GTDB part1
seqkit sliding -g -s 100 -W 300 $in \
| seqkit seq -m 100 \
| kmcp search -d gtdb.part_1.kmcp/ -o $in.kmcp@gtdb.part_1.tsv.gz
# search against GTDB part2
seqkit sliding -g -s 100 -W 300 $in \
| seqkit seq -m 100 \
| kmcp search -d gtdb.part_2.kmcp/ -o $in.kmcp@gtdb.part_2.tsv.gz
# search against other databases
# merge search results
kmcp merge $in.kmcp@*.tsv.gz -o $in.kmcp.tsv.gz
# profiling and output binning/classification results
kmcp profile -X taxdump/ -T taxid.map -m 5 \
$in.kmcp.tsv.gz -o $in.profile -B $in.binning.gz
# ------------------------------
# collect taxids for each orginal query.
zcat $in.binning.gz | grep -v '@' | grep -v '#' \
| csvtk replace -Ht -p '_sliding:\d+-\d+$' \
| csvtk fold -Ht -f 1 -v 2 -s , \
> $in.id2taxids.tsv
# compute LCA
taxonkit lca --data-dir taxdump/ -i 2 -D -s , $in.id2taxids.tsv \
| cut -f 1,3 \
> $in.id2taxids.lca.tsv
# retrieve lineages and count
taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
| csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia; 2602
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 1061
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia fergusonii 96
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. E1V33 71
stats
$ seqkit stats e.coli.cds.fasta
file format type num_seqs sum_len min_len avg_len max_len
e.coli.cds.fasta FASTA DNA 4,315 4,025,874 27 933 7,077
$ wc -l $in.id2taxids.tsv
3830 e.coli.cds.fasta.id2taxids.tsv
Another example of Klebsiella pneumoniae CDS:
taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
| csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae 4723
file format type num_seqs sum_len min_len avg_len max_len
kp.fasta FASTA DNA 5,316 4,696,377 114 883.4 9,492
4723 kp.fasta.id2taxids.tsv
Awesome, thanks for plotting out a clear path! I looks forward to trying it out :)
Hey there, @shenwei356 :)
Thanks again for not only additional wonderful software, but excellent documentation as usual!
I'm looking for a better way to taxonomically classify predicted coding sequences and contigs from metagenomic assemblies (i currently use CAT with NCBI's nr).
I really want a combination of GTDB for bacteria/archaea, and then also be able to combine euks from NCBI, so your infrastructure enabling that sort of thing is really appealing to me š
I see in issue https://github.com/shenwei356/kmcp/issues/27 you note that KMCP is not suitable for long reads. Is your thinking similar for assembled contigs too?
And would you expect to have the same thoughts about taxonomically classifying predicted coding sequences (which might average around 800-1000 bases)?
If only one of those would be possible, I can imagine it might be reasonable to use it to infer the other. E.g., if contigs are do-able, then assigning all CDSs whatever their source contig tax was. And if CDSs are do-able, employing some consensus approach to assign to the contig the tax of its CDSs.
Sorry if i'm missing if you've covered this elsewhere already, and thanks for any of your thoughts!