shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License
176 stars 13 forks source link

suitable for CDS and/or contig taxonomic assignment? #28

Open AstrobioMike opened 1 year ago

AstrobioMike commented 1 year ago

Hey there, @shenwei356 :)

Thanks again for not only additional wonderful software, but excellent documentation as usual!

I'm looking for a better way to taxonomically classify predicted coding sequences and contigs from metagenomic assemblies (i currently use CAT with NCBI's nr).

I really want a combination of GTDB for bacteria/archaea, and then also be able to combine euks from NCBI, so your infrastructure enabling that sort of thing is really appealing to me šŸ™

I see in issue https://github.com/shenwei356/kmcp/issues/27 you note that KMCP is not suitable for long reads. Is your thinking similar for assembled contigs too?

And would you expect to have the same thoughts about taxonomically classifying predicted coding sequences (which might average around 800-1000 bases)?

If only one of those would be possible, I can imagine it might be reasonable to use it to infer the other. E.g., if contigs are do-able, then assigning all CDSs whatever their source contig tax was. And if CDSs are do-able, employing some consensus approach to assign to the contig the tax of its CDSs.

Sorry if i'm missing if you've covered this elsewhere already, and thanks for any of your thoughts!

shenwei356 commented 1 year ago

Hi Mike, thanks for your interest in KMCP.

You can try splitting CDS/contigs when using kmcp search.

in=e.coli.cds.fasta

# search against GTDB part1
seqkit sliding -g -s 100 -W 300 $in \
    | seqkit seq -m 100 \
    | kmcp search -d gtdb.part_1.kmcp/ -o $in.kmcp@gtdb.part_1.tsv.gz

# search against GTDB part2
seqkit sliding -g -s 100 -W 300 $in \
    | seqkit seq -m 100 \
    | kmcp search -d gtdb.part_2.kmcp/ -o $in.kmcp@gtdb.part_2.tsv.gz

# search against other databases

# merge search results
kmcp merge $in.kmcp@*.tsv.gz -o $in.kmcp.tsv.gz

# profiling and output binning/classification results
kmcp profile -X taxdump/ -T taxid.map -m 5 \
    $in.kmcp.tsv.gz -o $in.profile -B $in.binning.gz

# ------------------------------

# collect taxids for each orginal query.
zcat $in.binning.gz | grep -v '@' | grep -v '#' \
    | csvtk replace -Ht -p '_sliding:\d+-\d+$' \
    | csvtk fold -Ht -f 1 -v 2 -s , \
    > $in.id2taxids.tsv

# compute LCA
taxonkit lca --data-dir taxdump/ -i 2 -D -s , $in.id2taxids.tsv \
    | cut -f 1,3 \
    > $in.id2taxids.lca.tsv

# retrieve lineages and count
taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
    | csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht

Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;                         2602
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli         1061
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia fergusonii   96
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. E1V33    71

stats

$ seqkit  stats e.coli.cds.fasta
file              format  type  num_seqs    sum_len  min_len  avg_len  max_len
e.coli.cds.fasta  FASTA   DNA      4,315  4,025,874       27      933    7,077

$ wc -l $in.id2taxids.tsv
3830 e.coli.cds.fasta.id2taxids.tsv

Another example of Klebsiella pneumoniae CDS:

taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
    | csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae   4723

file      format  type  num_seqs    sum_len  min_len  avg_len  max_len
kp.fasta  FASTA   DNA      5,316  4,696,377      114    883.4    9,492

4723 kp.fasta.id2taxids.tsv
AstrobioMike commented 1 year ago

Awesome, thanks for plotting out a clear path! I looks forward to trying it out :)