shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

KMCP's MetaPhlAn output doesn't follow the MetaPhlAn file format #34

Closed: apcamargo closed this issue 9 months ago

apcamargo commented 1 year ago

The MetaPhlAn output generated by KMCP is not the same as the one generated by MetaPhlAn. In the KMCP output, the taxid column only contains the taxid of the lowest taxonomic rank (e.g. 1224), while the one generated by MetaPhlAn contains the full lineage, separated by | (e.g. 2|1224).
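To make the difference between the two taxid columns concrete, here is a minimal Python sketch. It assumes a hypothetical name-to-taxid lookup table (`NAME_TO_TAXID`) containing only the two taxa from the example above; it is not how KMCP or MetaPhlAn builds the column.

```python
# Hypothetical lookup table for illustration only; a real tool would
# resolve taxids from a taxonomy database instead.
NAME_TO_TAXID = {"k__Bacteria": "2", "p__Pseudomonadota": "1224"}


def leaf_taxid(clade_name, name_to_taxid=NAME_TO_TAXID):
    """Taxid of the lowest rank only (what the issue reports for KMCP)."""
    return name_to_taxid[clade_name.split("|")[-1]]


def lineage_taxids(clade_name, name_to_taxid=NAME_TO_TAXID):
    """Taxids of every rank in the lineage, joined by '|' (MetaPhlAn style)."""
    return "|".join(name_to_taxid[part] for part in clade_name.split("|"))


clade = "k__Bacteria|p__Pseudomonadota"
print(leaf_taxid(clade))      # 1224
print(lineage_taxids(clade))  # 2|1224
```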

This makes the KMCP output incompatible with TAXPASTA. Actually, the TAXPASTA error occurs because the abundances within a rank do not sum to 100% (some genomes' lineages skip ranks).

shenwei356 commented 1 year ago

Thanks for reporting this. I did not notice this for such a long time ...

Actually, the TAXPASTA error occurs because the abundances within a rank do not sum to 100% (some genomes' lineages skip ranks).

Skipping ranks should not cause that. Can you attach a file?

Another option is to use taxonkit cami-filter (without setting -t, --taxids) to recompute the abundances in CAMI format, which is one of the input formats of taxpasta.

apcamargo commented 1 year ago

Sure! TAXPASTA failed at the step where it checks the composition, specifically the part that checks whether all taxa within a given rank sum to 100%. I summed the abundances manually and saw that some ranks had summed abundances lower than that.

ERR7569999.metaphlan.txt ERR7569998.metaphlan.txt ERR7569997.metaphlan.txt

shenwei356 commented 1 year ago

I see. Some reference genomes' lineages do not have all 7 ranks, which is quite normal, I think. Maybe ask taxpasta to support this?

# For each rank, sum column 3 (abundance) over the entries whose last rank is $r
for r in k p c o f g s; do
    echo -n "$r "
    cat ERR7569997.metaphlan.txt \
        | csvtk grep -H -r -p "${r}__[^\|]+$" \
        | csvtk summary -Ht -f 3:sum
done

k 100.00
p 85.88
c 68.43
o 55.63
f 41.11
g 19.43
s 100.00
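The same per-rank check can be sketched in plain Python. This is a minimal illustration, assuming tab-separated lines with the clade name in column 1 and the abundance in column 3 (as in the csvtk command above). The toy profile and its taxon names are made up, not taken from the attached files.

```python
from collections import defaultdict


def rank_sums(lines):
    """Sum abundances (column 3) per rank of the last lineage element."""
    sums = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        last = fields[0].split("|")[-1]   # e.g. "s__S1"
        rank = last.split("__", 1)[0]     # e.g. "s"
        sums[rank] += float(fields[2])
    return dict(sums)


# Hypothetical toy profile: species S2 (40%) has no genus in its lineage.
toy = [
    "k__K\t1\t100.0",
    "k__K|p__P1\t2\t60.0",
    "k__K|p__P2\t3\t40.0",
    "k__K|p__P1|g__G1\t4\t60.0",
    "k__K|p__P1|g__G1|s__S1\t5\t60.0",
    "k__K|p__P2|s__S2\t6\t40.0",
]
print(rank_sums(toy))  # {'k': 100.0, 'p': 100.0, 'g': 60.0, 's': 100.0}
```

In the toy profile the genus rank sums to only 60% because the 40% assigned to S2 has no genus-level entry, which mirrors the pattern in the output above.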

k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422    1947935 0.038769    
k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|s__Burkholderiales bacterium 1891238 0.038604
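To see which ranks such lineages skip, a small Python sketch follows. It assumes the seven one-letter rank prefixes used in the loop above; the helper name `skipped_ranks` is made up for this illustration.

```python
RANK_ORDER = ["k", "p", "c", "o", "f", "g", "s"]


def skipped_ranks(clade_name):
    """Return the ranks missing above the deepest rank present in a lineage."""
    prefixes = (part.split("__", 1)[0] for part in clade_name.split("|"))
    present = {p for p in prefixes if p in RANK_ORDER}
    deepest = max(RANK_ORDER.index(p) for p in present)
    return [r for r in RANK_ORDER[:deepest] if r not in present]


print(skipped_ranks("k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422"))
# ['c', 'o', 'f', 'g']
print(skipped_ranks(
    "k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria"
    "|o__Burkholderiales|s__Burkholderiales bacterium"
))
# ['f', 'g']
```

Each skipped rank is one where that species' abundance is absent from the rank's total, so the total falls below 100%.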