Closed by apcamargo 1 year ago
Thanks for reporting this. I did not notice this for such a long time ...
Actually, the TAXPASTA error is due to the abundances within some ranks not summing to 100% (because some genomes' lineages skip ranks).
Skipping ranks should not cause that. Can you attach a file?
Another way is to use taxonkit cami-filter (without setting -t, --taxids) to recompute the abundance in CAMI format, which is one of the input formats of taxpasta.
Sure! TAXPASTA failed at the step where it checks the composition, specifically where it checks whether all taxa within a given rank sum to 100%. I summed the abundances manually and saw that some ranks had total abundances below that.
ERR7569999.metaphlan.txt ERR7569998.metaphlan.txt ERR7569997.metaphlan.txt
I see. Some reference genomes' lineages do not have all 7 ranks, which is quite normal, I think. Maybe ask taxpasta to support this?
for r in k p c o f g s; do \
echo -n "$r ";
cat ERR7569997.metaphlan.txt \
| csvtk grep -H -r -p "${r}__[^\|]+$" \
| csvtk summary -Ht -f 3:sum; \
done
k 100.00
p 85.88
c 68.43
o 55.63
f 41.11
g 19.43
s 100.00
k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422 1947935 0.038769
k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|s__Burkholderiales bacterium 1891238 0.038604
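The two lineages above jump from phylum or order straight to species, which is why the intermediate ranks sum to less than 100%. As a rough sketch (not part of KMCP, taxpasta, or csvtk), a small shell helper can list which of the 7 ranks a lineage skips. The function name `missing_ranks` is made up for illustration, and it assumes the usual `k__`, `p__`, ... `s__` prefixes separated by `|`.

```shell
# Hypothetical helper: print the ranks (k p c o f g s) that a
# MetaPhlAn-style lineage skips above its deepest rank.
missing_ranks() {
    printf '%s\n' "$1" | awk -F'[|]' '
    {
        # Record the one-letter rank prefix of every lineage element.
        for (i = 1; i <= NF; i++) seen[substr($i, 1, 1)] = 1
        # Only ranks above the deepest one present count as skipped.
        last = substr($NF, 1, 1)
        n = split("k p c o f g s", ranks, " ")
        out = ""
        for (j = 1; j <= n; j++) {
            if (ranks[j] == last) break
            if (!(ranks[j] in seen)) out = out (out == "" ? "" : " ") ranks[j]
        }
        print out
    }'
}

# The two lineages quoted above (clade-name column only).
missing_ranks "k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422"
missing_ranks "k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|s__Burkholderiales bacterium"
```

For the first lineage this reports `c o f g`, and for the second `f g`, matching the ranks whose totals fall short in the per-rank sums.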
The MetaPhlAn output generated by KMCP is not the same as the one generated by MetaPhlAn. In the KMCP output, the taxid column only contains the taxid of the lowest taxonomic rank (e.g. 1224), while the one generated by MetaPhlAn contains the full lineage, separated by | (e.g. 2|1224). This makes the KMCP output incompatible with TAXPASTA.
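To make the difference between the two taxid styles concrete, here is a minimal sketch that classifies a single taxid field by whether it contains the `|` separator. The function name `taxid_style` and the labels it prints are assumptions for illustration only; neither tool provides such a command.

```shell
# Hypothetical helper: tell the two taxid column styles apart.
taxid_style() {
    case "$1" in
        *'|'*) echo "full lineage" ;;      # e.g. 2|1224 (MetaPhlAn style)
        *)     echo "lowest rank only" ;;  # e.g. 1224 (KMCP style)
    esac
}

taxid_style "1224"
taxid_style "2|1224"
```

Run on the taxid column of a profile, a check like this would show which of the two styles the file uses.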