Profiling output table interpretation

lborcard commented 1 year ago

Dear Shenwei,

Thank you very much for your very nice tool. We are trying to understand how to interpret the output table in KMCP format.

If the output table contains more than one ref per species, which parameter should we use to choose the best hit?
According to your manual, the percentage column refers to the relative abundance of the reference. However, we are not sure how this value is calculated. Could you give us more details about this metric?

Thank you very much,

best,

Loïc

shenwei356 commented 1 year ago

Thanks for using KMCP.

> If the output table contains more than one ref per species based on which parameter should we choose the best hit?

A real genome in a sample may match more than one reference, and we can't tell which one is the true source. However, the similarity score (column score, the 90th percentile of the k-mer coverage of all uniquely matched reads) can indicate which reference is most similar to the real genome.
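To make the score definition concrete, here is a minimal sketch of taking the 90th percentile of per-read k-mer coverage. The nearest-rank percentile method, the function name, and the per-read values are all assumptions for illustration; KMCP's own implementation may compute the percentile differently.

```python
def nearest_rank_percentile(values, p):
    """Return the p-th percentile (integer p, 1..100) using the nearest-rank method.

    Assumption: nearest-rank is used here only for illustration.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, clamped to at least 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical k-mer coverage (in percent) of uniquely matched reads, per reference.
kmer_cov_ref_x = [100, 100, 100, 100, 100, 100, 100, 100, 100, 60]
kmer_cov_ref_y = [70, 80, 85, 90, 90, 95, 95, 95, 95, 100]

score_x = nearest_rank_percentile(kmer_cov_ref_x, 90)
score_y = nearest_rank_percentile(kmer_cov_ref_y, 90)
print(score_x, score_y)
```

In this made-up example, the first reference gets the higher score because most of its uniquely matched reads are fully covered by matching k-mers.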

> According to your manual the percentage column refers to Relative abundance of the reference however, we are not sure how this value is calculated. Could you give us more details about this metric?

First, the coverage (column coverage) of each matched reference genome is computed by dividing the total bases of matched reads by the genome size (the total bases of either a complete genome or an unfinished genome such as a MAG, with plasmid sequences filtered out). Then the relative abundance of a species is computed by dividing the sum of the genome coverages of this species by the sum of the genome coverages of all genomes. Finally, the relative abundance of a taxon at each rank is the sum of the percentages of all its child taxa.
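The three steps can be sketched roughly as follows. This is only an illustration of the arithmetic described above, not KMCP's code: the dictionary keys (`species`, `matched_bases`, `genome_size`), the function names, and all numbers are hypothetical.

```python
from collections import defaultdict


def genome_coverage(matched_bases, genome_size):
    """Step 1: total bases of matched reads divided by the genome size."""
    return matched_bases / genome_size


def species_relative_abundance(refs):
    """Step 2: per-species coverage sum divided by the coverage sum of all genomes (in %)."""
    cov_by_species = defaultdict(float)
    for r in refs:
        cov_by_species[r["species"]] += genome_coverage(r["matched_bases"], r["genome_size"])
    total = sum(cov_by_species.values())
    return {sp: 100.0 * cov / total for sp, cov in cov_by_species.items()}


def roll_up(abundance, parent_of):
    """Step 3: a parent taxon's abundance is the sum of its children's percentages."""
    parent_abundance = defaultdict(float)
    for child, pct in abundance.items():
        parent_abundance[parent_of[child]] += pct
    return dict(parent_abundance)


# Hypothetical references: two genomes of species A, one of species B.
refs = [
    {"ref": "refA1", "species": "species A", "matched_bases": 6_000_000, "genome_size": 2_000_000},
    {"ref": "refA2", "species": "species A", "matched_bases": 2_000_000, "genome_size": 2_000_000},
    {"ref": "refB1", "species": "species B", "matched_bases": 4_000_000, "genome_size": 4_000_000},
]
abundance = species_relative_abundance(refs)

# Hypothetical taxonomy: both species belong to the same genus.
genus = roll_up(abundance, {"species A": "genus X", "species B": "genus X"})
print(abundance, genus)
```

Here species A has coverages 3 and 1 (sum 4) and species B has coverage 1, so their shares of the total coverage of 5 are 80% and 20%.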

lborcard commented 1 year ago

Thank you for the swift reply. If we have several refs with a score of 100, what would be the second metric to use to filter them? Would coverage be a good one to use?

shenwei356 commented 1 year ago

I think so.
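For anyone wanting to apply this, a minimal sketch of picking one reference per species by score first and coverage as the tie-breaker could look like the following. The row layout is an assumption: the keys `ref`, `score`, and `coverage` mirror the columns mentioned in this thread, while `species` and the values are hypothetical.

```python
def best_reference_per_species(rows):
    """Keep, for each species, the row with the highest (score, coverage) pair."""
    best = {}
    for row in rows:
        current = best.get(row["species"])
        candidate_key = (row["score"], row["coverage"])
        if current is None or candidate_key > (current["score"], current["coverage"]):
            best[row["species"]] = row
    return best


# Hypothetical profile rows: two references tie on score for species A.
rows = [
    {"ref": "refA1", "species": "species A", "score": 100.0, "coverage": 1.2},
    {"ref": "refA2", "species": "species A", "score": 100.0, "coverage": 3.4},
    {"ref": "refA3", "species": "species A", "score": 95.0, "coverage": 9.9},
    {"ref": "refB1", "species": "species B", "score": 88.0, "coverage": 0.7},
]
best = best_reference_per_species(rows)
for species, row in best.items():
    print(species, row["ref"])
```

Note that coverage only breaks ties here: a reference with a lower score is never preferred, however high its coverage.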

shenwei356 / kmcp

Profiling output table interpretation #22