sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

question about the output when using the sourmash gather #3242

Closed sapuizait closed 3 months ago

sapuizait commented 3 months ago

Hi again

Apologies if this has been asked before, but I couldn't find an answer that satisfied my curiosity... I would just like to make sure I understand the output of sourmash gather.

I read your nice comment here: https://github.com/sourmash-bio/sourmash/issues/1289. I now understand the following (please correct anything I have misinterpreted):

p_match is the fraction of the reference genome covered (it's what I call alignment coverage: how much of the reference sequence length is covered).

p_query is the percentage of reads successfully mapped to the reference. This percentage can only keep dropping, because reads that have already been mapped cannot be mapped again, especially when mapping to similar reference genomes. Is that right?

Now, is "the recovered matches hit XX% of the query k-mers (unweighted)" the sum of the p_query values?

Is avg-abundance the sequencing depth (how many k-mers mapped to the reference)?

And what about the abundance-weighted query? Is that how many reference genomes gave positive matches, or is it related to the metagenome reads?

thanks in advance P

ctb commented 3 months ago

p_match is the fraction of the reference genome covered (it's what I call alignment coverage - how much of the reference sequence length is covered),

yes!

p_query is the percentage of reads successfully mapped to the reference (this is a percentage that will only keep dropping because if reads have already been mapped there can only be less and less mapping - especially when it comes to mapping to similar reference genomes)?

yes! it corresponds to the f_unique_weighted column in the gather CSV output.

Now, is the "the recovered matches hit XX% of the query k-mers (unweighted)" the sum of the p_query?

no, the weighted percentage is the sum of the f_unique_weighted values; the unweighted version that you're asking about is based on the total number of distinct k-mers in the metagenome, without accounting for how many times each one is seen.
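To make the weighted versus unweighted distinction concrete, here is a minimal sketch on toy data. The hash values, abundances, and the helper name `matched_fractions` are all hypothetical and are not part of sourmash; the point is only that the unweighted number counts each distinct k-mer once, while the weighted number counts each k-mer as many times as it was seen.

```python
# Hypothetical toy data: hash value -> abundance in the metagenome sketch.
query_abunds = {101: 10, 102: 1, 103: 1, 104: 8}

# Hypothetical set of query hashes that were found in some reference genome.
matched_hashes = {101, 104}


def matched_fractions(query_abunds, matched_hashes):
    """Return (unweighted, weighted) fractions of the query that matched."""
    hits = [h for h in query_abunds if h in matched_hashes]
    # Unweighted: every distinct k-mer counts once.
    unweighted = len(hits) / len(query_abunds)
    # Weighted: every k-mer counts as many times as it was observed.
    weighted = sum(query_abunds[h] for h in hits) / sum(query_abunds.values())
    return unweighted, weighted


unweighted, weighted = matched_fractions(query_abunds, matched_hashes)
print(f"unweighted: {unweighted:.1%}, weighted: {weighted:.1%}")
```

With these made-up numbers, half of the distinct k-mers match (50.0%), but because the matching k-mers are the high-abundance ones, they account for 90.0% of the abundance-weighted total.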

avg-abundance is sequencing depth? (how many kmers mapped to the reference)

Yes, it should correspond to the depth of the mapped reads at that rank in the gather result, i.e. after reads mapped to genomes at earlier ranks are removed.
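The "after earlier ranks are removed" part can be sketched with a toy greedy loop. This is a simplified illustration under stated assumptions, not sourmash's actual implementation: the function `toy_gather`, the hash values, and the genome names are hypothetical, and the output keys merely echo the terms used in this thread.

```python
# Hypothetical toy data: metagenome hash -> abundance, and genome -> hash set.
query = {1: 10, 2: 10, 3: 10, 4: 2, 5: 2, 6: 1}
genomes = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {99}}


def toy_gather(query, genomes):
    """Greedy sketch: at each rank, take the genome that covers the most
    still-unassigned query hashes, then remove those hashes from the query."""
    remaining = dict(query)
    total_weight = sum(query.values())
    results = []
    while remaining:
        # Overlap of every genome with the hashes that are still unassigned.
        name, hits = max(
            ((n, hs & set(remaining)) for n, hs in genomes.items()),
            key=lambda item: len(item[1]),
        )
        if not hits:
            break  # nothing left that any genome can explain
        weight = sum(remaining[h] for h in hits)
        results.append({
            "rank": len(results),
            "name": name,
            # Mean abundance over the hashes assigned at this rank only.
            "average_abund": weight / len(hits),
            # Share of the abundance-weighted query assigned at this rank.
            "f_unique_weighted": weight / total_weight,
        })
        for h in hits:
            del remaining[h]
    return results


results = toy_gather(query, genomes)
for row in results:
    print(row)
```

In this toy run, genome B shares hashes 3 and 4 with genome A, but those are already assigned to A at rank 0, so B's average abundance at rank 1 is computed from hash 5 alone.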

But, what about the abundance-weighted query? is that how many reference genomes gave positive matches? or is it related to the metagenome reads?

that's the sum of the f_unique_weighted values; it corresponds to the fraction of the total metagenome reads that will map to the genomes detected by the gather algorithm.
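As a rough sketch of how one might check this sum from a gather CSV: the rows below are made up for illustration, and the inline text stands in for a real output file. The `f_unique_weighted` column is the one named above; `f_unique_to_query` is assumed here to be its unweighted counterpart in the same CSV, so verify the column names against your own output before relying on this.

```python
import csv
import io

# Made-up rows standing in for a real gather CSV (values are illustrative).
gather_csv = io.StringIO(
    "name,f_unique_to_query,f_unique_weighted\n"
    "genome_A,0.20,0.40\n"
    "genome_B,0.10,0.25\n"
    "genome_C,0.05,0.05\n"
)

rows = list(csv.DictReader(gather_csv))

# Sum over all ranks: abundance-weighted and unweighted fractions of the query.
weighted_total = sum(float(r["f_unique_weighted"]) for r in rows)
unweighted_total = sum(float(r["f_unique_to_query"]) for r in rows)

print(f"abundance-weighted fraction of query matched: {weighted_total:.1%}")
print(f"unweighted fraction of query k-mers matched:  {unweighted_total:.1%}")
```

For a real run, replace the `io.StringIO` object with an open file handle on the CSV written by gather.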

I hope that helps! I know it's complicated 😭

ctb commented 3 months ago

This section of the FAQ might be useful reading: https://sourmash.readthedocs.io/en/latest/faq.html#how-do-k-mer-based-analyses-compare-with-read-mapping

sapuizait commented 3 months ago

that is excellent - sourmash will be my go-to pipeline from now on :) we just need a few more DBs

ctb commented 3 months ago

great! Please ask questions as you have them - I'll close this, but you can ask here, or start a new issue with new questions!