sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

adjust default thresholding for prefetch and gather? #2360

Open ctb opened 1 year ago

ctb commented 1 year ago

based on some of the results from the new long-read benchmarking paper (Portik et al.; link - not yet updated), @bluegenes and I were discussing ways to decrease the false positive rate for prefetch/gather when using very low thresholds.

right now, the "best" way to lower the threshold-bp is to set it to 0 and then filter afterwards. but this incurs really heavy penalties in time and memory - e.g. for the nanopore Q20 Zymo dataset, we end up matching 80% of the database with threshold-bp=0!

on slack, I wrote:

> hot take is that we might want to provide an alternate thresholding mechanism based on the number of hashes. 1 hash causes problems; 2 hashes should be much better. my intuition is that it should be ~irrespective of the scaled factor unless scaled is very small (such that k-mers overlap).

this might also be important for virus/phage matches.

the UX is a bit hard to fathom - I feel like --threshold-bp provides some kind of useful intuition! - but I can see a few different options that might work -
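
For intuition on how the two schemes relate: a minimal sketch (mine, not sourmash's actual code), assuming prefetch/gather accept a match when (number of matching hashes) * scaled >= threshold-bp:

    def min_hashes_for_threshold_bp(threshold_bp, scaled):
        # the equivalent hash-count threshold is a ceiling division,
        # with a floor of 1 hash (threshold-bp=0 accepts any overlap)
        return max(1, -(-threshold_bp // scaled))

    # the 50kb default at scaled=1000 requires 50 matching hashes;
    # threshold-bp=0 accepts a single hash, hence the blowup above
    assert min_hashes_for_threshold_bp(50_000, 1_000) == 50
    assert min_hashes_for_threshold_bp(0, 1_000) == 1

    # "2 hashes" means the same thing at any scaled value, while the
    # equivalent threshold-bp changes with scaled:
    assert min_hashes_for_threshold_bp(2_000, 1_000) == 2
    assert min_hashes_for_threshold_bp(20_000, 10_000) == 2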

bluegenes commented 1 year ago

+1 to the idea of having default 2- or 3-hash filtering in place, unless disabled

ctb commented 1 year ago

my intuition suggests that, as long as the hashes are entirely independent (i.e. come from non-overlapping k-mers), this should be very reliable. @dkoslicki, we'll pass you the long-read paper results when they are posted to bioRxiv - I think there's some interesting implications for classification, and maybe even minimizer design, that might emerge from thinking this through.

taylorreiter commented 1 year ago

Oooh I also love this idea! Matches my intuition from digging around in charcoal results too, where we frequently have 2-3 matches from small contigs.

ctb commented 1 year ago

I put together some notebooks on detection sensitivity for 100bp reads and for 10kb reads.

As expected, genomes can be detected very sensitively at very, very low coverage.

I think it will be easy to show that detection of multiple independent k-mers for large k will be incredibly specific, too. But I haven't done that yet.
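
A back-of-envelope version of the sensitivity argument (my sketch, not the notebooks' code; it assumes each of a genome's ~genome_size/scaled sketch hashes shows up independently with probability equal to the fraction of the genome covered by reads):

    import math

    def prob_detect_at_least(t, genome_size, scaled, covered_fraction):
        # the number of hashes detected is ~Binomial(n, covered_fraction),
        # where n is the genome's sketch size at this scaled value
        n = genome_size // scaled
        cdf = sum(math.comb(n, k)
                  * covered_fraction**k * (1 - covered_fraction)**(n - k)
                  for k in range(t))
        return 1 - cdf

    # a 5 Mb genome at scaled=1000 carries ~5000 hashes; with only 0.1%
    # of its bases covered, seeing 2+ of them is still near-certain:
    print(prob_detect_at_least(2, 5_000_000, 1_000, 0.001))  # ~0.96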

ctb commented 1 year ago

long read paper:

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets: https://www.biorxiv.org/content/10.1101/2022.01.31.478527v2

Relevant Figure 4 showing that sourmash has weirdly good performance 😆 -

[screenshot: Figure 4 from the preprint linked above]

ctb commented 1 year ago

theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.

calculation of said limit is left as an exercise to the reader, but it should be as simple as:

int(math.log(avg_genome_size, 4) + 1)

might want to pick an odd k-mer size 🤔
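
Spelling that out: a random k-mer is expected to occur ~genome_size / 4**k times, so we want the smallest k with 4**k > genome_size. A quick sketch:

    import math

    def smallest_unique_k(avg_genome_size):
        # smallest k such that 4**k > avg_genome_size
        return int(math.log(avg_genome_size, 4) + 1)

    k = smallest_unique_k(5_000_000)  # 12 for a 5 Mb bacterial genome
    if k % 2 == 0:
        k += 1   # round up to odd (odd k avoids rev-comp palindromes) -> 13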

dkoslicki commented 1 year ago

Just catching up on this

> @bluegenes and I were discussing ways to decrease the false positive rate for prefetch/gather when using very low thresholds.

Funnily enough, a math PhD student and I have been working on this exact thing this past year! The basic idea boils down to: if you set an ANI threshold (say, 0.95) by which you decide if something "matches" to your reference, you can use the stats from Mahmudur and Tessa's paper to effectively answer the question "have I seen enough k-mers such that I can claim this organism is in the sample?" This boils down to a quantile regression, so relatively straightforward. We're in the final stages of validating it on real data (annoyingly, there are normalization issues since "relative abundance" can be calculated in many different ways), but happy to discuss further if you're interested! In essence, it calculates the --threshold-bp per reference genome, based on the individual reference genomes' sketch/k-mer stats.
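
For concreteness, the expected-count core of that idea, sketched under a simple independent-mutation model (the quantile regression is what turns this into a proper per-genome threshold):

    def expected_matching_hashes(n_hashes, ani, ksize):
        # at ANI a, each k-mer is shared with probability ~a**ksize, so
        # a reference sketch with n_hashes hashes is expected to share
        # n_hashes * a**ksize of them with the sample
        return n_hashes * ani**ksize

    # a 5 Mb reference at scaled=1000 (~5000 hashes), ANI cutoff 0.95, k=31:
    print(expected_matching_hashes(5_000, 0.95, 31))  # ~1019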

> theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.

Assuming genomes are random strings, correct? Otherwise, repeat lengths are going to cause an issue.

ctb commented 1 year ago

> Funnily enough, a math PhD student and I have been working on this exact thing this past year! [...] In essence, it calculates the --threshold-bp per reference genome, based on the individual reference genomes' sketch/k-mer stats.

Awesome! I suspect we are coming at similar questions from very different directions - would be good to talk when you're ready, or I can send you stuff in a little bit!

> theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.

> Assuming genomes are random strings, correct? Otherwise, repeat lengths are going to cause an issue.

Mmmh, I'm tempted to break this down in a slightly different way than I think you're thinking -

My overactive imagination is guessing that you're trying to tackle (2) and (3) at the same time?

Anyway, repeats only matter for (3), not for (2). :)

More soon. I appear to be fixated on this at the moment.

ctb commented 1 year ago

@dkoslicki you might be interested in this line of thinking too - unicity distances.

https://github.com/ctb/2022-sourmash-sens-spec/issues/1

ctb commented 1 year ago

one of my (our!) conclusions from digging into why sourmash performs well in the above paper is that even low thresholds on combinatorial collections of hashes are really effective. we're also coming to some really interesting taxonomic conclusions that I have to dive into before I can communicate them effectively and concisely.

but, a conversation on lab slack made me connect the question of prefetch/gather thresholding in general to the more specific question of what thresholding to use for decontamination with charcoal, and it is interesting to remember that we dropped single-hash / LCA-style methods completely from charcoal in favor of requiring 3 hashes to call the taxonomy of a contig. in brief, this is because LCA-style methods were far too prone to false positives! I have to think about how to integrate that into my new understanding of things...

links:

anyway, there's no big surprise here, other than to say that taxonomies, genome contamination, and taxonomy quality issues loom large when you start looking at outliers in genome identification.

ctb commented 3 months ago

Answer to a question in the Element/Matrix group:

> I have a question regarding the gather and search commands. I am using the default values for the threshold for gather and search, but it is not clear to me how the default values of 50kb and 0.08 were determined and how they relate to the confidence I can have that the top result is the correct match to the query signature. Will you please clarify this for me?

We have some discussion of the 'gather' threshold here, https://github.com/sourmash-bio/sourmash/issues/2360, which also relates to the threshold used for search --containment and prefetch (which are the same underneath). In brief, we see very few false positives at scaled=1000 and threshold-bp=3000 for these commands. Simulations, theory, and benchmarking all suggest that at k=31 you have essentially zero false positives; the biggest problem we find is aberrant similarities due to e.g. contamination in MAGs.

The 50kb thus relates more to sensitivity and speed issues - 50kb is very low for comparing two bacterial genomes, or a metagenome <-> bacterial genome, since you will detect overlaps down to about 1% of an E. coli genome (50 kb out of 5 Mb). You should raise it if you don't care to detect that small an overlap! Lowering it may be necessary for smaller genomes like viruses, but we haven't worked out the parameters for viruses yet.
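
Quick arithmetic on why that default breaks down for small genomes (illustrative genome sizes):

    # smallest detectable overlap, as a fraction of the genome:
    threshold_bp = 50_000
    for genome_size in (5_000_000, 50_000, 5_000):
        print(threshold_bp / genome_size)
    # -> 0.01 for a 5 Mb bacterium, 1.0 for a 50 kb genome, and 10.0
    #    for a 5 kb virus: the default can't detect genomes smaller
    #    than 50 kb at all, hence the need to lower it for viruses.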

Last but not least, lowering the threshold for gather makes it much slower.

So there's no inherent reason we chose 50kb; it just matched the domain of life we were exploring (bacteria/archaea) and was a good default for speed reasons.

In terms of the Jaccard threshold for search, there's no real reason for the 0.08 default. You can set it to 0 if you like without impacting speed much, and there's no empirical reason not to. It's just a question of how small a similarity you want to detect. The two metrics you can use to assess specificity are (1) mash distance (for which you should read the mash paper!) and (2) ANI, which relates to the biology you want to study.

I also should say I don't find Jaccard itself to be very useful, in general. ANI is a much more intuitively useful measure and you can match to genomes at 90% ANI or above with k-mer based analysis tools like sourmash.
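
For reference, the standard k-mer containment-to-ANI point estimate (ignoring the confidence-interval machinery sourmash also provides):

    def containment_to_ani(containment, ksize):
        # if each k-mer is shared with probability ~ANI**ksize, then
        # ANI ~ containment ** (1 / ksize)
        return containment ** (1.0 / ksize)

    print(containment_to_ani(0.20, 31))  # ~0.95 ANI at k=31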

You might also be interested in YACHT, which provides a statistical framework for all of this - https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873.

Whenever we've done any kind of detailed validation on a small scale, we've found that shared k-mers strongly imply shared nucleotide content; this is the basis of lots of software, including ganon and sylph. How much is significant depends on what question you're asking.

Hope that helps :). Apologies for all the words!