sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
478 stars 79 forks source link

what to do about ANI estimate for two very small scaled sketches? #2003

Open bluegenes opened 2 years ago

bluegenes commented 2 years ago

With #1967, we will now estimate ANI for any scaled sketch comparisons, regardless of sketch size. These estimates may be inaccurate for viruses/small genomes.

context from https://github.com/sourmash-bio/sourmash/pull/1967#discussion_r859966317: @ctb:

A question that I didn't see clearly addressed anywhere (but I might have missed!) is what happens when you try to calculate ANI for two very small scaled sketches? Does it just end up being 0, or None, or ?

@bluegenes: At the moment, we just report the ANI, except for extremely tiny test data where we can't actually estimate ANI.

For jaccard --> ANI, we estimate the error on the jaccard estimate itself, and raise a warning when the error may be too high (but still currently report ANI). I have an item in the SearchResult class that keeps track of whether the jaccard estimate error is too high -- I think we should at least consider doing that, but would also be open to zeroing out the ANI estimate.

From https://github.com/sourmash-bio/sourmash/issues/1798#issuecomment-1015762466:

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements. Interestingly, this approach will always be less space efficient than HLL. HLL experiences a relative error of something like 1.04/Sqrt(m) for a sketch of size m. For FracMinHash, you end up with a relative error of (at absolute minimum, often much worse) 2.07944/Sqrt(m) for a sketch of size m.

I was hoping we might be able to use HLL to avoid issues with small sketches, but I suppose instead we could use this to estimate an error based on the sketch size, and raise a warning when the error/ zero out the ANI when the error is too high?

ctb commented 2 years ago

Hot take: Warning + None-ing it out as in #2004 seems good for now.

bluegenes commented 2 years ago

Handled by #2032.

bluegenes commented 2 years ago

This is happening much more often than I expected, and for some applications (prefetch, gather), can yield very verbose output (#2058).

Can deal with verbosity by changing the warning strategy, but I'm not sure zeroing out is the right call, if it's happening this often...

bluegenes commented 2 years ago

thresholds modified in #2074