sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

debiasing FracMinHash - plans and progress #1798

Open ctb opened 2 years ago

ctb commented 2 years ago

In re this paper,

Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

the question arose: is this being implemented in sourmash? See the Twitter thread and @luizirber's draft PR https://github.com/sourmash-bio/sourmash/pull/1270.

quoth @dkoslicki -

Planned. Should be really straightforward to implement, as sourmash already had some cardinality estimators built in, iirc

and @luizirber -

Plan was to avoid changing the sig format because it could be calculated from the MH data, instead of building a HLL during sketching and saving it too (or just the # of unique k-mers from the HLL)


also for grins, check out this paper:

The minimizer Jaccard estimator is biased and inconsistent

dkoslicki commented 2 years ago

@luizirber I recall you mentioning trying to estimate cardinality from the sketches themselves. However, in PR #1270 I see a bunch of HLL references. IMHO, the most straightforward implementation would be to calculate the cardinality to high precision when forming the sketches, but I see how that would break backward compatibility.

Did you come up with a mathematically sound way to calculate cardinality from the sketches, or would you like @mahmudhera and me to take a swing at it? I.e., sketch_size/scale_factor is an intuitive estimate of the true cardinality, but it's not immediately obvious whether its variance is better or worse than HLL's.

dkoslicki commented 2 years ago

Ok, turns out this is easier than I expected: the sketch_size/scale_factor estimator obeys a nice two-sided Chernoff bound,

Pr( | |X|/s - |A| | >= \delta |A| ) <= 2 exp( -\delta^2 s |A| / 3 ),

where |X|/s is the sketch_size/scale_factor estimator, \delta is the relative error, and |A| is the true set cardinality. So you don't really run into issues unless the underlying cardinality |A| is small.

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.
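
As a quick sanity check, inverting the bound above for the minimum set size (a rough Python sketch, not sourmash code) reproduces that number:

    from math import log

    def min_set_size(s, rel_err, confidence):
        # Smallest |A| for which 2 * exp(-rel_err**2 * s * |A| / 3) <= 1 - confidence,
        # i.e. for which |X|/s is within rel_err of |A| with the given confidence.
        return 3 * log(2 / (1 - confidence)) / (rel_err ** 2 * s)

    print(min_set_size(1/1000, 0.05, 0.95))   # ~4.4e6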

Interestingly, this approach will always be less space-efficient than HLL: HLL has a relative error of roughly 1.04/sqrt(m) for a sketch of size m, whereas for FracMinHash you end up with a relative error of (at absolute minimum, often much worse) 2.07944/sqrt(m) for a sketch of size m.

The interesting wrinkle is that in practice, we don't know |A| a priori. Besides the aforementioned backward-compatibility issues, I would then lean towards HLL, unless you want to pass over the data twice: once to get a good estimate of |A|, and then again to sketch with a scale_factor chosen from that estimate and the desired relative error.

bluegenes commented 2 years ago

This would be really helpful for #1788 -- I've currently implemented this using n_unique_kmers = scaled * len(minhash).

dkoslicki commented 2 years ago

@bluegenes what would be helpful in this regard? Since this basically boils down to a binomial random variable, whatever you would like can be calculated explicitly. E.g., do you want confidence intervals around the number of unique k-mers? Specification of the scale factor given a guess about the number of unique k-mers? A bound on how well you can estimate the number of unique k-mers given a scale factor? Etc.

Just let me know and I can send you a Python class implementing whatever you'd like.
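
For instance, a rough sketch of what such a class could look like, built only on the Chernoff bound above (names and interface are hypothetical, not the eventual implementation):

    from math import log, sqrt

    class FracMinHashCardinality:
        # Hypothetical helpers around the |X|/s estimator and its Chernoff bound.
        def __init__(self, s):
            self.s = s  # sampling fraction, e.g. 1/1000

        def estimate(self, sketch_size):
            # |X|/s as the point estimate of the number of unique k-mers.
            return sketch_size / self.s

        def rel_err_bound(self, set_size, confidence=0.95):
            # Smallest delta with 2*exp(-delta^2 * s * |A| / 3) <= 1 - confidence.
            return sqrt(3 * log(2 / (1 - confidence)) / (self.s * set_size))

        def confidence_interval(self, sketch_size, confidence=0.95):
            # Two-sided interval around the point estimate, plugging the estimate
            # in for the unknown |A| (see the circularity caveat further down).
            est = self.estimate(sketch_size)
            delta = self.rel_err_bound(est, confidence)
            return est * (1 - delta), est * (1 + delta)

        @staticmethod
        def scale_factor_for(set_size, rel_err, confidence=0.95):
            # Smallest sampling fraction s that achieves rel_err at this confidence.
            return 3 * log(2 / (1 - confidence)) / (rel_err ** 2 * set_size)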

bluegenes commented 2 years ago

Thanks @dkoslicki!

I guess my ideal situation would be implementing HLL cardinality estimation going forward, so we could use it for ANI estimation, at least for new sketches. To avoid backward incompatibility, we could fall back to sketch-based estimation when necessary. But there are design and compatibility concerns at play, so I'll leave that up to @ctb! I mostly wanted to drop my branch here as motivation for any decisions/progress there.

If that's not an option, perhaps a bound on how much this k-mer estimation affects ANI estimation would be helpful?

bluegenes commented 2 years ago

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.

@mahmudhera @dkoslicki since we don't have HLL yet, I think we need to alert the user when the scaling factor is likely to be insufficient for the dataset. So in the example above, if the set has fewer than ~4.4x10^6 elements, we could raise a warning during sketching (and later prevent the user from using this sketch for ANI estimation, ref #2003).

Would you be able to help me with a function for this?
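
To make the request concrete, something along these lines is what I have in mind (a rough sketch with placeholder names, using num_hashes * scaled as the size estimate; not meant as the final API):

    import warnings
    from math import log

    def check_scaled_ok(num_hashes, scaled, rel_err=0.05, confidence=0.95):
        # Estimate |A| as num_hashes * scaled and compare it against the
        # Chernoff threshold from above; warn if the sketch looks too small to trust.
        est_size = num_hashes * scaled
        min_size = 3 * log(2 / (1 - confidence)) * scaled / rel_err ** 2
        if est_size < min_size:
            warnings.warn(f"only ~{est_size} unique k-mers estimated; need more than "
                          f"{min_size:.3g} for {rel_err:.0%} relative error at "
                          f"{confidence:.0%} confidence")
            return False
        return True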

dkoslicki commented 2 years ago

@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (e.g., knowing whether there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

bluegenes commented 2 years ago

@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (e.g., knowing whether there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

I've been estimating via num_hashes * scaled for all equations -- will that work? If yes, then that function would be great!

dkoslicki commented 2 years ago

I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity whose variance we are trying to bound with the Chernoff bound. It should be ok in many cases, but I would want to make a note that this should be fixed after HLL is implemented. Is HLL on the roadmap?

bluegenes commented 2 years ago

I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity whose variance we are trying to bound with the Chernoff bound. It should be ok in many cases, but I would want to make a note that this should be fixed after HLL is implemented. Is HLL on the roadmap?

tagging in @ctb!

ctb commented 2 years ago

I have to dig into how to best add such information into the signature format. So it's definitely not a next-week kind of change, but it is on my list.

After that, it's mostly a versioning challenge, and should be annoying and detail oriented, but not that hard ;).

dkoslicki commented 2 years ago

Gotcha! Then I'll throw together a function that has the desired behavior, and you/we can plop in the HLL estimate when it eventually gets implemented

dkoslicki commented 2 years ago

@bluegenes see PR #2031 for the function. Once you place it where it needs to go in the main codebase, I can add the de-bias term, or you can if you like. The idea will be to:

  1. Check if we can trust num_hashes * scale as an estimate of |A|
  2. Check if 1-(1-s)^|A| is sufficiently close to 1, and/or just divide by this value, since it will likely be extremely close to 1 if |A| is large enough (see the sketch below)
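
A rough sketch of step 2 (scaled here is the sourmash-style integer, so the sampling fraction s is 1/scaled, and n_kmers is whatever estimate of |A| step 1 produces; names are illustrative only):

    def debias_factor(scaled, n_kmers):
        # Bias-correction term 1 - (1 - 1/scaled)^|A| from the paper; it approaches 1
        # as |A| grows, so dividing by it only changes things for small sets.
        return 1 - (1 - 1 / scaled) ** n_kmers

    # e.g. with scaled=1000: ~0.86 for |A| = 2000, but 1.0 to machine precision
    # for |A| around 5e6, where the correction no longer matters.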

dkoslicki commented 2 years ago

Oh yeah, and once @mahmudhera is available again, we will want to discuss where to account for the probability of corner cases (e.g., high ANI and a high scale factor leading to sketches that can't distinguish between inputs).

dkoslicki commented 2 years ago

@bluegenes Should we leave this open since the bias part hasn't been implemented yet, à la this comment?

dkoslicki commented 2 years ago

I imagine that we would just need to change this line to have the bias term in it:

return self.count_common(other, downsample) / (len(self) * (1 - (1 - 1/s)^|A|))

with the logic around that to make sure the estimate of |A| is ok. Do you want me to take a crack at that, or would you like to @bluegenes ?

bluegenes commented 2 years ago

ah, right, so something like the following --

if self.size_is_accurate():
    return self.count_common(other, downsample) / (len(self) * (1 - (1 - 1/self.scaled) ** (len(self) * self.scaled)))

@ctb should we not return containment if size is inaccurate? Or return containment but also notify users with a warning?

This is a pretty big change in terms of handling of very small sketches -- might affect the majority of our tests...

ctb commented 2 years ago

no idea! warning for sure.

dkoslicki commented 2 years ago

@bluegenes yup, that looks correct! And I think a warning makes the most sense too: if self.size_is_accurate() is false, it just means the estimate has high variance, so it could be inaccurate, but it isn't guaranteed to be.

bluegenes commented 2 years ago

#2032 adds functionality to warn users about small / potentially inaccurate sketches and prevent ANI estimation from these.

However, adding the de-biasing term to our containment function would change our output for small sketches. For semantic versioning reasons, it may be best to warn users for now and make this change (and/or HLL cardinality estimation) in the next major release. cc @dkoslicki @ctb