gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

Feature request - add a statistical cutoff for cluster merging? #38

Open darachm opened 3 years ago

darachm commented 3 years ago

Hey folks, big fans of starcode. I really like this tool (well designed, Levenshtein distance, code is readable), and I'd like to use it for all my seq clustering needs. I hope y'all are continuing active development towards Starcode2 ... ?

I was reading the Bartender clustering paper ( doi.org/10.1093/bioinformatics/btx655 ), and their author-run benchmarks show Bartender working better for low-abundance clustering. One main difference between the tools that might explain this is Bartender's use of a "Z-statistic" for cluster merging. I believe other tools use similar statistical cutoffs (DADA2?), while it's my impression that starcode just uses an elegantly simple ratio ( https://github.com/gui11aume/starcode/blob/c30de3d163d0d823a494961f33ce1c42a5b3ca1a/src/starcode.c#L847 ). I need my clustering to handle low-count clusters confidently, so I had a few questions:

Please let me know if you folks have any ideas about this. And again, thanks for your work on this public resource.
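
For concreteness, here is the kind of contrast I have in mind between the two merge criteria. This is only a sketch: the ratio threshold is a placeholder, and the Z-statistic below is a generic count-based stand-in, not Bartender's exact formula.

```python
# Sketch only: a plain abundance-ratio merge test (my reading of starcode's
# criterion, threshold as a placeholder) next to a generic count-based
# Z-statistic (a stand-in for Bartender's test, not its exact formula).
import math

def merge_by_ratio(parent_count, child_count, ratio=5.0):
    # Merge the child into the parent if the parent is `ratio` times more abundant.
    return parent_count >= ratio * child_count

def merge_by_z(parent_count, child_count, p_error=0.01, z_cutoff=3.0):
    # Under a crude binomial error model, about p_error * parent_count reads
    # are expected to mutate away from the parent into any given neighbour.
    expected = p_error * parent_count
    sd = math.sqrt(expected * (1.0 - p_error)) or 1.0
    z = (child_count - expected) / sd
    # A small z means the child's count is consistent with sequencing error.
    return z < z_cutoff

# At low counts the two can disagree: the ratio rule merges a 3-count
# neighbour of a 20-count cluster, the Z version does not.
print(merge_by_ratio(20, 3), merge_by_z(20, 3))
```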

gui11aume commented 3 years ago

Hi @darachm and thanks for your kind words. I apologize for the late reply, I am catching up with the post-COVID-19 mess only now.

We are not planning to release starcode 2 any time soon. We have some idea of what should be done, but chances are infinitesimal that we'll have the time to do it. If someone has the interest and steps forward, I would of course guide them through the improvements (mostly to make a trie that is more compact and more cache-friendly).

I am not surprised that starcode performs badly at low counts. The arbitrary cut-off at 5 reads causes several artefacts that we have already observed. The reason for not fixing it is that we don't know how. More specifically, you have to make some assumptions about the problem at hand, and we preferred not to do that because people use starcode to solve different problems.

For instance, one person may have a very sparse set of barcodes with very high error rate, in which case it would make sense to merge a 2-count barcode with a 1-count barcode at distance 2. Another person may have a dense set of barcodes with very low error rate, in which case the above would not make sense at all.
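
To put rough numbers on it (a back-of-the-envelope, substitution-only calculation with a barcode length and error rates picked purely for illustration): the probability that a single read carries two errors changes by orders of magnitude between the two regimes, and so does the plausibility of that distance-2 merge.

```python
# Back-of-the-envelope illustration, substitution-only model, made-up numbers:
# probability that one read from a cluster picks up exactly two errors and so
# could appear as a distance-2 "neighbour" barcode.
from math import comb

def p_two_errors(barcode_len, per_base_error):
    # Binomial probability of exactly 2 substitutions over the barcode.
    return comb(barcode_len, 2) * per_base_error ** 2 * (1 - per_base_error) ** (barcode_len - 2)

L = 20
for p in (0.1, 0.001):  # "very high" vs "very low" error rate
    print(f"error rate {p}: P(2 errors in a read) = {p_two_errors(L, p):.2e}")
# ~2.9e-01 at 10% error vs ~1.9e-04 at 0.1% error: the same distance-2
# singleton is an entirely plausible error in one regime and almost
# certainly a real barcode in the other.
```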

The Levenshtein distance is not well modeled by any statistical distribution, and there are only a few results about its approximate extreme values. Given simple models for insertions, deletions and substitutions, it is possible to get the quantiles of interest by simulation, so one could answer the question "what is the probability that these sequences are derived from such a cluster / centroid?". The issue is that we would still not be able to answer the question "what is the probability that these sequences form their own centroid?", because for that we would need a model for the distribution of clusters (in line with the point above).
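
To make the simulation idea concrete, here is the flavour of what I mean (a minimal sketch; the substitution / insertion / deletion rates and the barcode length are placeholders):

```python
# Minimal sketch: simulate reads off a centroid under simple per-base
# substitution / insertion / deletion rates (placeholder values), and read off
# quantiles of the resulting Levenshtein distances.
import random

def mutate(seq, p_sub=0.01, p_ins=0.001, p_del=0.001):
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:
            continue                                  # drop the base (deletion)
        if r < p_del + p_ins:
            out.append(random.choice("ACGT"))         # insert a base before it
        # Substitution may re-draw the same base, so the effective rate is
        # 3/4 of p_sub; good enough for a sketch.
        out.append(random.choice("ACGT") if random.random() < p_sub else base)
    return "".join(out)

def levenshtein(a, b):
    # Plain dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(1)
centroid = "".join(random.choice("ACGT") for _ in range(20))
dists = sorted(levenshtein(centroid, mutate(centroid)) for _ in range(10000))
for q in (0.5, 0.95, 0.99):
    print(f"{q:.0%} quantile of the simulated Levenshtein distance:",
          dists[int(q * len(dists)) - 1])
```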

We could still work out some very generic model for the distribution of clusters (e.g., with the assumption that they are uniformly distributed in sequence space), and we could get rough estimates of the insertion, deletion and substitution rates from singletons at distance 1 from a large cluster. That would get us somewhere, but I am still not sure it is a good idea, because we would make too many assumptions about the problem. For instance, we would assume that all the differences between sequences are due to sequencing error, but what if they are also due to PCR errors and the user wants to correct those as well?
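
For what it's worth, the "estimate rates from singletons at distance 1" idea is about as crude as this sketch (toy, hard-coded counts; nothing like this exists in starcode):

```python
# Crude sketch of the estimate described above: take one large cluster and its
# distance-1 singleton neighbours, classify each by the edit that separates it
# from the centroid, and read off a per-read rate for each mechanism.
# All counts are made up; nothing like this exists in starcode.
cluster_size = 10000            # reads assigned to the large centroid
singletons_at_distance_1 = {    # distance-1 neighbours seen exactly once
    "substitution": 95,
    "insertion": 4,
    "deletion": 6,
}

for mechanism, n in singletons_at_distance_1.items():
    # Each such singleton is assumed to be a single erroneous read that escaped
    # the large cluster, so n / cluster_size approximates the per-read rate.
    print(f"{mechanism}: ~{n / cluster_size:.2%} of reads")
```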

Let me know your thoughts, I am curious about your opinion.

darachm commented 3 years ago

I've been keeping this problem in the back of my mind for the last month, but since I was working on this today ( https://github.com/gui11aume/starcode/pull/40#issue-988612039 ), I thought I'd write.

You're very right that the next step would be to sketch this out with simulations. I suppose I (or anyone?) should do this first before coding it up. A toy model would help us understand this better.

I think there are two ways. One is to go simple and just copy the Z-statistic from Bartender; I think I now have enough familiarity with the code base to do this, sort of. The statistic is defined for Hamming distance, but if it works on simulated data then I think it would be good enough to keep low-count over-merging under control.
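
By "works on simulated data" I mean something as blunt as the harness below: simulate reads from a few true barcodes with very different abundances, then ask whether the low-count ones would get merged into their big neighbour. The merge rule here is the plain ratio test as a placeholder (and it ignores the distance condition entirely); it only shows the shape of the check.

```python
# Toy harness for "does the criterion over-merge low-count clusters?".
# The merge rule is the plain ratio test as a placeholder, and the distance
# condition is ignored entirely; this only shows the shape of the check.
import random
from collections import Counter

random.seed(2)
BASES = "ACGT"

def add_errors(seq, p_sub=0.01):
    # Hamming-style substitution errors only.
    return "".join(random.choice(BASES) if random.random() < p_sub else b for b in seq)

# Three true barcodes: abundant, medium, rare (abundances are made up).
true_barcodes = {"".join(random.choice(BASES) for _ in range(20)): n
                 for n in (10000, 50, 3)}

reads = Counter()
for bc, n in true_barcodes.items():
    for _ in range(n):
        reads[add_errors(bc)] += 1

def would_merge(parent_count, child_count, ratio=5.0):
    return parent_count >= ratio * child_count

for bc, n in true_barcodes.items():
    biggest_neighbour = max(reads[other] for other in true_barcodes if other != bc)
    observed = reads[bc]
    print(f"true barcode with {n} reads: observed count {observed}, "
          f"falsely merged? {would_merge(biggest_neighbour, observed)}")
```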

The other is to be more exact. The multiple error mechanisms inside the Levenshtein distance make it hard to do exactly, but maybe the answer is to use user-provided parameters to calculate, for each merge, a likelihood of observing the data given the merge? I'm imagining the program would work sort of like:

I don't know exactly how we'd calculate the likelihood, but it'd probably be something simple like multiplying independent events, I think. I don't have any stats background, but if it matches simulations...
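
Roughly what I mean by "multiplying independent events" (everything here, names, rates, and the independence assumption included, is mine, not anything that exists in starcode):

```python
# Rough sketch of "multiplying independent events": the likelihood that a child
# cluster's reads are all error copies of the parent, given user-supplied
# per-base rates. It ignores which positions the edits land on (no
# combinatorial factor) and treats reads and positions as independent.
def read_error_likelihood(n_sub, n_ins, n_del, barcode_len,
                          p_sub=0.001, p_ins=0.0005, p_del=0.0005):
    # Probability that a single read accumulates exactly these edits.
    p_ok = (1 - p_sub - p_ins - p_del) ** (barcode_len - n_sub - n_ins - n_del)
    return (p_sub ** n_sub) * (p_ins ** n_ins) * (p_del ** n_del) * p_ok

def merge_likelihood(child_count, n_sub, n_ins, n_del, barcode_len=20):
    # If reads are independent, the likelihood that the whole child cluster is
    # made of error copies of the parent is the product over its reads.
    return read_error_likelihood(n_sub, n_ins, n_del, barcode_len) ** child_count

# e.g. a 2-read child sitting one substitution away from the parent:
print(merge_likelihood(child_count=2, n_sub=1, n_ins=0, n_del=0))
```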

To generate per-sample error rates, it shouldn't be too hard to write a utility that analyzes a fixed portion of the amplicon. Such a region should be readily available for most applications of starcode (for DNAseq!): the utility would take a sample of sequences that should all be identical, compare them to the reference, and output the measured error rate for each mechanism.
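
A bare-bones sketch of what I have in mind (hypothetical, nothing like this ships with starcode): align each read of a known constant region against the reference with a plain edit-distance backtrace and tally the edits by type.

```python
# Hypothetical utility sketch: compare reads of a known constant region to the
# reference and tally per-base substitution / insertion / deletion rates.
# The alignment is a bare-bones edit-distance backtrace; the reference and the
# reads below are toy data.
from collections import Counter

def count_edits(ref, read):
    n, m = len(ref), len(read)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # deletion
                          d[i][j - 1] + 1,                                   # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != read[j - 1]))     # (mis)match
    edits = Counter()
    i, j = n, m
    while i > 0 or j > 0:  # backtrace, classifying each difference
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != read[j - 1]):
            if ref[i - 1] != read[j - 1]:
                edits["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits["deletion"] += 1
            i -= 1
        else:
            edits["insertion"] += 1
            j -= 1
    return edits

reference = "ACGTACGTACGTACGTACGT"
reads = ["ACGTACGTACGTACGTACGT",   # perfect read
         "ACGAACGTACGTACGTACGT",   # one substitution
         "ACGTACGTACGTACGACGT"]    # one deletion
totals = Counter()
for r in reads:
    totals.update(count_edits(reference, r))
bases = len(reference) * len(reads)
for mechanism in ("substitution", "insertion", "deletion"):
    print(f"{mechanism} rate: {totals[mechanism] / bases:.3%}")
```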


To summarize: if I have some spare energy, I think I might go about simulating some dense barcode error clouds and calculating some statistics between clusters with known parents.

Any thoughts, comments, or guidance?