gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

Feature request - add a statistical cutoff for cluster merging? #38

Open darachm opened 3 years ago

darachm commented 3 years ago

Hey folks, big fans of starcode. I really like this tool (well designed, Levenshtein distance, code is readable), and I'd like to use it for all my seq clustering needs. I hope y'all are continuing active development towards Starcode2 ... ?

I was reading the Bartender clustering paper ( doi.org/10.1093/bioinformatics/btx655 ), and their author-run benchmarks show Bartender working better for low-abundance clustering. One main difference between the tools that might explain this is Bartender's use of a "Z-statistic" for cluster merging. I believe other tools use similar statistical cutoffs (DADA2?), while it's my impression that starcode just uses an elegantly simple ratio ( https://github.com/gui11aume/starcode/blob/c30de3d163d0d823a494961f33ce1c42a5b3ca1a/src/starcode.c#L847 ). I need my clustering to handle low-count clusters confidently, so I had a few questions:

Please let me know if you folks have any ideas about this. And again, thanks for your work on this public resource.
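
For concreteness, here is the kind of contrast I have in mind between the two merge criteria. This is only a sketch: the ratio threshold is a placeholder, and the Z-statistic below is a generic count-based stand-in, not Bartender's exact formula.

```python
# Sketch only: a plain abundance-ratio merge test (my reading of starcode's
# criterion, threshold as a placeholder) next to a generic count-based
# Z-statistic (a stand-in for Bartender's test, not its exact formula).
import math

def merge_by_ratio(parent_count, child_count, ratio=5.0):
    # Merge the child into the parent if the parent is `ratio` times more abundant.
    return parent_count >= ratio * child_count

def merge_by_z(parent_count, child_count, p_error=0.01, z_cutoff=3.0):
    # Under a crude binomial error model, about p_error * parent_count reads
    # are expected to mutate away from the parent into any given neighbour.
    expected = p_error * parent_count
    sd = math.sqrt(expected * (1.0 - p_error)) or 1.0
    z = (child_count - expected) / sd
    # A small z means the child's count is consistent with sequencing error.
    return z < z_cutoff

# At low counts the two can disagree: the ratio rule merges a 3-count
# neighbour of a 20-count cluster, the Z version does not.
print(merge_by_ratio(20, 3), merge_by_z(20, 3))
```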

gui11aume commented 3 years ago

Hi @darachm and thanks for your kind words. I apologize for the late reply, I am catching up with the post-COVID-19 mess only now.

We are not planning to release starcode 2 any time soon. We have some idea of what should be done, but chances are infinitesimal that we'll have the time to do it. If someone has the interest and steps forward, I would of course guide them through the improvements (mostly to make a trie that is more compact and more cache-friendly).

I am not surprised that starcode performs badly at low counts. The arbitrary cut-off at 5 reads causes several artefacts that we have already observed. The reason for not fixing it is that we don't know how. More specifically, you have to make some assumptions about the problem at hand, and we preferred not to do that because people use starcode to solve different problems.

For instance, one person may have a very sparse set of barcodes with very high error rate, in which case it would make sense to merge a 2-count barcode with a 1-count barcode at distance 2. Another person may have a dense set of barcodes with very low error rate, in which case the above would not make sense at all.
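
To put rough numbers on it (a back-of-the-envelope, substitution-only calculation with a barcode length and error rates picked purely for illustration): the probability that a single read carries two errors changes by orders of magnitude between the two regimes, and so does the plausibility of that distance-2 merge.

```python
# Back-of-the-envelope illustration, substitution-only model, made-up numbers:
# probability that one read from a cluster picks up exactly two errors and so
# could appear as a distance-2 "neighbour" barcode.
from math import comb

def p_two_errors(barcode_len, per_base_error):
    # Binomial probability of exactly 2 substitutions over the barcode.
    return comb(barcode_len, 2) * per_base_error ** 2 * (1 - per_base_error) ** (barcode_len - 2)

L = 20
for p in (0.1, 0.001):  # "very high" vs "very low" error rate
    print(f"error rate {p}: P(2 errors in a read) = {p_two_errors(L, p):.2e}")
# ~2.9e-01 at 10% error vs ~1.9e-04 at 0.1% error: the same distance-2
# singleton is an entirely plausible error in one regime and almost
# certainly a real barcode in the other.
```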

The Levenshtein distance is not well modeled by any statistical distribution, and there are only a few results about its approximate extreme values. Given simple models for insertions, deletions and substitutions, it is possible to get the quantiles of interest by simulation, so one could answer the question "what is the probability that these sequences are derived from such a cluster / centroid?". The issue is that we would still not be able to answer the question "what is the probability that these sequences form their own centroid?", because for that we would need a model for the distribution of clusters (in line with the point above).
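
To make the simulation idea concrete, here is the flavour of what I mean (a minimal sketch; the substitution / insertion / deletion rates and the barcode length are placeholders):

```python
# Minimal sketch: simulate reads off a centroid under simple per-base
# substitution / insertion / deletion rates (placeholder values), and read off
# quantiles of the resulting Levenshtein distances.
import random

def mutate(seq, p_sub=0.01, p_ins=0.001, p_del=0.001):
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:
            continue                                  # drop the base (deletion)
        if r < p_del + p_ins:
            out.append(random.choice("ACGT"))         # insert a base before it
        # Substitution may re-draw the same base, so the effective rate is
        # 3/4 of p_sub; good enough for a sketch.
        out.append(random.choice("ACGT") if random.random() < p_sub else base)
    return "".join(out)

def levenshtein(a, b):
    # Plain dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(1)
centroid = "".join(random.choice("ACGT") for _ in range(20))
dists = sorted(levenshtein(centroid, mutate(centroid)) for _ in range(10000))
for q in (0.5, 0.95, 0.99):
    print(f"{q:.0%} quantile of the simulated Levenshtein distance:",
          dists[int(q * len(dists)) - 1])
```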

We could still work out some very generic model for the distribution of clusters (e.g., with the assumption that they are uniformly distributed in sequence space), and we could get rough estimates of the insertion, deletion and substitution rates from singletons at distance 1 from a large cluster. That would get us somewhere, but I am still not sure it is a good idea, because we would make too many assumptions about the problem. For instance, we would assume that all the differences between sequences are due to sequencing error, but what if they are also due to PCR errors and the user wants to correct those as well?
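
For what it's worth, the "estimate rates from singletons at distance 1" idea is about as crude as this sketch (toy, hard-coded counts; nothing like this exists in starcode):

```python
# Crude sketch of the estimate described above: take one large cluster and its
# distance-1 singleton neighbours, classify each by the edit that separates it
# from the centroid, and read off a per-read rate for each mechanism.
# All counts are made up; nothing like this exists in starcode.
cluster_size = 10000            # reads assigned to the large centroid
singletons_at_distance_1 = {    # distance-1 neighbours seen exactly once
    "substitution": 95,
    "insertion": 4,
    "deletion": 6,
}

for mechanism, n in singletons_at_distance_1.items():
    # Each such singleton is assumed to be a single erroneous read that escaped
    # the large cluster, so n / cluster_size approximates the per-read rate.
    print(f"{mechanism}: ~{n / cluster_size:.2%} of reads")
```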

Let me know your thoughts, I am curious about your opinion.

darachm commented 3 years ago

I've been keeping this problem in the back of my mind for the last month, but since I was working on this today ( https://github.com/gui11aume/starcode/pull/40#issue-988612039 ), I thought I'd write.

You're very right that the next step would be to sketch this out with simulations. I suppose I (or anyone?) should do this first before coding it up. A toy model would help us understand this better.

I think there are two ways. One is to go simple and just copy the Z-statistic from Bartender; I think I now have enough familiarity with the code base to do this, sort of. The statistic is defined for Hamming distance, but if it works on simulated data then I think it would be good enough to keep low-count over-merging under control.
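
By "works on simulated data" I mean something as blunt as the harness below: simulate reads from a few true barcodes with very different abundances, then ask whether the low-count ones would get merged into their big neighbour. The merge rule here is the plain ratio test as a placeholder (and it ignores the distance condition entirely); it only shows the shape of the check.

```python
# Toy harness for "does the criterion over-merge low-count clusters?".
# The merge rule is the plain ratio test as a placeholder, and the distance
# condition is ignored entirely; this only shows the shape of the check.
import random
from collections import Counter

random.seed(2)
BASES = "ACGT"

def add_errors(seq, p_sub=0.01):
    # Hamming-style substitution errors only.
    return "".join(random.choice(BASES) if random.random() < p_sub else b for b in seq)

# Three true barcodes: abundant, medium, rare (abundances are made up).
true_barcodes = {"".join(random.choice(BASES) for _ in range(20)): n
                 for n in (10000, 50, 3)}

reads = Counter()
for bc, n in true_barcodes.items():
    for _ in range(n):
        reads[add_errors(bc)] += 1

def would_merge(parent_count, child_count, ratio=5.0):
    return parent_count >= ratio * child_count

for bc, n in true_barcodes.items():
    biggest_neighbour = max(reads[other] for other in true_barcodes if other != bc)
    observed = reads[bc]
    print(f"true barcode with {n} reads: observed count {observed}, "
          f"falsely merged? {would_merge(biggest_neighbour, observed)}")
```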

The other is to be more exact. The multiple error mechanisms inside the Levenshtein distance make it hard to do exactly, but maybe the answer is to use user-provided parameters to calculate, for each merge, a likelihood of observing the data given the merge? I'm imagining the program would work sort of like:

I don't know exactly how we'd calculate the likelihood, but it'd probably be something simple like multiplying independent events, I think. I don't have any stats background, but if it matches simulations...
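
Roughly what I mean by "multiplying independent events" (everything here, names, rates, and the independence assumption included, is mine, not anything that exists in starcode):

```python
# Rough sketch of "multiplying independent events": the likelihood that a child
# cluster's reads are all error copies of the parent, given user-supplied
# per-base rates. It ignores which positions the edits land on (no
# combinatorial factor) and treats reads and positions as independent.
def read_error_likelihood(n_sub, n_ins, n_del, barcode_len,
                          p_sub=0.001, p_ins=0.0005, p_del=0.0005):
    # Probability that a single read accumulates exactly these edits.
    p_ok = (1 - p_sub - p_ins - p_del) ** (barcode_len - n_sub - n_ins - n_del)
    return (p_sub ** n_sub) * (p_ins ** n_ins) * (p_del ** n_del) * p_ok

def merge_likelihood(child_count, n_sub, n_ins, n_del, barcode_len=20):
    # If reads are independent, the likelihood that the whole child cluster is
    # made of error copies of the parent is the product over its reads.
    return read_error_likelihood(n_sub, n_ins, n_del, barcode_len) ** child_count

# e.g. a 2-read child sitting one substitution away from the parent:
print(merge_likelihood(child_count=2, n_sub=1, n_ins=0, n_del=0))
```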

To generate per-sample error rates, it shouldn't be too hard to write a utility that analyzes a fixed portion of the amplicon. Such a region should be readily available for most applications of starcode (for DNAseq!): the utility would take a sample of sequences that should all be identical, compare them to the reference, and output the measured error rate for each mechanism.
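
A bare-bones sketch of what I have in mind (hypothetical, nothing like this ships with starcode): align each read of a known constant region against the reference with a plain edit-distance backtrace and tally the edits by type.

```python
# Hypothetical utility sketch: compare reads of a known constant region to the
# reference and tally per-base substitution / insertion / deletion rates.
# The alignment is a bare-bones edit-distance backtrace; the reference and the
# reads below are toy data.
from collections import Counter

def count_edits(ref, read):
    n, m = len(ref), len(read)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # deletion
                          d[i][j - 1] + 1,                                   # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != read[j - 1]))     # (mis)match
    edits = Counter()
    i, j = n, m
    while i > 0 or j > 0:  # backtrace, classifying each difference
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != read[j - 1]):
            if ref[i - 1] != read[j - 1]:
                edits["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits["deletion"] += 1
            i -= 1
        else:
            edits["insertion"] += 1
            j -= 1
    return edits

reference = "ACGTACGTACGTACGTACGT"
reads = ["ACGTACGTACGTACGTACGT",   # perfect read
         "ACGAACGTACGTACGTACGT",   # one substitution
         "ACGTACGTACGTACGACGT"]    # one deletion
totals = Counter()
for r in reads:
    totals.update(count_edits(reference, r))
bases = len(reference) * len(reads)
for mechanism in ("substitution", "insertion", "deletion"):
    print(f"{mechanism} rate: {totals[mechanism] / bases:.3%}")
```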


To summarize: if I have some spare energy, I think I might go about simulating some dense barcode error clouds and calculating some statistics between clusters with known parents.

Any thoughts, comments, or guidance?