Linclust fails to cluster sequences with a single mismatch

AyushSaxena commented 3 years ago

I'm currently testing linclust using the easy-linclust workflow with very small datasets (100 / 1000 sequences). Each sequence is a nucleotide sequence, on average, 3000 basepairs long. So using 20 kmers, and an automatically generated kmer size (of 17) must find overlaps between these two sequences. And yet I see several sequence pairs that have 1-5 mismatches between them (0.1% error rate) that are missed and classified as single member clusters. They should not be disqualified by the min-seq-id or -c values that I'm using (of 0.99), which are stringent in the general case.

Is there anything I could use to increase sensitivity at this stage?

AnnSeidel commented 3 years ago

Hi, can you report the exact command you used and your MMseqs2 version. Would it also be possible to share your sequences?

AyushSaxena commented 3 years ago

Hi Ann,

Thank you for the quick response. I'm currently using - f349118312919c4fcc448f4595ca3b3a387018e2 - which was the latest version as of last month. The entire point of this exercise, for me, is to understand, algorithmically, why linclust-only misses clusters that it perhaps should cluster. The tool works great overall!

I'm using a fairly strict criterion for clustering using linclust, attempting to do as much as I can before I use mmseqs. The parameters are -> -e 0.1 --min-seq-id 0.99 --cov-mode 3 -c 0.99 --kmer-per-seq 30 --kmer-per-seq-scale 0.3 -k 50 --cluster-mode 1 --max-iterations 50000 --seq-id-mode 2 --threads 16

I have used comically large values of (e / kmer-per-seq) and I still find thousands of such pairs that should have been clustered.

Unfortunately, I cannot share the sequences because I don't own them. But I could report further that if I only use the two sequences (that are being missed in a population of sequences) on their own, linclust clusters them using the same pipeline. Next, I used the same two clusters with an 'outgroup' sequences, and I found the expected outcome - the two close by sequences were clustered whereas the outgroup was not. I begin to lose sensitivity as I start using more and more sequences in the population, which leads me to believe that the loss of sensitivity could be down the pre-filters/heuristics/e-value.

If I manually 'cascade' the representative sequences from the first step using linclust again (instead of mmseqs, hence 'cascade'), I don't gain any appreciable sensitivity. So when provided a large population, linclust is missing these pairs many times (I tried to 'cascade' several times, but using linclust alone.

What I find surprising is that these two sequences must share at least one kmer (in fact they must share many), so they must have been compared to each other, and they qualify all hard filters (coverage / alignment score), and yet they were missed. It's also possible both of them were compared to a third 'representative' sequence of a different cluster, and both of them did not make the cut. Still, they should have been compared to each other in a cluster of two because of their shared kmers.

Is their a temporary file that documents which kmers were selected from a particular sequence because I would like to get my hands dirty with that.

Thank you! Ayush

milot-mirdita commented 3 years ago

I am not sure we deal well with 50-mers, the default nucleotide k-mer size is 14 or 15 (depending on the database size). Also, we have predefined spaced-kmer patterns only up to 30-mers, after that no spaced-kmers are used anymore. Could you check if something more sensible happens if you leave out the k-mer size parameter?

We have commented out debug code interspersed in a lot of places to track down issues like that, however that's not very easy to use. Could you try to rule out the k-mer size issue first?

AyushSaxena commented 3 years ago

Hi Milot,

Thank you once again for the quick follow up. I've conducted a grid search to understand this issue myself, and I have used a kmer size of 10, with or without the spaced kmer mode previously, and found the same issue.

Here are the parameters used for that run - --kmer-per-seq-scale 0 --min-seq-id 0.99 --cov-mode 3 -c 0.99 --cluster-mode 2 --spaced-kmer-mode 0 -k 10 --threads 16 --adjust-kmer-len 0 --ignore-multi-kmer 0

In other attempts, I have not used the kmer size parameter, and the pipeline automatically chooses 17 in that case, and the issue persists.

In these pipelines with -c set to 0.99, and min-seq-id set to 0.99, I get excellent specificity (with a few or no clusters that have members that shouldn't be there), but about ~40% of all the representative sequences have another representative sequence in vicinity (Levenshtein distance of less than 10 out of 2500-3000 bp). This fraction drops to about ~15% if I use min-seq-id of 0.95, so I do find a gain in sensitivity. But 15% of all representative sequences is still quite a high number. What I don't understand is why those representative sequences weren't linked together with an edge when they check all boxes.

AnnSeidel commented 3 years ago

Hi Ayush,

your --cov-mode 3 parameter might also be an issue. If your sequences vary in the range of 2500-3000 bp we can expect that the sequence length of many potential cluster members are smaller than 99% of the sequence length of the cluster representative and are therefore not considered and probably form their own clusters. Do you mean by any chance that the target sequence has to be covered by 99% (--cov-mode 2)?

AyushSaxena commented 3 years ago

Hi Ann,

Thank you for suggesting that. I know that using --cov-mode 3 paired with -c 0.99 is quite strict. When we relaxed it, we were finding 'false positive' clusters, where the cluster head/representatives (the largest sequence in the cluster) had novel enough genetic information in it, that we would have preferred it to be it's own cluster (with a solitary member).

That being said, in my grid search, I have only explored varying -c values with a constant --cov-mode 3.

The sequence pairs that I am finding that should be clustered together (and aren't) have nearly identical lengths (+-1) with a few mismatches (<10). It is my understanding that those should have been clustered regardless of the value of --cov-mode, even with the stringent -c 0.99.

To give you another example, even when I use a very lenient clustering criterion (please note the values of min-seq-id, -c) - --kmer-per-seq-scale 0 --min-seq-id 0.80 --cov-mode 3 -c 0.80 --cluster-mode 1 --spaced-kmer-mode 0 -k 10 --threads 16 --adjust-kmer-len 0 --ignore-multi-kmer 0

I still find representative sequence pairs that have a levenshtein distance of 1 between them. Although now, I find very few such pairs (8 / 1378 all-v-all alignment of representative sequences with a levenshtein distance of < 10). I understand these could be edge cases, but I find them interesting, and I want to learn what linclust is doing by focussing on these 8 alone.

(Making a long post even longer!) I'm validating linclust clusters by generating an all-v-all levenshtein distance matrix of representative clusters. And an all-v-all levenshtein distance matrix of all members within a cluster. Because my understanding of linclust is not great, I'm treating it as a black box that I intend to understand.

AnnSeidel commented 3 years ago

okay, from this point I can only guess without having the sequences. Linclust peforms an ungapped alignment to pre-cluster in the first step and uses gapped alignments later. Is it the case that your sequences would also fullfill the seqid and coverage thresholds for the ungapped alignment already? If so, can you check for two of your sequences that should have be clustered together and have only very few mismatches between them (no gaps), if they are in the same set in the pref database (kmermatcher result) and pref_rescore1.x database (ungapped alignment result) in the temporary directory linclust creates?

AyushSaxena commented 3 years ago

Hi Ann,

Thank you for guiding me so far. I did a bit of digging around in the temporary files. I'll call the two close sequences with one mismatch between them seqA and seqB. What I found was that they were both found within a larger cluster in the pref database with seqC (cluster head). seqC is about 20% different to both seqA and seqB, but their ungapped alignment scores may not reflect that.

In the pref.tsv (createtsv pref) seqC seqB 5 0 seqC seqB 5 0

In the pref_rescore1.index file (pref_rescore1.1 - pref_rescore1.16 are in binary format, and I wasn't sure if that is the one you are referring to), I only see two rows that contain seqA and seqB

seqA seqA 100 0 seqB seqB 100 0

After the pipeline finishes, all three sequences seqA-C end up being separate clusters.

Also, I'm using 10000 unique sequences in the input, so should one expect 10000-choose-2 pairs in the pref database? The total number of rows in the pref database is 69737, which is 0.13% of the total number of pairs.

I had run linclust with these parameters in this run- -e 1 --min-seq-id 0.99 --cov-mode 3 -c 0.99 --kmer-per-seq 100 --kmer-per-seq-scale 0 -k 15 --cluster-mode 1 --max-iterations 50000 --seq-id-mode 2 --threads 16 --remove-tmp-files 0

soedinglab / MMseqs2

Linclust fails to cluster sequences with a single mismatch #489