Open humburg opened 7 years ago
I've seen more examples of this, and in at least some cases it is obvious that two different sequences were merged into the same cluster. In response, I've made the initial similarity check more restrictive (see 63f2b44379). It now checks whether the prefixes of the two sequences (10 characters by default, as before) contain more mismatches than would be expected across the entire sequence at a reasonable error rate, so the threshold now depends on the sequence length. For now I've set the threshold to 2% of the full sequence length, but that may need some tweaking.
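To make the rule concrete, here is a minimal Python sketch of a length-dependent prefix check. It is not the code from 63f2b44379: the function name `prefix_similar`, its parameters, and the choice to use the shorter of the two reads as the "full sequence length" are all assumptions made for illustration.

```python
def prefix_similar(seq_a, seq_b, prefix_len=10, max_error_rate=0.02):
    """Return True if the two prefixes are similar enough to attempt a merge.

    Hypothetical helper: the allowed number of prefix mismatches is derived
    from the full sequence length, not from the prefix length.
    """
    # Assumption: use the shorter read as the reference length.
    full_len = min(len(seq_a), len(seq_b))
    # Mismatches expected over the whole read at the given error rate.
    allowed = max_error_rate * full_len
    # Compare only the leading prefix_len characters.
    n = min(prefix_len, full_len)
    mismatches = sum(1 for a, b in zip(seq_a[:n], seq_b[:n]) if a != b)
    return mismatches <= allowed
```

With 100-base reads this allows up to two mismatches in the 10-character prefix, and longer reads tolerate proportionally more, which is one reason the 2% figure may need tuning.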
I noticed this consensus sequence:
It looks like the six sequences that contributed to it form two groups, of four and two reads respectively. As far as I can tell from the diff, the sequences of these two sub-groups are only about 50% identical. The sequence does have quite a lot of homopolymer runs of varying length, and many of the differences seem to fall at the edges of these. This is a fairly common error mode, so it is likely that these are actual errors (given that the UID was the same for all of them), but it may be good to guard against potential UID clashes a bit more comprehensively.
One option would be to abort merging a sequence into an existing cluster when it becomes clear that it deviates too much. Making use of #9 seems advisable, but this would also require the ability to roll back potential changes to the consensus. Additionally, as a quick initial fix, it would be good to flag these as potentially low quality sequences (#4 also has some suggestions for this).
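As a rough illustration of the abort-and-roll-back idea, the Python sketch below stages a merge on a copy of the per-position base counts and only commits it if the read stays within a mismatch limit. Everything here is hypothetical: the `Cluster` class, `try_merge`, and the 10% `max_mismatch_fraction` default are not taken from the existing code, and the comparison is ungapped, whereas a real implementation would presumably work on the alignment from #9.

```python
from collections import Counter


class Cluster:
    """Hypothetical cluster that keeps per-position base counts."""

    def __init__(self, first_read):
        self.counts = [Counter(base) for base in first_read]

    def consensus(self):
        # Most frequent base at each position.
        return "".join(c.most_common(1)[0][0] for c in self.counts)

    def try_merge(self, read, max_mismatch_fraction=0.1):
        """Merge read into the cluster, or leave the cluster untouched.

        Returns False as soon as the read deviates too much from the
        current consensus. The limit is an assumed value, not a tuned one.
        """
        limit = max_mismatch_fraction * len(self.counts)
        # Work on a copy so that aborting needs no explicit undo step.
        staged = [c.copy() for c in self.counts]
        mismatches = 0
        for pos, base in enumerate(read[:len(staged)]):
            if base != staged[pos].most_common(1)[0][0]:
                mismatches += 1
                if mismatches > limit:
                    return False  # abort: self.counts was never modified
            staged[pos][base] += 1
        self.counts = staged  # commit the staged update
        return True
```

A caller could treat a `False` return as the trigger for starting a new cluster or for flagging the read as potentially low quality. Copying the counts for every merge is the simplest way to get rollback; an undo log of touched positions would avoid the copy for long reads.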