Repeats can cause underclustering

This is an interesting issue (but broke me for one day).

Consider the following dummy alignment:

>1
TTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAATTTTTTTAAAAAAA-------
>2
-------TTTTTTTAAAAAAATTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAA
>3
TTTTTTTAAAAAAATTTTTTTAAAAAAATTTTTTT-------GGGGGGG-------AAAAAAA

If you take the Hamming distance between any pair, it's high, but if you count all the 7-mers for each, the kmer counts are identical. (They are: {'TTTTTTT': 3.0, 'TTTTTTG': 1.0, 'TTTTTGG': 1.0, 'TTTTGGG': 1.0, 'TTTGGGG': 1.0, 'TTGGGGG': 1.0, 'TGGGGGG': 1.0, 'GGGGGGG': 1.0, 'GGGGGGA': 1.0, 'GGGGGAA': 1.0, 'GGGGAAA': 1.0, 'GGGAAAA': 1.0, 'GGAAAAA': 1.0, 'GAAAAAA': 1.0, 'AAAAAAA': 3.0, 'AAAAAAT': 2.0, 'AAAAATT': 2.0, 'AAAATTT': 2.0, 'AAATTTT': 2.0, 'AATTTTT': 2.0, 'ATTTTTT': 2.0, 'TTTTTTA': 2.0, 'TTTTTAA': 2.0, 'TTTTAAA': 2.0, 'TTTAAAA': 2.0, 'TTAAAAA': 2.0, 'TAAAAAA': 2.0})

The alignment wouldn't actually look like this, it's the best simple example I could come up with. In practice, I saw this happen in MSAs of 5,800 sequences of 23kb each around EBA175 gene in P falciparum. Ie, looking at some repeat-rich (Plasmo ATs) subalignments the algorithm says they are not 'one-reference-like', so asks kmeans to cluster with increasing number of clusters; but kmeans when looking at the kmers, finds a smaller number of distinct data points due to same kmer counts, and cannot even produce that many clusters.

This leads to empty clusters, due to another code bug which let empty clusters be produced in the above case, and then prg construction fails.

iqbal-lab-org / make_prg

Repeats can cause underclustering #24