Closed frank-stonybrook closed 2 years ago
Indeed, by default all sequences within a cluster should have the same length because of the Hamming distance criterion we apply. However, some errors can still occur because we use a hashing script to identify pairs of sequences with HD <= 1. The hashing function takes the odd and even positions of each sequence and stores them in buckets. The exact HD is then computed only within each bucket, which drastically reduces the total number of pairwise comparisons. Generally this works very well, but it is prone to small errors for very short sequences.
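To make the bucketing idea concrete, here is a minimal Python sketch. It is an illustration only, not the actual ClusTCR code: the function names `hamming` and `hd1_pairs` and the toy sequences are hypothetical. It relies on the fact that two equal-length sequences differing at exactly one position must agree on either all even-indexed or all odd-indexed characters, so they always share at least one bucket.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Exact Hamming distance, or None when the lengths differ (HD is undefined)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def hd1_pairs(sequences):
    """Return all pairs of distinct sequences with Hamming distance <= 1.

    Hypothetical helper for illustration; not ClusTCR's implementation.
    """
    buckets = defaultdict(set)
    for seq in set(sequences):
        # One key from the even-indexed characters, one from the odd-indexed ones.
        buckets[("even", seq[0::2])].add(seq)
        buckets[("odd", seq[1::2])].add(seq)

    pairs = set()
    for members in buckets.values():
        # Exact comparison only among sequences that share a bucket.
        for a, b in combinations(sorted(members), 2):
            d = hamming(a, b)
            if d is not None and d <= 1:
                pairs.add((a, b))
    return pairs


toy = ["CASSL", "CASSV", "CATSL", "CASSLG"]
print(sorted(hd1_pairs(toy)))
```

In this sketch, `CASSL` and `CASSLG` land in the same "even" bucket (`CSL`) despite their different lengths, so the explicit length check in `hamming` is what keeps them apart. Whether something similar explains the mixed-length clusters in ClusTCR is an assumption that would need to be checked against the actual code.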
Can you check whether the clusters that contain sequences of different length indeed include short sequences?
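One way to run this check is sketched below. It assumes the clustering result has been exported as plain `(sequence, cluster_id)` pairs; the function name `mixed_length_clusters` and the toy data are hypothetical and not part of ClusTCR.

```python
from collections import defaultdict


def mixed_length_clusters(assignments):
    """Map each cluster id to its sorted distinct sequence lengths,
    keeping only clusters that contain more than one length.

    `assignments` is an iterable of (sequence, cluster_id) pairs (assumed format).
    """
    lengths = defaultdict(set)
    for seq, cluster_id in assignments:
        lengths[cluster_id].add(len(seq))
    return {cid: sorted(ls) for cid, ls in lengths.items() if len(ls) > 1}


# Toy data for illustration only.
toy = [("CAR", 0), ("CARD", 0), ("CASSL", 1), ("CASSV", 1)]
print(mixed_length_clusters(toy))
```

If the reported lengths are all small, that would be consistent with the short-sequence explanation above.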
Dear ClusTCR developer,
I am using ClusTCR's MCL method to cluster about 3k CDRH3 sequences and found that a few clusters contain sequences of different lengths. As I understand it, the similarity metric, Hamming distance, is only defined for sequences of the same length, which should prevent sequences of unequal length from being placed in one cluster. Am I missing something here? Thanks
Here is the sequence file: seq_3k.txt