svalkiers / clusTCR

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

Unequal-length sequences being clustered together #38

Closed frank-stonybrook closed 2 years ago

frank-stonybrook commented 2 years ago

Dear ClusTCR developer,

I am using ClusTCR's MCL method to cluster about 3k CDRH3 sequences and found that a few clusters contain sequences of different lengths. Based on my understanding, the similarity metric, Hamming distance, is only defined for sequences of the same length, which should prevent sequences of unequal length from being placed in one cluster. Am I missing something here? Thanks

import pandas as pd
from clustcr import Clustering

df_3k = pd.read_csv("seq_3k.txt")
clustering = Clustering(n_cpus=3, method="mcl")
output = clustering.fit(df_3k["CDRH3_sequence"])
df_seq = output.clusters_df

# Collect the clusters whose member sequences do not all share one length
abnormal = []
for c_idx, group in df_seq.groupby("cluster"):
    if group["CDR3"].str.len().nunique() != 1:
        abnormal.append(c_idx)
print("abnormal cluster index:", abnormal)

Here is the sequence file: seq_3k.txt

svalkiers commented 2 years ago

Indeed, by default all sequences within a cluster should have the same length because of the Hamming distance (HD) criterion we apply. However, some degree of error may still be present because we use a hashing script to identify pairs of sequences with HD <= 1. The hashing function takes all the odd and even positions of the sequence and stores them in a bucket. Exact HD computation is then performed within each bucket, which drastically reduces the total number of pairwise comparisons. Generally this works very well, but it is prone to small errors for very short sequences.
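
To make the bucketing idea concrete, here is a minimal sketch of one way such a scheme could work. It is not ClusTCR's actual code: the function names (`hamming`, `candidate_buckets`, `pairs_within_one`) and the toy sequences are hypothetical. Two equal-length sequences that differ at one position always share either their even-position or their odd-position substring, so comparing only within buckets loses no HD <= 1 pairs. The sketch also shows that short sequences of different lengths (here `CAS` and `CASS`) can land in the same bucket, so the exact step needs a length check to keep them apart.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Exact Hamming distance, or None when lengths differ (undefined)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def candidate_buckets(sequences):
    """Group sequences by their even-position and odd-position substrings."""
    buckets = defaultdict(set)
    for seq in sequences:
        buckets[("even", seq[0::2])].add(seq)
        buckets[("odd", seq[1::2])].add(seq)
    return buckets


def pairs_within_one(sequences):
    """Return pairs with Hamming distance <= 1, comparing only within buckets."""
    pairs = set()
    for members in candidate_buckets(sequences).values():
        for a, b in combinations(sorted(members), 2):
            d = hamming(a, b)
            if d is not None and d <= 1:
                pairs.add((a, b))
    return pairs


# Toy sequences, made up for illustration only
toy = ["CASSL", "CASSF", "CARSL", "CAS", "CASS"]
buckets = candidate_buckets(toy)
pairs = pairs_within_one(toy)
```

In this toy set, `CAS` and `CASS` share the even-position key `CS`, but the length check in `hamming` keeps them from being reported as a pair.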

Can you check whether the clusters that contain sequences of different length indeed include short sequences?
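
One way to run that check is to list the sequence lengths found in each mixed-length cluster. The helper below is only a sketch: `mixed_length_clusters` is a hypothetical name, and the rows are toy data, not taken from seq_3k.txt.

```python
from collections import defaultdict


def mixed_length_clusters(rows):
    """Map each cluster holding more than one sequence length to its sorted lengths."""
    lengths = defaultdict(set)
    for cluster, seq in rows:
        lengths[cluster].add(len(seq))
    return {c: sorted(ls) for c, ls in lengths.items() if len(ls) > 1}


# Toy (cluster, sequence) rows, made up for illustration only
rows = [(0, "CASSL"), (0, "CASSF"), (1, "CAS"), (1, "CASS")]
report = mixed_length_clusters(rows)
```

With the data frame from the original post, the rows could be supplied as `zip(df_seq["cluster"], df_seq["CDR3"])`, assuming those column names. Small values in the result would point to the short-sequence case.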