svalkiers / clusTCR

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

Strange output #37

Closed deweihu96 closed 1 year ago

deweihu96 commented 2 years ago

Dear author,

I have 10,000 distinct CDR3 sequences, all of length 15. I ran the following code on them:

import pandas as pd
from clustcr import Clustering

# Read the CDR3 sequences from the first column of the input file
cdr3 = pd.read_csv('15.txt').iloc[:,0]

# Cluster the sequences (GPU enabled)
clustering = Clustering(use_gpu=True)
output = clustering.fit(cdr3)

# Export the edge list and the node list
edges = output.export_network(filename='15_edgelist.txt')
output.write_to_csv('15_nodelist.txt')

However, the output file contains only 5 clusters, and within each cluster the sequences differ by only one amino acid.

[image]

svalkiers commented 2 years ago

This is indeed strange behaviour. A result like this would imply that the sequences you are trying to cluster are too distant from each other in terms of Hamming distance (HD). If ClusTCR cannot detect enough pairs of sequences with HD = 1, the resulting network will be very small. This problem is especially apparent when working with small data sets of long sequences.
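
To check whether a data set actually contains enough HD = 1 pairs, something like the sketch below may help. It is plain Python and does not use ClusTCR at all; the function name `hamming_one_pairs` and the toy sequences are made up for illustration. The idea is to remove one position at a time from each sequence and group sequences that become identical, which avoids comparing every pair directly.

```python
from collections import defaultdict
from itertools import combinations


def hamming_one_pairs(sequences):
    """Return the set of pairs of distinct, equal-length sequences
    that differ at exactly one position (Hamming distance 1)."""
    buckets = defaultdict(set)
    for seq in set(sequences):
        for i in range(len(seq)):
            # Sequences sharing this key are identical everywhere except
            # (possibly) at position i.
            key = (len(seq), i, seq[:i] + seq[i + 1:])
            buckets[key].add(seq)

    pairs = set()
    for members in buckets.values():
        # Members are distinct strings, so each pair differs at exactly
        # one position.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical toy input, not taken from the issue.
toy = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQGYEQYF", "CATSRDTQYFGGG"]
pairs = hamming_one_pairs(toy)
connected = {seq for pair in pairs for seq in pair}
print(len(pairs), "pairs;", len(connected), "of", len(set(toy)), "sequences have a neighbour")
```

If only a small fraction of the input sequences have a neighbour at HD = 1, a very small network like the one reported here is what one would expect.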

You can try using the MCL method only; this may slightly improve your clustering results. In addition, we will consider more flexible solutions in future releases, in which the allowed edit distance is larger for longer sequences. I'll also gladly look at the problem in more detail if you could provide me with (a sample of) the data you are using.
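
As a rough illustration of what a length-dependent threshold could look like, here is a minimal sketch. It is not ClusTCR code and not a planned implementation: the names `max_distance` and `build_edges`, the cutoff of 12 residues, and the allowance of 2 mismatches are all assumptions chosen for the example. It uses a simple all-pairs comparison, which is quadratic in the number of sequences and so only practical for modest data sets.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def max_distance(length, base_length=12):
    """Hypothetical rule: allow 1 mismatch for sequences up to
    base_length residues, and 2 mismatches for longer ones."""
    return 1 if length <= base_length else 2


def build_edges(sequences):
    """Connect equal-length sequences whose Hamming distance is within
    the length-dependent threshold (all-pairs comparison)."""
    unique = sorted(set(sequences))
    edges = []
    for a, b in combinations(unique, 2):
        if len(a) == len(b) and hamming(a, b) <= max_distance(len(a)):
            edges.append((a, b))
    return edges


# Hypothetical toy sequences of length 15, not taken from the issue.
long_seqs = ["CASSLGQGAYNEQFF", "CASSLGQGTYNEQFF", "CASSPGQGTYNEQFF"]
long_edges = build_edges(long_seqs)
print(long_edges)
```

With this rule, the first and third toy sequences (which differ at two positions) are connected, whereas a fixed threshold of 1 would leave them unlinked.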

svalkiers commented 1 year ago

Closing this issue due to inactivity.