Closed pwwang closed 2 years ago
With the latest version, installed via pip install git+https://github.com/svalkiers/clusTCR.git @ b6181181fa9bb3dd9bf875ebf3c711ba6930c664:
>>> import clustcr as ct
>>> cdr3 = ct.datasets.test_cdr3()
>>> clustering = ct.Clustering()
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
CDR3 cluster
0 CASTPQGAYEQYF 0
1 CASTPTGAYEQYF 0
2 CASSLGQIEQYF 1
3 CASSLGQKEQYF 1
4 CASSLGQGEQYF 1
.. ... ...
789 CASSEGSQEVFF 237
790 CSARAGGGEAKNIQYF 238
791 CSARASGGEAKNIQYF 238
792 CASSDSGTDTQYF 239
793 CASSLSGTDTQYF 239
[794 rows x 2 columns]
>>> clustering = ct.Clustering(method="mcl")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
CDR3 cluster
0 CASTPQGAYEQYF 0
1 CASTPTGAYEQYF 0
2 CASSLGQIEQYF 1
3 CASSLGQKEQYF 1
4 CASSLGQGEQYF 1
.. ... ...
789 CASSEGSQEVFF 237
790 CSARAGGGEAKNIQYF 238
791 CSARASGGEAKNIQYF 238
792 CASSDSGTDTQYF 239
793 CASSLSGTDTQYF 239
[794 rows x 2 columns]
>>> clustering.method
'MCL'
>>> clustering = ct.Clustering(method="faiss")
>>> clustering.method
'FAISS'
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
CDR3 cluster
0 CASSYLPGQGDHYSNQPQHF 0
1 CASSFEAGQGFFSNQPQHF 0
2 CASSFEPGQGFYSNQPQHF 0
3 CASSYEPGQVSHYSNQPQHF 0
4 CASSFGVEDEQYF 0
... ... ...
3387 CATSDVNGAYEQYF 0
3388 CSARGGSVFYEQYF 0
3389 CSARGGERFYEQYF 0
3390 CASSASTSDYSYEQYF 0
3391 CASSDLTGTAYNEQFF 0
[3392 rows x 2 columns]
The faiss method even resulted in all seqs being clustered to 0.
>>> import importlib.metadata
>>> importlib.metadata.version("clustcr")
'0+untagged.267.gb618118'
Hi, thanks for using ClusTCR. I'll try to provide a comprehensive answer to any of your questions:
Where are the rest 2851 - 641 = 2210 sequences?
To answer your first question: ClusTCR takes all sequences into account, but not every sequence belongs to a cluster. This is inherent to the clustering procedure. In its second pass, ClusTCR builds a network in which an edge is drawn between two sequences only if they are at most a Hamming distance of 1 apart (i.e. they differ by at most one amino acid substitution). Sequences that have no such connection are not part of the network and are therefore considered outliers. As such, they are not reported in the clustering results.
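To make the outlier behaviour concrete, here is a minimal pure-Python sketch of the idea described above. It is not ClusTCR's actual implementation: the helper names (hamming, split_connected_and_outliers) are hypothetical, and the last sequence in the toy list is made up for illustration. It only shows why a sequence without any neighbour at Hamming distance 1 or less never ends up in the results.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions; return None when lengths differ,
    because Hamming distance is only defined for equal-length strings."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def split_connected_and_outliers(seqs, max_dist=1):
    """Hypothetical helper: separate sequences that have at least one
    neighbour within max_dist from those that have none."""
    connected = set()
    for a, b in combinations(seqs, 2):
        d = hamming(a, b)
        if d is not None and d <= max_dist:
            connected.update((a, b))
    outliers = [s for s in seqs if s not in connected]
    return sorted(connected), outliers


seqs = [
    "CASTPQGAYEQYF",
    "CASTPTGAYEQYF",
    "CASSLGQIEQYF",
    "CASSLGQKEQYF",
    "CAWGGGNTEAFF",  # made-up sequence with no close neighbour
]
connected, outliers = split_connected_and_outliers(seqs)
print(connected)
print(outliers)
```

In this toy input the first four sequences form two connected pairs, while the made-up sequence has no edge and would be left out of a clusters_df-style result.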
Also weird that different methods resulted in the same size of clusters_df.
The reason you see this result is that, with the default parameters of ClusTCR, the two-step approach and the MCL method give identical results for small data sets. The first pass, i.e. the faiss-based clustering, groups the sequences into large 'superclusters'. You can set the size of these superclusters by changing the faiss_cluster_size parameter of Clustering(). By default, this value is 5000. Since the test data contains fewer than 5000 sequences, they are all grouped into the same supercluster, to which the MCL approach is then applied. Consequently, the MCL approach reports the same results as the two-step method whenever your data set is smaller than faiss_cluster_size.
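As a rough illustration of this reasoning, the sketch below estimates how many superclusters a data set of a given size would be split into. The function name and the rounding rule (ceiling division, with a minimum of one) are assumptions made for this example and are not taken from the ClusTCR source; only the faiss_cluster_size parameter and its default of 5000 come from the explanation above.

```python
import math


def n_superclusters(n_sequences, faiss_cluster_size=5000):
    """Hypothetical estimate of the number of superclusters.
    Assumes ceiling division with at least one supercluster;
    the real rounding in ClusTCR may differ."""
    return max(1, math.ceil(n_sequences / faiss_cluster_size))


# A data set smaller than faiss_cluster_size ends up in one supercluster,
# so running MCL on it matches the two-step result.
small = n_superclusters(2851)

# A larger data set is split before MCL is applied within each group.
large = n_superclusters(12000)

print(small, large)
```

Under these assumptions, lowering faiss_cluster_size well below the size of your data set is what would make the two-step result start to differ from plain MCL.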
The faiss method even resulted in all seqs being clustered to 0.
See previous comment.
Hope this was helpful to you. If you have any further questions, please don't hesitate to ask; I will gladly answer them.
All the best, Sebastiaan
WOW! Super clear explanation! Appreciate it! 👍