Sequences loss after clustering?

pwwang commented 2 years ago

>>> import clustcr as ct
>>> cdr3 = ct.datasets.test_cdr3()
>>> cdr3.size
2851
>>> cdr3.unique().size
2851
>>> clustering = ct.Clustering()
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]
>>> sum(len(seqs) for seqs in output.cluster_contents())
642

Where are the remaining 2851 - 642 = 2209 sequences?

>>> clustering = ct.Clustering(method="mcl")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]

>>> clustering = ct.Clustering(method="two-step")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0     CASSLGQGHYNEQFF        0
1     CASSPGQGHYNEQFF        0
2     CASSSGTGPNEKLFF        1
3     CASTSGTGPNEKLFF        1
4     CASSPGTAPNEKLFF        1
..                ...      ...
637    CASSLQGSNQPQHF      199
638     CASSDSGTDTQYF      200
639     CASSLSGTDTQYF      200
640  CSARAGGGEAKNIQYF      201
641  CSARASGGEAKNIQYF      201

[642 rows x 2 columns]

It is also weird that different methods resulted in the same size of clusters_df.

>>> import importlib.metadata
>>> importlib.metadata.version("clustcr")
'0+untagged.115.gba1ad3c'

pwwang commented 2 years ago

With the latest version, installed via pip install git+https://github.com/svalkiers/clusTCR.git@b6181181fa9bb3dd9bf875ebf3c711ba6930c664:

>>> import clustcr as ct
>>> cdr3 = ct.datasets.test_cdr3()
>>> clustering = ct.Clustering()
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3 cluster
0       CASTPQGAYEQYF       0
1       CASTPTGAYEQYF       0
2        CASSLGQIEQYF       1
3        CASSLGQKEQYF       1
4        CASSLGQGEQYF       1
..                ...     ...
789      CASSEGSQEVFF     237
790  CSARAGGGEAKNIQYF     238
791  CSARASGGEAKNIQYF     238
792     CASSDSGTDTQYF     239
793     CASSLSGTDTQYF     239

[794 rows x 2 columns]

>>> clustering = ct.Clustering(method="mcl")
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                 CDR3  cluster
0       CASTPQGAYEQYF        0
1       CASTPTGAYEQYF        0
2        CASSLGQIEQYF        1
3        CASSLGQKEQYF        1
4        CASSLGQGEQYF        1
..                ...      ...
789      CASSEGSQEVFF      237
790  CSARAGGGEAKNIQYF      238
791  CSARASGGEAKNIQYF      238
792     CASSDSGTDTQYF      239
793     CASSLSGTDTQYF      239

[794 rows x 2 columns]

>>> clustering.method
'MCL'
>>> clustering = ct.Clustering(method="faiss")
>>> clustering.method
'FAISS'
>>> output = clustering.fit(cdr3)
>>> output.clusters_df
                      CDR3  cluster
0     CASSYLPGQGDHYSNQPQHF        0
1      CASSFEAGQGFFSNQPQHF        0
2      CASSFEPGQGFYSNQPQHF        0
3     CASSYEPGQVSHYSNQPQHF        0
4            CASSFGVEDEQYF        0
...                    ...      ...
3387        CATSDVNGAYEQYF        0
3388        CSARGGSVFYEQYF        0
3389        CSARGGERFYEQYF        0
3390      CASSASTSDYSYEQYF        0
3391      CASSDLTGTAYNEQFF        0

[3392 rows x 2 columns]

The faiss method even resulted in all sequences being assigned to cluster 0.

>>> import importlib.metadata
>>> importlib.metadata.version("clustcr")
'0+untagged.267.gb618118'

svalkiers commented 2 years ago

Hi, thanks for using ClusTCR. I'll try to provide a comprehensive answer to each of your questions:

Where are the remaining 2851 - 642 = 2209 sequences?

To answer your first question: ClusTCR takes all sequences into account, but not every sequence belongs to a cluster. This is an inherent result of the clustering procedure. In its second pass, ClusTCR builds a network in which an edge is drawn between two sequences only if their Hamming distance is at most 1 (that is, they differ by at most one amino acid substitution). Sequences that have no such connection are not part of the network and are therefore considered outliers. As such, they are not reported in the clustering results.
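To make the outlier rule concrete, here is a minimal pure-Python sketch of the idea described above. It is not ClusTCR's implementation: the helper names `hamming` and `split_clustered` are hypothetical, the brute-force pairwise comparison is only suitable for tiny inputs, and it only checks whether a sequence has at least one neighbour, without running MCL on the resulting network. The sequences are taken from the output shown earlier in this thread.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def split_clustered(seqs, max_dist=1):
    """Return (clustered, outliers): sequences with and without a neighbour.

    Hypothetical helper for illustration; a sequence counts as 'clustered'
    if at least one other sequence lies within `max_dist` of it.
    """
    connected = set()
    for a, b in combinations(seqs, 2):
        # Sequences of different length can never be Hamming neighbours.
        if len(a) == len(b) and hamming(a, b) <= max_dist:
            connected.update((a, b))
    clustered = [s for s in seqs if s in connected]
    outliers = [s for s in seqs if s not in connected]
    return clustered, outliers


seqs = [
    "CASSLGQGHYNEQFF",
    "CASSPGQGHYNEQFF",  # one substitution away from the sequence above
    "CASSDSGTDTQYF",
    "CASSLSGTDTQYF",    # one substitution away from the sequence above
    "CASSLQGSNQPQHF",   # no neighbour in this toy list
]
clustered, outliers = split_clustered(seqs)
print(len(clustered), outliers)
```

In this toy list, four sequences have a neighbour and one does not, so only four would show up in a table like clusters_df. The same mechanism would explain why most of the 2851 test sequences are missing from the output.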

It is also weird that different methods resulted in the same size of clusters_df.

The reason you see these results is that, with the default parameters of ClusTCR, the two-step approach and the MCL method give identical results for small data sets. That is because the first pass, i.e. the faiss-based clustering, groups the sequences into large 'superclusters'. You can define the size of these superclusters by changing the faiss_cluster_size parameter of Clustering(). By default, this value is set to 5000. Since the number of sequences in the test data is < 5000, they are all grouped into the same supercluster, on which the MCL approach is then applied. Consequently, the two-step method reports the same results as the MCL approach whenever your data set is smaller than faiss_cluster_size.
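The arithmetic behind this can be sketched as follows. The exact formula ClusTCR uses to derive the number of superclusters is not given in this thread, so the ceiling division below and the function name `n_superclusters` are assumptions made purely for illustration. Only the parameter name faiss_cluster_size and its default of 5000 come from the explanation above.

```python
import math


def n_superclusters(n_sequences: int, faiss_cluster_size: int = 5000) -> int:
    """Rough estimate of how many superclusters the first pass would create.

    Assumption (not taken from the ClusTCR source): about one supercluster
    per `faiss_cluster_size` sequences, and never fewer than one.
    """
    return max(1, math.ceil(n_sequences / faiss_cluster_size))


# Test data from this thread: 2851 sequences, default supercluster size.
print(n_superclusters(2851))

# A smaller supercluster size would split the same data into several groups.
print(n_superclusters(2851, faiss_cluster_size=500))
```

Under this assumption the test data ends up in a single supercluster with the default setting, so the second (MCL) pass sees exactly the same input as the plain MCL method. A smaller faiss_cluster_size would be needed before the two methods could diverge.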

faiss method even resulted in all seqs being clustered to 0.

See previous comment.

Hope this was helpful to you. If you have any questions, please don't hesitate to address them to me, I will gladly answer them.

All the best, Sebastiaan

pwwang commented 2 years ago

WOW! Super clear explanation! Appreciate it! 👍

svalkiers / clusTCR

Sequences loss after clustering? #32