svalkiers / clusTCR

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

Retrieve the centroids #51

Open Ch-rode opened 1 year ago

Ch-rode commented 1 year ago

Hello! Thanks for this amazing library. Is there a way to retrieve only the centroids for each cluster? Are they maybe the first sequence in each cluster (i.e. row 0 from cluster 0)? Thanks a lot

svalkiers commented 1 year ago

Hi, thank you for using ClusTCR! The centroids are computed during the first step of the algorithm (i.e. the K-means), which uses a vectorized representation of each sequence to group them in Euclidean space. The centroids are initialized randomly in this space, and their locations are optimized over the iterations of the algorithm. As such, they shouldn't be viewed as sequences, but rather as vectors in the n-dimensional feature space (where n is the number of features per sequence).
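To make the distinction concrete, here is a minimal, self-contained sketch in plain Python. It is not ClusTCR's code or API: the `featurize` function (amino acid frequencies, so n = 20), the `kmeans` helper, the `representatives` helper, and the toy sequences are all hypothetical and only serve as illustration. For simplicity, this sketch also initializes the centroids from randomly sampled data points, which is an assumption of the example. It shows that each centroid is a vector of floats, and that if a sequence is needed, one option is to pick the cluster member closest to the centroid as a stand-in.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(seq):
    """Hypothetical featurization: amino acid frequency vector (n = 20)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels).

    Assumption of this sketch: centroids start at k randomly
    sampled data points.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def representatives(seqs, vectors, centroids, labels):
    """For each non-empty cluster, return the member sequence nearest its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: sq_dist(vectors[i], centroid))
            reps[c] = seqs[best]
    return reps


# Toy CDR3 sequences, made up for illustration only.
seqs = [
    "CASSLGQAYEQYF",
    "CASSLGQGYEQYF",
    "CASSLGQSYEQYF",
    "CASSPGTGGNEQFF",
    "CASSPGTGANEQFF",
]
vectors = [featurize(s) for s in seqs]
centroids, labels = kmeans(vectors, k=2)
reps = representatives(seqs, vectors, centroids, labels)

print(len(centroids[0]))  # each centroid is a 20-dimensional vector, not a string
print(reps)               # nearest real sequence per cluster
```

Note that such a representative is only the member nearest to the centroid in feature space. It depends entirely on the chosen featurization and is not the centroid itself.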