sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Optimization of the threshold parameter in hierarchical clustering #81

Open CharlineJnnt opened 1 year ago

CharlineJnnt commented 1 year ago

Hello @sidhomj,

I used the unsupervised part of DeepTCR to cluster TCR sequences, but when I let the method determine the optimal threshold parameter with the following command, I got this error:

DTCRU_test.Cluster(clustering_method="hierarchical", linkage_method="ward", criterion="distance", write_to_sheets=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 1054, in Cluster
    IDX = hierarchical_optimization(distances, features, method=linkage_method, criterion=criterion)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/functions/utils_u.py", line 52, in hierarchical_optimization
    sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 118, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
    % n_labels
ValueError: Number of labels is 2876. Valid values are 2 to n_samples - 1 (inclusive)
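
For reference, the same ValueError can probably be reproduced outside DeepTCR. The sketch below is only an illustration under assumptions: the data are random toy features (not TCR features), and it assumes that the optimization loop starts at a threshold of 0, as the original `t_list` quoted further down suggests. With `criterion="distance"` and `t = 0`, no two distinct points are merged, so every sample gets its own label and `silhouette_score` rejects the input.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

# Toy data standing in for the learned features (assumption, not real TCR data).
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 4))
Z = linkage(pdist(features), method="ward")

# At t = 0 no merge happens, so there are as many labels as samples.
labels = fcluster(Z, 0, criterion="distance")
n_labels = len(np.unique(labels))

try:
    silhouette_score(features, labels)
    raised = False
except ValueError as err:
    # silhouette_score requires 2 <= n_labels <= n_samples - 1
    raised = True
    message = str(err)
```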

To correct this, I modified the function hierarchical_optimization in the utils_u.py script in the DeepTCR/functions folder (l.44):

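# Imports inferred from the names used below (assumption about utils_u.py).
import numpy as np
import sklearn.metrics as skmetrics
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hierarchical_optimization(distances,features,method,criterion):
    Z = linkage(squareform(distances), method=method)
    # Changed: start at t = 1 instead of t = 0 (was np.arange(0, 100, 1)).
    t_list = np.arange(1, 100, 1)
    sil = []
    for t in t_list:
        IDX = fcluster(Z, t, criterion=criterion)
        # A single cluster has no defined silhouette score; record 0.0.
        if len(np.unique(IDX[IDX >= 0])) == 1:
            sil.append(0.0)
            continue
        # fcluster labels start at 1, so this mask keeps every sample.
        sel = IDX >= 0
        sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))

    # Re-cluster with the threshold that gave the highest silhouette score.
    IDX = fcluster(Z, t_list[np.argmax(sil)], criterion=criterion)
    return IDX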

and it works!
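
One caveat: starting at `t = 1` may still fail if no two samples are merged at that threshold, since the number of labels would again equal the number of samples. A possible more defensive variant is sketched below. The function name `hierarchical_optimization_guarded` is hypothetical and not part of DeepTCR; it simply skips any threshold whose label count falls outside the range `silhouette_score` accepts (2 to n_samples - 1). The toy data at the end are random and only there to show the call.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def hierarchical_optimization_guarded(distances, features, method, criterion):
    """Hypothetical variant: skip thresholds that silhouette_score rejects."""
    Z = linkage(squareform(distances), method=method)
    t_list = np.arange(0, 100, 1)
    sil = []
    for t in t_list:
        IDX = fcluster(Z, t, criterion=criterion)
        n_labels = len(np.unique(IDX))
        # Valid label counts are 2 .. n_samples - 1; otherwise record 0.0.
        if n_labels < 2 or n_labels > len(IDX) - 1:
            sil.append(0.0)
            continue
        sil.append(silhouette_score(features, IDX))
    # Re-cluster with the best-scoring threshold.
    return fcluster(Z, t_list[np.argmax(sil)], criterion=criterion)


# Toy usage with random features (assumption, not real TCR data).
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 5))
distances = squareform(pdist(features))  # square, symmetric distance matrix
IDX = hierarchical_optimization_guarded(distances, features, "ward", "distance")
```

If every threshold is skipped, `np.argmax` falls back to the first entry of `t_list`, so a caller may want to check the number of clusters in the result.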

sidhomj commented 1 year ago

Thank you for contributing! I'll update the code!