matteodellamico / flexible-clustering

Clustering for arbitrary data and dissimilarity function
BSD 3-Clause "New" or "Revised" License

2x as many labels as there are input observations #6

Closed jolespin closed 6 months ago

jolespin commented 6 months ago

No release version is available, but I'm using the current build, which is commit 96a2c3f.

Here's the command I ran to benchmark:

sample_to_groundtruth = df_meta_samples["classification"]

benchmarking = defaultdict(dict)
for min_samples in [10,50,100,200,500]:
    for ef in [5,10,25,50,75, 100]:
        clusterer = flexible_clustering.FISHDBC(jaccard, min_samples=min_samples, ef=ef)
        for min_cluster_size in [10,50,100,200,500]:
            id = "min_samples={},ef={},min_cluster_size={}".format(min_samples, ef, min_cluster_size)
            for elem in sy.pv(df_features.values, id):
                clusterer.add(elem)
            labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster(min_cluster_size=min_cluster_size)
            labels = pd.Series(labels, index=df_features.index)
            labels_filtered = labels[lambda x: x > -1]

            index = labels_filtered.index.intersection(sample_to_groundtruth.index)

            benchmarking[id]["min_samples"] = min_samples
            benchmarking[id]["ef"] = ef
            benchmarking[id]["homogeneity_score"] = homogeneity_score(sample_to_groundtruth.loc[index], labels_filtered.loc[index])
            benchmarking[id]["completeness_score"] = completeness_score(sample_to_groundtruth.loc[index], labels_filtered.loc[index])

df_benchmarking = pd.DataFrame(benchmarking).T

Here's the error:

min_samples=10,ef=5,min_cluster_size=10: 100%|██████████| 52325/52325 [02:13<00:00, 391.52it/s]
min_samples=10,ef=5,min_cluster_size=50: 100%|██████████| 52325/52325 [02:39<00:00, 328.12it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 12
     10     clusterer.add(elem)
     11 labels, probs, stabilities, condensed_tree, slt, mst = clusterer.cluster(min_cluster_size=min_cluster_size)
---> 12 labels = pd.Series(labels, index=df_genomic_traits.index)
     13 labels_filtered = labels[lambda x: x > -1]
     15 index = labels_filtered.index.intersection(sample_to_groundtruth.index)

File ~/miniconda3/envs/soothsayer_env/lib/python3.9/site-packages/pandas/core/series.py:575, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    573     index = default_index(len(data))
    574 elif is_list_like(data):
--> 575     com.require_length_match(data, index)
    577 # create/copy the manager
    578 if isinstance(data, (SingleBlockManager, SingleArrayManager)):

File ~/miniconda3/envs/soothsayer_env/lib/python3.9/site-packages/pandas/core/common.py:573, in require_length_match(data, index)
    569 """
    570 Check the length of data matches the length of the index.
    571 """
    572 if len(data) != len(index):
--> 573     raise ValueError(
    574         "Length of values "
    575         f"({len(data)}) "
    576         "does not match length of index "
    577         f"({len(index)})"
    578     )

ValueError: Length of values (104650) does not match length of index (52325)

Not sure why there are 2x as many labels as observations with these params.

jolespin commented 6 months ago

Note: The labels are not duplicates

labels.tolist()[:20]
[238,
 236,
 -1,
 160,
 236,
 160,
 -1,
 -1,
 308,
 -1,
 197,
 -1,
 -1,
 -1,
 312,
 312,
 -1,
 300,
 -1,
 -1]

jolespin commented 6 months ago

Digging a little deeper, this seems to happen only when I change min_cluster_size.

matteodellamico commented 6 months ago

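I think you're adding the same elements multiple times to the dataset: the clusterer is created outside the min_cluster_size loop, so every pass adds all the elements to it again. You should create a new clusterer in each iteration of the inner for loop.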
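To illustrate why the label count doubles, here is a minimal sketch. It uses a hypothetical `ToyClusterer` stand-in, not the real FISHDBC class; the only assumption is that `add()` appends to internal state and `cluster()` returns one label per stored element. The placeholder labels are meaningless; only the lengths matter.

```python
class ToyClusterer:
    """Hypothetical stand-in for a stateful clusterer (not the FISHDBC API)."""

    def __init__(self):
        self.data = []

    def add(self, elem):
        # Every call grows the internal dataset; nothing is deduplicated.
        self.data.append(elem)

    def cluster(self, min_cluster_size):
        # Placeholder labeling: one label per element added so far.
        return [-1] * len(self.data)


observations = list(range(5))
sizes = [10, 50]

# Reused clusterer: the second pass adds the same observations again.
shared = ToyClusterer()
buggy_lengths = []
for size in sizes:
    for elem in observations:
        shared.add(elem)
    buggy_lengths.append(len(shared.cluster(min_cluster_size=size)))

# Fresh clusterer per iteration: the label count always matches the input.
fixed_lengths = []
for size in sizes:
    fresh = ToyClusterer()
    for elem in observations:
        fresh.add(elem)
    fixed_lengths.append(len(fresh.cluster(min_cluster_size=size)))

print(buggy_lengths)  # [5, 10]
print(fixed_lengths)  # [5, 5]
```

The reused clusterer reproduces the pattern in the traceback: the first min_cluster_size value gives the expected number of labels, and the second gives twice as many.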

matteodellamico commented 6 months ago

It seems to me this is not a bug, but feel free to reopen if you think I'm wrong.

jolespin commented 6 months ago

Apologies! I hadn't realized that was outside of the for-loop. Thanks for catching this error on my part.