joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained
https://github.com/joshlk/k-means-constrained
BSD 3-Clause "New" or "Revised" License

[BUG] IndexError: index 10000 is out of bounds for axis 0 with size 10000 #55

Open cgr71ii opened 5 months ago

cgr71ii commented 5 months ago

Describe the bug

Hi,

I have code that intermittently (non-deterministically) raises an IndexError when calling the fit method. I also applied the change made in k_means_constrained/k_means_constrained_.py (it is not included in version 0.7.3 of the Python package), but the error still occurs after that change. Traceback:

"""
Traceback (most recent call last):
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/k_means_constrained/k_means_constrained_.py", line 303, in kmeans_constrained_single
    centers = _init_centroids(X, n_clusters, init, random_state=random_state, x_squared_norms=x_squared_norms)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/k_means_constrained/sklearn_import/cluster/k_means_.py", line 311, in _init_centroids
    centers = _k_init(X, k, random_state=random_state,
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/k_means_constrained/sklearn_import/cluster/k_means_.py", line 103, in _k_init
    X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/numpy/core/memmap.py", line 335, in __getitem__
    res = super().__getitem__(index)
IndexError: index 10000 is out of bounds for axis 0 with size 10000
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1418, in <module>
    main()
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1380, in main
    test_dr(config)
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1260, in test_dr
    do_epoch_encode(model, data, corpus, ids, tokenizer, batch_size, save_path, epoch, n_code=code_num)
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1204, in do_epoch_encode
    raise e
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1199, in do_epoch_encode
    centroids, code = constrained_km(normed_collection, nc)
  File "/home/cgarcia/Documents/GenRet/run3.py", line 1285, in constrained_km
    clf.fit(data)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/k_means_constrained/k_means_constrained_.py", line 645, in fit
    k_means_constrained(
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/k_means_constrained/k_means_constrained_.py", line 192, in k_means_constrained
    results = Parallel(n_jobs=n_jobs, verbose=0)(
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
IndexError: index 10000 is out of bounds for axis 0 with size 10000
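One possible explanation, not confirmed for this issue: the failing line indexes X with candidate_ids, and in k-means++ initialisation those ids typically come from a sorted search over a cumulative sum of distances. If a sampled target ends up even slightly larger than the last cumulative value (for example through floating-point rounding), the search returns an index equal to the array length, which matches "index 10000 ... size 10000". The sketch below is a pure-Python analogue that uses bisect in place of np.searchsorted; the function names are hypothetical and this is not the library's code. It needs Python 3.9+ for math.nextafter.

```python
import bisect
import math
from itertools import accumulate


def pick_index(cumulative, target):
    """Leftmost insertion point, the same rule np.searchsorted uses by default."""
    return bisect.bisect_left(cumulative, target)


def pick_index_clipped(cumulative, target):
    """Same search, but clipped so the result is always a valid index."""
    return min(pick_index(cumulative, target), len(cumulative) - 1)


# 10000 equal weights stand in for the per-sample squared distances.
weights = [0.1] * 10000
cumulative = list(accumulate(weights))

# A target one representable step above the last cumulative value.
just_above = math.nextafter(cumulative[-1], math.inf)

unsafe = pick_index(cumulative, just_above)
safe = pick_index_clipped(cumulative, just_above)

print(unsafe, len(cumulative))  # 10000 10000 -> out of bounds as an index
print(safe)                     # 9999 -> last valid index
```

Newer scikit-learn releases clip the candidate ids in their k-means++ initialisation for this reason. Whether the copy of that code shipped with this package has an equivalent guard is something I have not verified.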

The code I'm running is https://github.com/sunnweiwei/GenRet/blob/dd252d1f3f8b3c16ff50bbc952813f2c9afbecc8/run.py#L1100

369.pt.error.normed_collection.tar.gz

Minimum working example

import pickle
from k_means_constrained import KMeansConstrained

# The input file that failed in one of my runs
with open("./369.pt.error.normed_collection", "rb") as f:
    data = pickle.load(f)

print(type(data))  # <class 'numpy.ndarray'>
print(data.shape)  # (10000, 512)

n_clusters = 512
size_min = min(len(data) // (n_clusters * 2), n_clusters // 4)  # min(9, 128) = 9
size_max = n_clusters * 2  # 1024
clf = KMeansConstrained(n_clusters=n_clusters, size_min=size_min, size_max=size_max, max_iter=10, n_init=10, n_jobs=4, verbose=False)

# Raises IndexError, but not on every run. I could not find a way to reproduce it reliably.
clf.fit(data)
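Since the failure only shows up on some runs, one way to pin it down might be to fix random_state and scan a range of seeds until one fails. The sketch below assumes that a fixed seed makes the failure repeatable, which I have not confirmed. The helper names find_failing_seeds and fit_with_seed are hypothetical; the KMeansConstrained arguments are the same as in the example above, plus random_state.

```python
def find_failing_seeds(fit_once, seeds):
    """Call fit_once(seed) for each seed and collect the seeds that raise IndexError."""
    failing = []
    for seed in seeds:
        try:
            fit_once(seed)
        except IndexError as exc:
            failing.append((seed, str(exc)))
    return failing


def fit_with_seed(data, seed, n_clusters=512):
    """Run one seeded fit with the same settings as the example above."""
    # Imported lazily so find_failing_seeds can be used without the package installed.
    from k_means_constrained import KMeansConstrained

    size_min = min(len(data) // (n_clusters * 2), n_clusters // 4)
    clf = KMeansConstrained(
        n_clusters=n_clusters,
        size_min=size_min,
        size_max=n_clusters * 2,
        max_iter=10,
        n_init=10,
        n_jobs=4,
        verbose=False,
        random_state=seed,  # fixed seed, so a failing run can be repeated
    )
    clf.fit(data)


# Usage, with `data` loaded as in the example above:
# failing = find_failing_seeds(lambda seed: fit_with_seed(data, seed), range(100))
# print(failing)
```

If a failing seed turns up, rerunning that seed with n_jobs=1 should give a direct traceback without the joblib worker layers.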

Versions: