scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

pypi version throws ValueError #607

Open FinnHuelsbusch opened 1 year ago

FinnHuelsbusch commented 1 year ago

To reproduce the bug:

  1. Create a new Python 3.11.x environment (tested with Python 3.11.4).
  2. Install the following dependencies:
    • scipy 1.11.1
    • scikit-learn 1.3.0
    • cython 0.29.36
    • hdbscan 0.8.33
  3. Create a minimal example:
    from sklearn.datasets import make_blobs
    import hdbscan
    blobs, labels = make_blobs(n_samples=2000, n_features=10)
    clusterer = hdbscan.HDBSCAN()
    clusterer.fit(blobs)
    print(clusterer.labels_)
  4. Execute it and get the following error:
    Traceback (most recent call last):
    File "/home/***/Desktop/hdbscan_test.py", line 5, in <module>
    clusterer.fit(blobs)
    File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
    ) = hdbscan(clean_data, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 884, in hdbscan
    _tree_to_labels(
    File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
    File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
    File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
    TypeError: 'numpy.float64' object cannot be interpreted as an integer

Workaround:

  1. Clone the repo.
  2. Uninstall hdbscan from the environment.
  3. Execute python setup.py install while the environment is active.
  4. Execute the minimal example again.
  5. It works.

This was also tested with the commit 813636b2eda63739c9fc081f2ef78ad4c98444a1 (The commit of version 0.8.33)

It would be nice to get instructions on how to fix this (if the error is on my side), or a fix in general.

Tested on Windows and Linux. This error only occurs under python 3.11.x.

FinnHuelsbusch commented 1 year ago

The error message seems similar to one mentioned in the comments of #600, and to its fix in #602, though both concern the condense_tree function.

empowerVictor commented 1 year ago

I have the same error with both 0.8.29 and 0.8.33.

LoveFishoO commented 1 year ago

Absolutely, my Python version is also 3.11.x. I have the same error, but after trying this method I get another error: ModuleNotFoundError: No module named 'hdbscan._hdbscan_linkage'

Running python setup.py develop instead of python setup.py install solved this problem for me.

FinnHuelsbusch commented 1 year ago

Maybe #606 helps with this error.

jkmackie commented 1 year ago

I also replicated the bug on Windows. Packages installed with pypi. Base virtual environment created with miniconda.

Bug occurs:

from sklearn.datasets import make_blobs
import hdbscan
blobs, labels = make_blobs(n_samples=2000, n_features=10)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)
print(clusterer.labels_)

Error:

File hdbscan\\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Avoid the bug by switching to slower Python 3.10.x and downgrading scikit-learn. Keep the hdbscan and numpy versions.

No errors:

Revised 15 August, 2023

RichieHakim commented 1 year ago

I am also getting this error on windows builds. This seems like a pretty urgent issue. @lmcinnes or @gclendenning, forgive the @, but you may want to take a look at this.

johnlees commented 1 year ago

So this line: https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L733

is_cluster = {cluster: True for cluster in node_list}

node_list is constructed above:

    if allow_single_cluster:
        node_list = sorted(stability.keys(), reverse=True)
    else:
        node_list = sorted(stability.keys(), reverse=True)[:-1]
        # (exclude root)

and stability is from https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L164, see return https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L237-L241

    result_pre_dict = np.vstack((np.arange(smallest_cluster,
                                           condensed_tree['parent'].max() + 1),
                                 result_arr)).T

    return dict(result_pre_dict)

np.arange should have an integer dtype, I think; result_arr has dtype=np.double.

I am not sure whether the np.vstack might be casting the integer keys to floats due to the result_arr dtype (I might check this later); I can't see anything obvious in numpy that would have changed this behaviour.
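The suspected promotion is easy to check in isolation. A minimal sketch with made-up values (not hdbscan's actual data):

```python
import numpy as np

# Illustrative stand-ins for the variables in the snippet above.
smallest_cluster = 378
result_arr = np.array([1.5, 0.25, 3.0], dtype=np.double)  # per-cluster stabilities

# vstack has to find a common dtype for the stacked arrays, so the integer
# keys from np.arange are promoted to float64 to match result_arr.
result_pre_dict = np.vstack((np.arange(smallest_cluster, smallest_cluster + 3),
                             result_arr)).T
stability = dict(result_pre_dict)

# Every key in the resulting dict is a numpy.float64, not a Python int.
assert result_pre_dict.dtype == np.float64
assert all(isinstance(k, np.float64) for k in stability.keys())
```

If this is the code path taken, the float keys are inherent to the np.vstack construction rather than a numpy behaviour change.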

JanElbertMDavid commented 1 year ago

@jkmackie thanks for the solution mate! appreciate it.

lmcinnes commented 1 year ago

At least some of the issues seem to be related to the wheel built for windows (and python 3.11). I have deleted that from PyPI. The downside is that installing on windows will require you to build from source; the upside is that hopefully installing from PyPI might work now.

johnlees commented 1 year ago

Just to confirm, I am also seeing this on an Ubuntu 22.04 CI with:

johnlees commented 1 year ago

b .../lib/python3.10/site-packages/hdbscan/hdbscan_.py:80
p stability_dict.keys()
dict_keys([378.0, 379.0, 380.0, 381.0, 382.0, 383.0, 384.0, 385.0, 386.0, 387.0, 388.0, 389.0, 390.0, 391.0, 392.0, 393.0, 394.0])

not sure if those being floats is the problem here

jkmackie commented 1 year ago

@johnlees I suspect downgrading scikit-learn below 1.3 would fix it on Ubuntu. Numpy 1.24.4 is used in the successful Windows configuration below:

# Successful configuration - Windows 10.

(myvirtualenv) 
me@mypc MINGW64 ~/embedding_clustering
$ conda list | grep -w '^python\s\|scikit\|hdbscan\|numpy'
hdbscan                   0.8.33                   pypi_0    pypi
numpy                     1.24.4                   pypi_0    pypi
python                    3.10.9          h4de0772_0_cpython    conda-forge
scikit-learn              1.2.1                    pypi_0    pypi

Note hdbscan is imported separately from scikit-learn. I wonder why it isn't imported as a module like KMeans?

#from package.subpackage import module
from sklearn.cluster import KMeans

#in contrast, hdbscan cluster algo is imported directly
import hdbscan

johnlees commented 1 year ago

Same issue with scikit-learn 1.2.2 and 1.2.1, and other packages as above. I'm guessing this is a cython issue with the pyx files?

lmcinnes commented 1 year ago

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

RichieHakim commented 1 year ago

Removing the pre-built wheel for windows on pypi was sufficient to get it working on my github actions windows runners.

If it is helpful, here is an example of when it was failing: https://github.com/RichieHakim/ROICaT/actions/runs/5861440405/job/15891513454

Thank you for the quick fix.

alxfgh commented 1 year ago

Removing the pre-built wheels and building from source didn't solve the bug for me

jkmackie commented 1 year ago

Removing the pre-built wheels and building from source didn't solve the bug for me

Did you try a fresh environment?

conda create -n testenv python=3.11

pip install hdbscan==0.8.33 numpy==1.24.4 notebook==7.0.2 scikit-learn==1.3.0

Cython should be something like 0.29.26 not 3.0.

If there's a hdbscan error, try:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

johnlees commented 1 year ago

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

Likewise – doing the install from source (rebuilding the cython-generated .so libraries) makes the issue go away. I have floats in the line reported by the backtrace, and I'm not sure that's the correct erroring line anyway. I might try rebuilding the conda-forge version and see if that helps.

lmcinnes commented 1 year ago

We have a new azure-pipelines CI system that will automatically build wheels and publish them to PyPI thanks to @gclendenning, so hopefully the next time we make a release this will all work a little better. It is definitely just quirks on exactly how things build on different platforms etc. but the fine details of that are ... hard to sort out.

johnlees commented 1 year ago

Ah maybe I should have been clearer, I am having issues with the conda version, not pypi. The rebuild on conda-forge didn't sort out the CI issue unfortunately, still the same error.

lmcinnes commented 1 year ago

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

johnlees commented 1 year ago

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

Thanks for the pointer, this seems to have fixed it! It looks like we can pin cython<3 at build time while leaving the version unpinned at run time, and it works. I also added a run test to the recipe, which I hope will flag such an issue in future releases.

Gr4dient commented 1 year ago

Hi all, having trouble understanding what to do here (I installed HDBSCAN 2 days ago through Conda and I'm currently experiencing this issue). Can I remove and reinstall HDBSCAN through Conda at this point to solve the problem? If so, do I also need to remove and reinstall anything else? Cython? Thank you.

johnlees commented 1 year ago

@Gr4dient I would reinstall HDBSCAN in that environment, or even just try a fresh conda environment. I hope to have fixed it in the 0.8.33_3 builds (when you run conda list, the hdbscan version should end in _3).

Gr4dient commented 1 year ago

Hi John, thanks for clarifying - it took several hours for Conda to find a solution to remove Cython and HDBSCAN from my NLP environment last night... not sure why it got so hung up. I'm not seeing '_3' on conda-forge; will that be available at some point soon? Thanks

johnlees commented 1 year ago

The new builds are on conda forge, e.g. in my working environment conda list shows:

hdbscan                    0.8.33        py310h1f7b6fc_3          conda-forge

If you are having trouble with time taken to resolve environments I would recommend using mamba instead of conda, or just starting over with a new environment, or both.

benmwebb commented 10 months ago

I can also reproduce this with a from-source build on Fedora 39:

# dnf install python3-devel python3-Cython python3-numpy python3-scipy python3-scikit-learn python3-setuptools gcc
# curl -LO https://files.pythonhosted.org/packages/44/2c/b6bb84999f1c82cf0abd28595ff8aff2e495e18f8718b6b18bb11a012de4/hdbscan-0.8.33.tar.gz
# tar -xvzf hdbscan-0.8.33.tar.gz 
# (cd hdbscan-0.8.33 && python3 setup.py build -j8)
# cat <<END > test.py
import hdbscan
from sklearn.datasets import make_blobs
data, _ = make_blobs(1000)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
assert len(cluster_labels) == 1000
END
# PYTHONPATH=hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/ python3 test.py
...
  File "//hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
TypeError: 'numpy.float64' object cannot be interpreted as an integer

A hacky fix which works for me is to replace https://github.com/scikit-learn-contrib/hdbscan/blob/0.8.33/hdbscan/_hdbscan_tree.pyx#L726-L729 with

    if allow_single_cluster:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)
    else:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)[:-1]
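The int cast works because the underlying TypeError is the generic CPython one raised when a numpy.float64 is passed where an integer is required. A minimal illustration of the mechanism, outside hdbscan:

```python
import numpy as np

key = np.float64(394.0)

# numpy.float64 does not implement __index__, so integer-only contexts
# reject it with the same message seen in the tracebacks above.
try:
    range(key)
except TypeError as err:
    assert "cannot be interpreted as an integer" in str(err)

# Casting back to int, as the patch above does for the stability keys,
# restores the expected behaviour.
assert list(range(int(np.float64(3.0)))) == [0, 1, 2]
```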