scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Buffer dtype mismatch, expected 'int64_t' but got 'int' #13

Closed JanBenes closed 8 years ago

JanBenes commented 8 years ago

Hi,

I tried running the plot_hdbscan.py example, but it failed with an error:

  File "X:/somepath/example.py", line 44, in <module>
    hdb = HDBSCAN(min_cluster_size=10).fit(X)
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 520, in fit
    self._min_spanning_tree) = hdbscan(X, **self.get_params())
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 362, in hdbscan
    gen_min_span_tree)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python27\lib\site-packages\hdbscan\hdbscan_.py", line 162, in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
  File "hdbscan/_hdbscan_boruvka.pyx", line 273, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan\_hdbscan_boruvka.c:4629)
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'int'

I am not really sure how to proceed now. I think it might be a configuration issue. The only things that seem relevant are that I had to manually download and install VCForPython27.msi, as instructed by pip, and that I had to manually install Cython, because pip install hdbscan kept failing with a Cython-related error. I remember reading that Cython has to use the same version of the C/C++ compiler that was used to compile Python itself, but I'm not sure how to verify that this is the case (Python seems to have been built with MSC v.1500 32 bit). I can only assume that pip pointed me to the right distribution, i.e. VCForPython27.msi.

I'm on Windows 10, Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32, I have MSVC 2015 installed (if relevant), and pip freeze reports:

cycler==0.9.0
Cython==0.23.4
decorator==4.0.4
fastcluster==1.1.20
hcluster==0.2.0
hdbscan==0.6
matplotlib==1.5.0
numpy==1.9.3
Pillow==3.0.0
pyparsing==2.0.6
python-dateutil==2.4.2
pytz==2015.7
scikit-learn==0.17
scipy==0.16.1
six==1.10.0

which exceeds your requirements.txt. NumPy is built with MKL; all libraries were installed either through pip or from Christoph Gohlke's binaries (http://www.lfd.uci.edu/~gohlke/pythonlibs/).

Any other ideas as to what might be wrong? Thanks!

lmcinnes commented 8 years ago

I believe you are doing everything correctly; I haven't done much testing on Windows, and that has been causing me some problems. As far as I can tell, the issue is that the internals of the sklearn kd-tree are stored as int32s while I'm expecting int64s, which I suspect stems from your 32-bit install. I'm not sure there's an easy fix for that beyond an extensive code refactor that I've been putting off but apparently need to get around to.
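For readers who hit the same error and want to confirm whether their interpreter is a 32-bit build, here is a minimal sketch using only the standard library. The helper names `interpreter_bits` and `describe_build` are hypothetical and not part of hdbscan or scikit-learn. A result of 32 bits would be consistent with the int32 versus int64 mismatch suspected above, though it does not by itself prove that this is the cause.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # struct.calcsize("P") is the size of a C pointer (void *) in bytes.
    return struct.calcsize("P") * 8


def describe_build():
    """Collect build facts that help diagnose 32-bit vs 64-bit mismatches."""
    return {
        "bits": interpreter_bits(),
        # The compiler string reported by the interpreter itself.
        "compiler": platform.python_compiler(),
        "platform": sys.platform,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_build().items()):
        print("{}: {}".format(key, value))
```

The compiler string printed here is the same kind of value that appears in the interpreter banner, so it can be compared against the compiler used to build any extension modules.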

In the meantime you can do the following: whenever you instantiate an HDBSCAN object, be sure to pass

algorithm='prims_kdtree'

for euclidean metric or

algorithm='prims_balltree'

for other metrics (except precomputed, which should work fine as is). This runs a slower algorithm that should work on your system. I'll try and get the more major code overhaul made and let you know when I'm done.
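The workaround above could be wrapped up roughly as follows. This is only a sketch: `workaround_kwargs` and `fit_with_workaround` are hypothetical helper names, not part of the hdbscan API. The only things taken from the thread are the `algorithm` values `'prims_kdtree'` and `'prims_balltree'` and the rule for choosing between them.

```python
def workaround_kwargs(metric="euclidean"):
    """Return HDBSCAN keyword arguments that apply the suggested workaround."""
    if metric == "precomputed":
        # Precomputed distances reportedly work as is, so the default
        # algorithm choice is left untouched.
        return {"metric": metric}
    if metric == "euclidean":
        return {"metric": metric, "algorithm": "prims_kdtree"}
    # Any other metric falls back to the ball-tree variant.
    return {"metric": metric, "algorithm": "prims_balltree"}


def fit_with_workaround(X, metric="euclidean", **params):
    """Fit HDBSCAN on X using the slower Prim's-based algorithm."""
    # Imported lazily so the helper above can be used without hdbscan installed.
    from hdbscan import HDBSCAN

    params.update(workaround_kwargs(metric))
    return HDBSCAN(**params).fit(X)
```

With this in place, the failing call from the traceback would become something like `hdb = fit_with_workaround(X, min_cluster_size=10)`.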

lmcinnes commented 8 years ago

I just checked in a fix for this (hopefully -- I have no relevant test system available). If you could clone the repository and check if it now works for you I would be grateful!

lmcinnes commented 8 years ago

Now on PyPI as version 0.6.1. Closing for now; please reopen if the problem persists.

JanBenes commented 8 years ago
  1. The algorithm='prims_kdtree' workaround works for me.
  2. 0.6.1 fixed the issue even for the default choice of algorithm. It also resolved two deprecation warnings in sklearn. I still get two deprecation warnings when I run the clustering from my own code, namely
C:\Python27\lib\site-packages\sklearn\lda.py:4: DeprecationWarning: lda.LDA has been moved to discriminant_analysis.LinearDiscriminantAnalysis in 0.17 and will be removed in 0.19
  "in 0.17 and will be removed in 0.19", DeprecationWarning)
C:\Python27\lib\site-packages\sklearn\qda.py:4: DeprecationWarning: qda.QDA has been moved to discriminant_analysis.QuadraticDiscriminantAnalysis in 0.17 and will be removed in 0.19.
  "in 0.17 and will be removed in 0.19.", DeprecationWarning)
  3. There's still one warning in the example, coming from matplotlib:
C:\Python27\lib\site-packages\matplotlib\lines.py:1107: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if self._markerfacecolor != fc:
I do not need that fixed, just thought you might want to know.

Thanks a lot for the very prompt fix :+1:, I appreciate it. I'll be happy to run some tests on Windows for you if needed, just let me know if interested.