scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

buffer dtype mismatch #71

Open learningbymodeling opened 8 years ago

learningbymodeling commented 8 years ago

I am trying to run hdbscan but I get the error:

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long long'

I have attached my code below; it follows the standard example I have seen. Not sure how to proceed. Thank you

import hdbscan
hdb = hdbscan.HDBSCAN(min_cluster_size=10)
clusters_hdb = hdb.fit_predict(feature_vects)
lmcinnes commented 8 years ago

I think I need a little more information. Can you post the full stack trace that occurs at the error? Or possibly share the dataset that is failing?

learningbymodeling commented 8 years ago

Unfortunately, I can't share the dataset but I found a dataset online which produces the same result. The dataset is available at CrowdFlower and is labeled as "Identifying key phrases in text", so you can download it from there. Here is the standalone code to reproduce the error. Does this help? Thanks.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

csv_file_path = "Key-phrases-DFE-794640.csv"

raw_corpus = pd.read_csv(csv_file_path)

# Only interested in the column raw_corpus["answer"]

# Number of features in data set
feature_size = 2000

raw_corpus = raw_corpus.fillna("")

vectorizer = CountVectorizer(
                        strip_accents = "ascii",
                        analyzer = "word",
                        tokenizer = None,
                        preprocessor = None,
                        ngram_range = (1, 1),
                        stop_words = "english",
                        max_df = 1.00,
                        min_df = 0.01,
                        max_features = feature_size)

vectorizer = vectorizer.fit(raw_corpus["answer"])
feature_vects = vectorizer.transform(raw_corpus["answer"]).toarray()
vocab = vectorizer.get_feature_names()

import sys  # needed for sys.exc_info() below

import hdbscan
hdb = hdbscan.HDBSCAN(min_cluster_size=10)
try:
    clusters_hdb = hdb.fit_predict(feature_vects)
except IndexError:
    exc_type, exc_value, exc_traceback = sys.exc_info()

Traceback (most recent call last):
  File "", line 4, in <module>
    clusters_hdb = hdb.fit_predict(feature_vects)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 750, in fit_predict
    self.fit(X)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 732, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 507, in hdbscan
    gen_min_span_tree, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 196, in _hdbscan_prims_kdtree
    min_spanning_tree = mst_linkage_core_vector(X, core_distances, dist_metric, alpha)
  File "hdbscan/_hdbscan_linkage.pyx", line 51, in hdbscan._hdbscan_linkage.mst_linkage_core_vector (hdbscan_hdbscan_linkage.c:3840)

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long long'

lmcinnes commented 8 years ago

Is it possible that your feature vectors are all integer valued? Can you try casting it to float64? e.g.

import numpy as np
clusters_hdb = hdb.fit_predict(feature_vects.astype(np.float64))

Let me know if that works. That sort of casting should be happening internally, and you shouldn't have to know or worry about the type of your input data, but perhaps a check/conversion is missing.
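For reference, the cast itself is just a NumPy operation; here is a minimal sketch of the situation (the small array below is a stand-in for the real feature matrix, and the variable name follows the snippet above):

```python
import numpy as np

# CountVectorizer.transform(...).toarray() yields an integer dtype
# (reported as 'long long' on Windows), which triggers the buffer
# dtype mismatch inside the Cython MST routine.
feature_vects = np.array([[0, 1, 2], [3, 0, 1]])  # stand-in for real counts
assert feature_vects.dtype.kind == "i"  # integer-valued input

# Casting to float64 ('double_t' in Cython terms) avoids the error:
feature_vects = feature_vects.astype(np.float64)
print(feature_vects.dtype)  # float64
```

The cast would then make `hdb.fit_predict(feature_vects)` succeed, since the Cython buffer expects a C `double`.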


edgarriba commented 7 years ago

@lmcinnes

Let me know if that works. That sort of casting should be happening internally, and you shouldn't have to know or worry about the type of your input data, but perhaps a check/conversion is missing.

I can confirm that this issue is still present, at least in the latest pip package; I had to cast my uchar vectors to np.float64.

lmcinnes commented 7 years ago

Sorry about that -- it fell off my radar. Should be fixed now, and I'll try to get a new pip package out soon to remedy the problem globally. Thanks for the heads up that I had missed this one.

VarIr commented 7 years ago

The dtype mismatch still seems to be an issue with metric='precomputed'. I get the error when fitting an np.float32 distance matrix, but casting to np.float64 fixes the problem.
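The same workaround applies in the precomputed case; a minimal sketch (the distance values here are made up for illustration):

```python
import numpy as np

# A float32 precomputed distance matrix reproduces the mismatch;
# casting up to float64 before calling fit() works around it.
dist = np.array([[0.0, 1.5, 2.0],
                 [1.5, 0.0, 0.7],
                 [2.0, 0.7, 0.0]], dtype=np.float32)

dist64 = dist.astype(np.float64)
# hdb = hdbscan.HDBSCAN(min_cluster_size=2, metric='precomputed')
# hdb.fit(dist64)
print(dist64.dtype)  # float64
```

Note that `astype` copies the matrix, so for very large distance matrices the cast doubles memory for float32 inputs.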

lmcinnes commented 7 years ago

I'll see if I can hunt that down. Sorry about the issue, but I'm glad that at least casting can provide a workaround for now.

jerrykaplan commented 5 years ago

FYI: same problem as of 8/6/2019 with integer input and metric='precomputed'. Casting to float64 fixed it.

giulia-antinori commented 2 years ago

Hi, same problem on May 4th, 2022 with float input and metric='precomputed'. Casting to float64 fixed it.
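Summing up the thread: every report here (integer count vectors, uchar vectors, float32 or integer distance matrices) is resolved by the same cast, so a small defensive helper can be applied to any input before fitting. `ensure_float64` is a hypothetical name for illustration, not part of the hdbscan API:

```python
import numpy as np

def ensure_float64(X):
    """Hypothetical helper: return X as a contiguous float64 array.

    Covers every dtype reported in this thread (int64, uint8/uchar,
    float32) and is a no-op cast when the input is already float64.
    """
    X = np.asarray(X)
    if X.dtype != np.float64:
        X = X.astype(np.float64)
    return np.ascontiguousarray(X)

# Works for feature matrices and precomputed distance matrices alike:
print(ensure_float64(np.array([[1, 2], [3, 4]])).dtype)        # float64
print(ensure_float64(np.zeros((2, 2), dtype=np.uint8)).dtype)  # float64
print(ensure_float64(np.eye(2, dtype=np.float32)).dtype)       # float64
```

Either `hdb.fit(ensure_float64(X))` or `hdb.fit(ensure_float64(dist_matrix))` would then sidestep the buffer dtype mismatch regardless of the metric used.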