learningbymodeling opened this issue 8 years ago
I think I need a little more information. Can you post the full stack trace that occurs at the error? Or possibly share the dataset that is failing?
Unfortunately, I can't share the dataset, but I found a dataset online that produces the same result: it is available on CrowdFlower, labeled "Identifying key phrases in text", so you can download it from there. Here is standalone code to reproduce the error. Does this help? Thanks.
import sys

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

csv_file_path = "Key-phrases-DFE-794640.csv"
raw_corpus = pd.read_csv(csv_file_path)
# Only interested in the column raw_corpus["answer"]
raw_corpus = raw_corpus.fillna("")

# Number of features in the data set
feature_size = 2000

vectorizer = CountVectorizer(
    strip_accents="ascii",
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    ngram_range=(1, 1),
    stop_words="english",
    max_df=1.00,
    min_df=0.01,
    max_features=feature_size)
vectorizer = vectorizer.fit(raw_corpus["answer"])
feature_vects = vectorizer.transform(raw_corpus["answer"]).toarray()
vocab = vectorizer.get_feature_names()

import hdbscan

hdb = hdbscan.HDBSCAN(min_cluster_size=10)
try:
    clusters_hdb = hdb.fit_predict(feature_vects)
except IndexError:
    # Capture the exception details for inspection (needs the sys import above)
    exc_type, exc_value, exc_traceback = sys.exc_info()
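For what it's worth, the dtype of the vectorized features can be inspected directly; `CountVectorizer` produces integer counts by default, which matches the `'long long'` in the traceback below. A minimal sketch, using a small stand-in corpus rather than the CSV from this issue:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus; any text works to see the dtype behavior.
docs = ["the cat sat", "the dog ran", "a cat and a dog"]
counts = CountVectorizer().fit_transform(docs).toarray()

print(counts.dtype)                      # int64 (CountVectorizer's default dtype)
print(counts.astype(np.float64).dtype)   # float64 after the explicit cast
```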
Traceback (most recent call last):
  File "", line 4, in
    clusters_hdb = hdb.fit_predict(feature_vects)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 750, in fit_predict
    self.fit(X)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 732, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 507, in hdbscan
    gen_min_span_tree, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Program Files\Anaconda3\lib\site-packages\hdbscan\hdbscan_.py", line 196, in _hdbscan_prims_kdtree
    min_spanning_tree = mst_linkage_core_vector(X, core_distances, dist_metric, alpha)
  File "hdbscan/_hdbscan_linkage.pyx", line 51, in hdbscan._hdbscan_linkage.mst_linkage_core_vector (hdbscan/_hdbscan_linkage.c:3840)
ValueError: Buffer dtype mismatch, expected 'double_t' but got 'long long'
Is it possible that your feature vectors are all integer valued? Can you try casting it to float64? e.g.
clusters_hdb = hdb.fit_predict(feature_vects.astype(np.float64))
Let me know if that works. That sort of casting should be happening internally, and you shouldn't have to know or worry about the type of your input data, but perhaps a check/conversion is missing.
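For completeness, the kind of internal check/conversion described above could look something like the following. This is a hypothetical sketch of a defensive helper, not actual hdbscan code:

```python
import numpy as np

def ensure_float64(X):
    """Coerce input to a C-contiguous float64 array before passing it to
    Cython routines that expect 'double' buffers.
    Hypothetical helper for illustration; not part of the hdbscan package."""
    X = np.asarray(X)
    if X.dtype != np.float64:
        X = X.astype(np.float64)
    return np.ascontiguousarray(X)

# An integer-valued feature matrix, as CountVectorizer would produce:
X_int = np.arange(6, dtype=np.int64).reshape(3, 2)
print(ensure_float64(X_int).dtype)  # float64
```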
@lmcinnes
Let me know if that works. That sort of casting should be happening internally, and you shouldn't have to know or worry about the type of your input data, but perhaps a check/conversion is missing.
I can confirm that this issue is still present, at least in the latest pip package: I had to cast my uchar vectors to np.float64.
Sorry about that -- it fell off my radar. Should be fixed now, and I'll try to get a new pip package out soon to remedy the problem globally. Thanks for the heads up that I had missed this one.
The dtype mismatch still seems to be an issue with metric='precomputed'. I get the error when fitting an np.float32 distance matrix, but casting to np.float64 fixes the problem.
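The workaround for the precomputed case can be sketched as follows. This uses SciPy only to build an example distance matrix; the hdbscan call is shown in a comment, assuming the package is installed:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Build a symmetric pairwise distance matrix from random points,
# stored as float32 to mimic the failing input in this report.
points = np.random.RandomState(0).rand(20, 3)
dist = squareform(pdist(points)).astype(np.float32)

# Cast to float64 before fitting with metric='precomputed':
dist64 = dist.astype(np.float64)
# clusterer = hdbscan.HDBSCAN(metric='precomputed').fit(dist64)
print(dist64.dtype)  # float64
```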
I'll see if I can hunt that down. Sorry about the issue, but I'm glad that at least casting can provide a workaround for now.
FYI: same problem on 8/6/2019 with integer input and metric='precomputed'. Casting to float64 fixed it.
Hi, same problem on May 4th, 2022 with float input and metric='precomputed'. Casting to float64 fixed it.
I am trying to run hdbscan but I get the error:
I have attached my code below; it is standard, based on the examples I have seen. Not sure how to proceed. Thank you.