Ah, the catch is that the hard computation depends on min_samples, not min_cluster_size; however, when min_samples is not specified, hdbscan sets it equal to min_cluster_size. What you want is:
mem = Memory(cachedir='path/clustering')
clusterer = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=50,
                            algorithm='boruvka_kdtree', memory=mem).fit(data)
and then
clusterer = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=100,
                            algorithm='boruvka_kdtree', memory=mem).fit(data)
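For reference, a minimal, self-contained sketch of the same idea is below. The imports, the make_blobs toy data, and the cache path are assumptions for illustration only; also note that newer joblib versions use location= instead of cachedir=.

# Sketch (assumed setup): reuse the cached hard computation by keeping
# min_samples fixed while varying min_cluster_size.
import hdbscan
from joblib import Memory
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=10000, centers=5, random_state=0)  # toy data, assumed

mem = Memory(cachedir='path/clustering')  # same cache object for both fits
                                          # (newer joblib: location='path/clustering')

# First fit pays the full cost of the core-distance / spanning-tree computation.
clusterer_50 = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=50,
                               algorithm='boruvka_kdtree', memory=mem).fit(data)

# Second fit reuses the cached computation, since min_samples is unchanged;
# only the cheap cluster-extraction step is redone.
clusterer_100 = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=100,
                                algorithm='boruvka_kdtree', memory=mem).fit(data)

With this setup only one cache directory should be created, and the second fit should be noticeably faster than the first.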
Ah, thank you very much. Now that you mention it, it makes a lot of sense; it was indirectly stated in the docs, but I missed that detail.
A short notice in API Reference section would be great :)
Thank you very much for your quick help!
Hi there,
I went through the entire documentation for HDBSCAN, looked at the docs for joblib.Memory, and had a look at issue #212, but I can't for the life of me figure out how to make use of joblib.Memory.
Since it is supposed to store the hard computation and thus make looking at different cluster sizes faster, I time my code and also check the directories. However, whenever I try to make use of the cached results, a new directory is created.
My code in jupyter notebook currently looks like this:
Cell 1:
%%time
mem = Memory(cachedir='path/clustering')
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, algorithm='boruvka_kdtree', memory=mem).fit(data)
Cell 2:
%%time
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, algorithm='boruvka_kdtree', memory=mem).fit(data)
This always produces two caching directories and I can't seem to make use of the previously cached data.
Could you please help me out and tell me what I am missing?
Best regards