Ah, the catch is that the hard computation depends on min_samples, not min_cluster_size; however, when min_samples is not specified, hdbscan sets it equal to min_cluster_size. What you want is:
mem = Memory(cachedir='path/clustering')
clusterer = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=50,
                            algorithm='boruvka_kdtree', memory=mem).fit(data)
and then
clusterer = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=100,
                            algorithm='boruvka_kdtree', memory=mem).fit(data)
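For reference, a minimal, self-contained sketch of the same idea is below. The imports, the make_blobs toy data, and the cache path are assumptions for illustration only; also note that newer joblib versions use location= instead of cachedir=.

# Sketch (assumed setup): reuse the cached hard computation by keeping
# min_samples fixed while varying min_cluster_size.
import hdbscan
from joblib import Memory
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=10000, centers=5, random_state=0)  # toy data, assumed

mem = Memory(cachedir='path/clustering')  # same cache object for both fits
                                          # (newer joblib: location='path/clustering')

# First fit pays the full cost of the core-distance / spanning-tree computation.
clusterer_50 = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=50,
                               algorithm='boruvka_kdtree', memory=mem).fit(data)

# Second fit reuses the cached computation, since min_samples is unchanged;
# only the cheap cluster-extraction step is redone.
clusterer_100 = hdbscan.HDBSCAN(min_samples=50, min_cluster_size=100,
                                algorithm='boruvka_kdtree', memory=mem).fit(data)

With this setup only one cache directory should be created, and the second fit should be noticeably faster than the first.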
Ah, thank you very much. Now that you mention it, it makes a lot of sense; it was indirectly stated in the docs, but I missed that detail.
A short notice in API Reference section would be great :)
Thank you very much for your quick help!
Hi there,
I went through the entire documentation for HDBSCAN, looked at the docs for joblib.Memory, and had a look at issue #212, but I can't for the life of me figure out how to make use of joblib.Memory.
Since it is supposed to store the hard computation and thus make looking at different cluster sizes faster, I time my code and also check the directories. However, whenever I try to make use of the cached results, a new directory is created.
My code in jupyter notebook currently looks like this:
Cell 1:
%%time
mem = Memory(cachedir='path/clustering')
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, algorithm='boruvka_kdtree', memory=mem).fit(data)
Cell 2:
%%time
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, algorithm='boruvka_kdtree', memory=mem).fit(data)
This always produces two caching directories and I can't seem to make use of the previously cached data.
Could you please help me out and tell me what I am missing?
Best regards