matteodellamico / flexible-clustering

Clustering for arbitrary data and dissimilarity function
BSD 3-Clause "New" or "Revised" License

First issue! #1

Closed supertulli closed 4 years ago

supertulli commented 4 years ago

Hi! I just stumbled upon your work. We've been using hdbscan on a clustering problem that suffers from scalability issues. I've been trying to get your package to work, but some issues have occurred.

First, in setup.py I needed to add:

`import numpy ...

setup(...

include_dirs=[numpy.get_include()]`

for the installation to complete successfully.

There is still an issue, though. Trying to run your example, I stumbled upon this error:

ModuleNotFoundError: No module named 'flexible_clustering.unionfind'

Is there a solution to this?

And thank you so much for your work, looks really promising!

matteodellamico commented 4 years ago

Hi! Can you give more information about your OS and architecture? I don't need the modifications to setup.py in that case.

Also, can you make sure that you don't run the example from within the flexible_clustering directory? AFAIK it's a matter of path precedence, so python tries to load the library from the directory it's in rather than the one in which the compiled code is installed.
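The path-precedence point can be checked directly. The sketch below is not part of flexible-clustering; `module_origin` and `loaded_from` are hypothetical helper names. It uses only the standard library's `importlib.util.find_spec` to report which file Python would load for a module, and whether that file sits under a given directory such as the current one.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file Python would load for `name`, or None if no file is found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised e.g. when the parent package of a dotted name is missing.
        return None
    return spec.origin if spec is not None else None


def loaded_from(name, directory):
    """Return True if `name` resolves to a file located under `directory`."""
    origin = module_origin(name)
    if origin is None or not os.path.isabs(origin):
        return False  # not found, or a built-in/frozen module
    directory = os.path.realpath(directory)
    try:
        common = os.path.commonpath([os.path.realpath(origin), directory])
    except ValueError:
        return False  # e.g. paths on different drives on Windows
    return common == directory


if __name__ == "__main__":
    # Run this from the directory in which you normally start Python.
    print("resolves to:", module_origin("flexible_clustering"))
    print("under current directory:", loaded_from("flexible_clustering", os.getcwd()))
```

If the second line prints `True` while you are inside the cloned repository, Python is most likely picking up the uncompiled source tree rather than the installed package, which would match the missing `unionfind` module described above.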

matteodellamico commented 4 years ago

Also, aside from this issue, I'd be very interested to know for what you plan to use FISHDBC. Can you share that?

thanks!

supertulli commented 4 years ago

Hi Matteo!

So I think I got it running after all... had to add a couple of instructions on the setup.py file. You can look it up in my fork.

Still can't get the example to run, though... and I believe it is because I'm in a Jupyter Notebook, running Python 3.6 (Win10 based).

I must confess I'm quite a noob in Python, since all my previous work in ML and computational stats was done in R.

Regarding the application, it is for a ML customer service bot which might need a 170k size sample frame with potentially more than 100 features to build.

HDBSCAN has worked the best so far, but it has only been manageable for much smaller data. Scalability drove me to your work, and I found it to be very promising!

Thanks for the friendly reply,

Pedro

matteodellamico commented 4 years ago

Hi Pedro!

Yes, the fishdbc_example.py file is supposed to be run from a console.

With 100 features, it's not a given that FISHDBC will be faster than HDBSCAN, but you can certainly try. Most likely, FISHDBC will use less memory though. Which distance function are you using? Euclidean? Please let me know how it goes :)

supertulli commented 4 years ago

Hi Matteo,

Sorry for the delayed answer, but more urgent matters came up, and only now have I tried the example from a console. The issue seems to be conflicting parsers...

Now it runs further, generating a plot, but the number of colours and the number of points are not consistent. The error is:

ValueError: 'c' argument has 200 elements, which is not acceptable for use with 'x' with size 50, 'y' with size 50.

To get it to run I overrode the length of c to 50, and it finally generated the plot for some clusters in the initial step, but then it stops again when the lengths of x, y and c diverge...

Also, when I ran it on my dataset, it finished in under an hour! It was quite good, much better than expected.
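The error above is raised by matplotlib's `scatter` when a per-point colour sequence passed as `c` has a different number of elements than `x` and `y`. The thread does not show the example's plotting code or the maintainer's eventual fix, so the following is only a generic guard, under the assumption that the colour list covers more points than are being plotted at that step. `colours_for` is a hypothetical helper name.

```python
def colours_for(points, colours):
    """Return exactly one colour per point, or raise if there are too few colours.

    Assumes colours[i] belongs to points[i], so surplus colours can be dropped.
    """
    if len(colours) < len(points):
        raise ValueError(
            f"{len(points)} points but only {len(colours)} colours"
        )
    return list(colours[:len(points)])


# Illustrative data: 50 points plotted so far, but labels exist for 200 points.
points = [(i, i * i) for i in range(50)]
labels = [i % 4 for i in range(200)]

c = colours_for(points, labels)
xs = [p[0] for p in points]
ys = [p[1] for p in points]
print(len(xs), len(ys), len(c))  # prints: 50 50 50
```

With the lengths aligned, `xs`, `ys` and `c` could then be passed to `plt.scatter(xs, ys, c=c)`.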

Thanks again for the reply.

Pedro

matteodellamico commented 4 years ago

Hi Pedro!

Thanks to you for trying this out. I hope the results are working well for you too!

I've committed some changes that should solve the problem for you.

Besides, have you tried playing around with the ef parameter? It's by far the most important one, and as I've seen in my paper, ef between 20 and 50 looks like a good set of tradeoffs between speed and result quality.

JishnuJayaraj commented 3 years ago

I came across the same error while using it. Installation was successful but when I do 'import flexible_clustering' it says 'No module named flexible_clustering.unionfind' (while I was in flexible-clustering directory)

matteodellamico commented 3 years ago

Yes. Please don't run it from the flexible_clustering directory. From elsewhere, it should work fine.

JishnuJayaraj commented 3 years ago

Wow, quick reply, thanks! But when I run it from outside, it says No module named 'flexible_clustering'

matteodellamico commented 3 years ago

Are you sure the installation (python setup.py install) was successful? You should end up with flexible_clustering installed in your python path. Are you satisfying dependencies (python3, cython, hdbscan, scipy)?
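A quick way to answer both questions from the interpreter that will run the code is sketched below. It relies only on the standard library; `missing_modules` is a hypothetical helper name, and the list of import names is an assumption derived from the dependencies mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the entries of `names` that cannot be found on the import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names for the dependencies listed above, plus the package itself.
# Note that the cython package is imported as "Cython".
required = ["Cython", "hdbscan", "scipy", "flexible_clustering"]
print("missing:", missing_modules(required))
```

An empty list means every module is visible to this interpreter. If `flexible_clustering` is reported as missing right after a successful install, the interpreter running the check is probably not the one the package was installed into, or it was started before the installation finished.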

JishnuJayaraj commented 3 years ago

I cloned the repo to my Colab instance and ran 'python3 setup.py install'; the console output is below. The only thing I noticed was a warning about the deprecated NumPy API.

/content/flexible-clustering
Compiling flexible_clustering/unionfind.pyx because it changed.
[1/1] Cythonizing flexible_clustering/unionfind.pyx
/usr/local/lib/python3.6/dist-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /content/flexible-clustering/flexible_clustering/unionfind.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running install
running bdist_egg
running egg_info
creating flexible_clustering.egg-info
writing flexible_clustering.egg-info/PKG-INFO
writing dependency_links to flexible_clustering.egg-info/dependency_links.txt
writing top-level names to flexible_clustering.egg-info/top_level.txt
writing manifest file 'flexible_clustering.egg-info/SOURCES.txt'
writing manifest file 'flexible_clustering.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/fishdbc.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/pdict.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/fishdbc_example.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/extsort.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/optics.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/hnsw_optics_cachefile.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/hnsw_optics.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/hnsw.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/plot_optics.py -> build/lib.linux-x86_64-3.6/flexible_clustering
copying flexible_clustering/__init__.py -> build/lib.linux-x86_64-3.6/flexible_clustering
running build_ext
building 'flexible_clustering.unionfind' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/flexible_clustering
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.6/dist-packages/numpy/core/include -I/usr/include/python3.6m -c flexible_clustering/unionfind.c -o build/temp.linux-x86_64-3.6/flexible_clustering/unionfind.o
In file included from /usr/local/lib/python3.6/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1832:0,
                 from /usr/local/lib/python3.6/dist-packages/numpy/core/include/numpy/ndarrayobject.h:12,
                 from /usr/local/lib/python3.6/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from flexible_clustering/unionfind.c:623:
/usr/local/lib/python3.6/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
 #warning "Using deprecated NumPy API, disable it with " \
  ^~~
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/flexible_clustering/unionfind.o -o build/lib.linux-x86_64-3.6/flexible_clustering/unionfind.cpython-36m-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/fishdbc.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/pdict.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/unionfind.cpython-36m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/fishdbc_example.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/extsort.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/optics.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/hnsw_optics_cachefile.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/hnsw_optics.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/hnsw.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/plot_optics.py -> build/bdist.linux-x86_64/egg/flexible_clustering
copying build/lib.linux-x86_64-3.6/flexible_clustering/__init__.py -> build/bdist.linux-x86_64/egg/flexible_clustering
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/fishdbc.py to fishdbc.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/pdict.py to pdict.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/fishdbc_example.py to fishdbc_example.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/extsort.py to extsort.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/optics.py to optics.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/hnsw_optics_cachefile.py to hnsw_optics_cachefile.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/hnsw_optics.py to hnsw_optics.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/hnsw.py to hnsw.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/plot_optics.py to plot_optics.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/__init__.py to __init__.cpython-36.pyc
creating stub loader for flexible_clustering/unionfind.cpython-36m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/flexible_clustering/unionfind.py to unionfind.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying flexible_clustering.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying flexible_clustering.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying flexible_clustering.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying flexible_clustering.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
flexible_clustering.__pycache__.unionfind.cpython-36: module references __file__
creating dist
creating 'dist/flexible_clustering-0.1.0-py3.6-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing flexible_clustering-0.1.0-py3.6-linux-x86_64.egg
creating /usr/local/lib/python3.6/dist-packages/flexible_clustering-0.1.0-py3.6-linux-x86_64.egg
Extracting flexible_clustering-0.1.0-py3.6-linux-x86_64.egg to /usr/local/lib/python3.6/dist-packages
Adding flexible-clustering 0.1.0 to easy-install.pth file

Installed /usr/local/lib/python3.6/dist-packages/flexible_clustering-0.1.0-py3.6-linux-x86_64.egg
Processing dependencies for flexible-clustering==0.1.0
Finished processing dependencies for flexible-clustering==0.1.0

JishnuJayaraj commented 3 years ago

Oh yes, I think the problem was with the Jupyter editing environment. I tried it both on Colab and on my local system and got the same error.

Restarting the kernel fixed the issue on both systems!

Your paper looks interesting, and thank you so much for the implementation. I will play with it now and let you know.
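A likely explanation for why the restart helped, offered as an assumption rather than a confirmed diagnosis: the install log ends with the egg being added to `easy-install.pth`, and Python's `site` module reads `.pth` files at interpreter start-up, so a kernel that was already running would not have the new entry on `sys.path`. Restarting the kernel is the fix confirmed in this thread. The sketch below shows a standard-library alternative that re-reads the `.pth` files of a site directory in a running session; `refresh_site_dir` is a hypothetical helper name, and this has not been tried on the setups discussed here.

```python
import importlib
import site
import sys


def refresh_site_dir(site_dir):
    """Re-read the .pth files in `site_dir`; return the sys.path entries that were added."""
    before = set(sys.path)
    site.addsitedir(site_dir)      # appends site_dir (if new) and processes its .pth files
    importlib.invalidate_caches()  # let the import system notice newly installed files
    return [p for p in sys.path if p not in before]
```

In a notebook, one might call `refresh_site_dir(d)` for each `d` in `site.getsitepackages()` right after installing, and then retry the import. If that does not help, restarting the kernel remains the reliable option.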