matteodellamico / flexible-clustering

Clustering for arbitrary data and dissimilarity function
BSD 3-Clause "New" or "Revised" License

Clustering 2 Million 20-d vectors using Euclidean distance takes too long during clustering phase #3

Open whaowhao opened 2 years ago

whaowhao commented 2 years ago

Hi Matteo,

Thanks for the great work! I am applying FISHDBC to cluster 2 million 20-dimensional data points using Euclidean distance, but contrary to what I expected from reading the paper, the build phase took 3 hours (which is reasonable) while the clustering phase (calling clusterer.cluster()) has been running for more than 4 hours and is still going.

Shouldn't the clustering phase take much less time than the build phase (as suggested by the household dataset runtimes in Table 8)? Do you have a hunch as to what is going on?

Thank you :)

matteodellamico commented 2 years ago

Hi @whaowhao, and thanks for trying this out! For sure, your use case doesn't look like one of the best fits for FISHDBC; you may want to try HDBSCAN (https://hdbscan.readthedocs.io/). FISHDBC should work better with very high dimensionality and/or custom, expensive distance functions. I wonder if you can share what you're applying this algorithm to...

That said, 1) you can probably improve performance by using a fast, vectorized distance function (the vectorized parameter--undocumented, I know--should help). Say your dataset is an n*m numpy array where n is the number of elements and m is the dimension; you could define something like

    from scipy.spatial.distance import cdist

    def d(i, js):
        # distances from data[i] to every row of data[js], as a 1-D array
        return cdist([data[i]], data[js])[0]

and then pass it to FISHDBC with vectorized=True (and add items by their indexes in your dataset, not by their values). You might speed this up a bit more, e.g. by implementing the distance function with numba, if I recall correctly. 2) Are you running out of RAM? There may be memory issues there. 3) How are you running the algorithm? If you're just calling add() several times without ever calling update(), you're postponing substantial work, and it's not very surprising that clusterer.cluster() spends a lot of time in update().
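
To make point 1 concrete, here's a rough sketch of how the vectorized setup could be wired up end to end; it assumes data is the n*m numpy array mentioned above, and the exact constructor arguments and cluster() return values may differ in your version, so treat it as a starting point rather than a reference:

    import numpy as np
    from scipy.spatial.distance import cdist
    from flexible_clustering import FISHDBC

    data = np.random.rand(100_000, 20)  # stand-in for the real n*m array

    def d(i, js):
        # vectorized Euclidean distance from data[i] to every row of data[js]
        return cdist([data[i]], data[js])[0]

    clusterer = FISHDBC(d, vectorized=True)
    for i in range(len(data)):
        clusterer.add(i)          # add indexes into data, not the vectors themselves
    result = clusterer.cluster()  # tuple of clustering results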

whaowhao commented 2 years ago

Thank you Matteo!

I am aiming to use FISHDBC to cluster 10-50 million 20-dimensional text embeddings. While HDBSCAN* can indeed handle a 2 million x 20 array faster than FISHDBC, I believe it would run into OOM problems when scaling to 10 million. That's why I chose FISHDBC for this task.

The clustering finished right after I made this post: for an array of size 2 million x 20, it took ~3 hours in the build phase and ~4 hours in the cluster phase.

Besides using vectorized=True, do you have other suggestions for scaling to a 50 million x 20 array?

matteodellamico commented 2 years ago

Well, I'd need you to answer my question 3 :) Are you adding elements with .add() or .update()? update() regularly calls update_mst, and if you're not using it you should call update_mst at regular intervals yourself to limit memory usage.
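
Continuing the earlier sketch, something along these lines could work (again just a sketch: I'm assuming update_mst() takes no arguments and that you're adding by index as in the vectorized example above; the batch size of 100,000 is an arbitrary knob to tune):

    for i in range(len(data)):
        clusterer.add(i)
        if (i + 1) % 100_000 == 0:
            # process the pending work now, instead of letting it pile up
            # in memory and leaving it all to cluster()
            clusterer.update_mst()
    result = clusterer.cluster()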

Provide plentiful RAM and cross your fingers :) I'd be very interested in knowing what dataset size you manage to handle, on which machine and with which timing, if you can disclose that :)

Another, lateral, thing: FISHDBC was created to not need embeddings, so if you have a (dis)similarity function that works directly on the raw data, maybe try it out (and let me know if you find interesting results)!
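
For example, something like this (a sketch only: difflib here is just a cheap standard-library stand-in for whatever string dissimilarity actually fits your data, and I'm assuming the plain, non-vectorized d(x, y) calling convention):

    import difflib
    from flexible_clustering import FISHDBC

    def string_dissimilarity(a, b):
        # any function taking two raw items and returning a dissimilarity will do
        return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

    texts = ["some raw text", "some raw text, slightly edited", "something else entirely"]
    clusterer = FISHDBC(string_dissimilarity)
    for t in texts:
        clusterer.add(t)  # raw items this time: no embeddings, no indexes
    result = clusterer.cluster()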

whaowhao commented 2 years ago

Gotcha. Currently I am using .add() for all data points and then calling .cluster() after all points are added; I will try calling update_mst to see the speed/memory trade-offs. A side question: would calling update_mst make the overall pipeline slower? I have ~600 GB of RAM, so hopefully that'd be enough :) Regarding embeddings: unfortunately, edit distance applied directly to the text doesn't work as well as Euclidean distance on text embeddings.

matteodellamico commented 2 years ago

If, as I expect, memory usage keeps growing linearly with the data set size, you should definitely be fine with 600 GB of RAM.

I expect some kind of speed/memory trade-off depending on how frequently you call update_mst, but I wouldn't expect much loss in terms of speed. I didn't have such a big machine to play with, so I guess you're on your own there. It looks like your computation is split almost half-and-half between MST maintenance and the HNSW part, which is interesting.