slowkow / harmonypy

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.
https://portals.broadinstitute.org/harmony/
GNU General Public License v3.0

Use sklearn.cluster.KMeans for init #20

Closed johnarevalo closed 1 year ago

johnarevalo commented 1 year ago

This PR replaces scipy KMeans with scikit-learn version.

scipy.cluster.vq.kmeans2 is slower than sklearn.cluster.KMeans. This has been discussed before in scipy, but it seems the issue persists. Here are timings from running the snippet below on a toy 500x200 matrix:

import time as t

import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.cluster import KMeans

data = np.loadtxt('data.csv.gz')
duration = {'scipy': [], 'sklearn': []}
k_opts = np.arange(10, 201, 10)

for k in k_opts:
    # Time scipy's kmeans2 with k-means++ initialization.
    s = t.time()
    centroid1, labels1 = kmeans2(data, k, minit='++', iter=10, thresh=1.e-4)
    duration['scipy'].append(t.time() - s)

    # Time sklearn's KMeans with equivalent settings (single init).
    s = t.time()
    model = KMeans(n_clusters=k, init='k-means++', tol=1.e-4,
                   max_iter=10, n_init=1, verbose=0)
    model.fit(data)
    centroid2, labels2 = model.cluster_centers_, model.labels_
    duration['sklearn'].append(t.time() - s)
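To reduce the two timing lists to a single comparison number, a mean speedup can be computed after the loop. This is only a sketch: the duration values below are placeholders so the snippet runs standalone, not measurements from this thread.

```python
import numpy as np

# Placeholder timing lists in the same shape as the `duration` dict built
# by the benchmark loop above (values are illustrative, not measured).
duration = {'scipy': [1.2, 2.5, 4.1], 'sklearn': [0.4, 0.7, 1.0]}

# Element-wise ratio: how many times faster sklearn was at each K.
speedup = np.array(duration['scipy']) / np.array(duration['sklearn'])
print(f"mean sklearn speedup: {speedup.mean():.1f}x")
```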

[figure: runtime vs. K for scipy.cluster.vq.kmeans2 and sklearn.cluster.KMeans]

Similar behavior has been observed in other libraries: https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html#comparison-of-high-performance-implementations

slowkow commented 1 year ago

Hi John, thanks for opening a pull request.

When I swapped in kmeans2() to replace kmeans(), I also considered using sklearn instead of scipy, weighing the possible performance benefit against the additional dependency.

This thread seems to suggest that the performance difference is negligible after a change to the code was merged in scipy v1.5.0rc1 sometime in April 2020: https://github.com/scipy/scipy/pull/11982

Your figure shows that the performance difference between sklearn.cluster.KMeans and scipy.cluster.vq.kmeans2 can be measured in fractions of a second (approximately 0.5 seconds).

The run_harmony() function has this line:

https://github.com/slowkow/harmonypy/blob/2fd234e628fcc1b1cc6fb22f2bfbfe033dfb7dc5/harmonypy/harmony.py#L80

So, the maximum K when we run kmeans2() will be 100.
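If I'm reading that line right, K is derived from the number of cells N and capped at 100. A quick sketch of the rule (my paraphrase of the linked line, not a verbatim copy; the N/30 heuristic in particular is my reading of the source):

```python
import numpy as np

# Sketch of the default cluster-count rule referenced above: K scales
# with the number of cells N but is capped at 100. The N/30 divisor is
# an assumption based on my reading of harmony.py, not a verbatim copy.
def default_nclust(n_cells):
    return int(np.min([np.round(n_cells / 30.0), 100]))

print(default_nclust(500))    # small dataset: K well under the cap
print(default_nclust(40000))  # large dataset: K capped at 100
```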

If I recall correctly, my own experiments (data not shown) suggested that scipy.cluster.vq.kmeans2 and sklearn.cluster.KMeans had similar performance. So, I didn't see any reason to add the additional dependency.

I was also worried that sklearn.cluster.KMeans would reproduce the same issue we faced with scipy.cluster.vq.kmeans: the result does not always have exactly K clusters, but sometimes some other number of clusters. I couldn't tell from the documentation whether sklearn.cluster.KMeans is guaranteed to produce exactly K clusters.

Given the lack of a significant performance difference and my uncertainty about sklearn generating exactly K clusters, I opted to avoid the additional dependency.

Do you think I made the wrong choice? Should I reconsider? Do you have an example of running run_harmony() on real data that supports using sklearn instead of scipy? Maybe my experiments missed a possible performance gain? I'd be interested to hear your thoughts.

Thanks again!

johnarevalo commented 1 year ago

Thanks Kamil for your prompt reply.

I couldn't tell from the documentation whether sklearn.cluster.KMeans is guaranteed to produce exactly K clusters.

In cases where clusters collapse together, sklearn raises a warning but still returns K centroids:

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit([[1, 1], [1, 1], [2, 2], [2, 2], [2, 2]])
# ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (3). Possibly due to duplicate points in X.
model.cluster_centers_.shape
# (3, 2)
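For comparison, here is a quick check (my own sketch, not from the thread) of what scipy does on the same degenerate input: kmeans2 likewise returns K centroid rows, but some clusters can end up empty, which is the behavior that motivated the concern above.

```python
import warnings

import numpy as np
from scipy.cluster.vq import kmeans2

# Degenerate data: only two distinct points, but three clusters requested.
data = np.array([[1., 1.], [1., 1.], [2., 2.], [2., 2.], [2., 2.]])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # kmeans2 warns about empty clusters
    centroids, labels = kmeans2(data, 3, minit='points', seed=0)

print(centroids.shape)         # (3, 2): K centroid rows, like sklearn
print(len(np.unique(labels)))  # clusters that actually received points
```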

... the performance difference between sklearn.cluster.KMeans and scipy.cluster.vq.kmeans2 can be measured in fractions of a second (approximately 0.5 seconds).

As data size and K increase, the gap becomes non-negligible. I ran another comparison with more data and a larger K: kmeans2 took about an hour, versus about 6 minutes for sklearn.cluster.KMeans.

The run_harmony() function has this line ... So, the maximum K when we run kmeans2() will be 100.

A bit more context about my use case. I am using harmony to remove batch effects in data (not scRNAseq) where we have > 2,000 different biological concepts, so I'd expect more than 100 clusters.

In particular, I'm working with ~40,000 samples and ~2,000 features with K=300. Harmony initialization with kmeans2 does not finish even after ~6 hours of running, while sklearn.cluster.KMeans finishes in about one hour.

Perhaps this is legacy code, but it seems sklearn is already a dependency.

https://github.com/slowkow/harmonypy/blob/2fd234e628fcc1b1cc6fb22f2bfbfe033dfb7dc5/harmonypy/lisi.py#L20

As you can see, I'm in favor of adding sklearn 😄. sklearn is a pretty common library and is unlikely to conflict with other tools. I would also expect most users already include it in their projects for reporting metrics, training models, or as a dependency of other third-party libraries (e.g. scanpy, scVI).

slowkow commented 1 year ago

@johnarevalo If it is ok with you, I would like to eventually add you to the list of authors in setup.py, but I am still learning how to do python packaging correctly and it seems setup.py might not support multiple authors? I might try hatch and see if that makes packaging easier (hatch supports multiple authors).

johnarevalo commented 1 year ago

I'm happy to be included as an author! I did a quick search, and it seems the pyproject.toml format with poetry could also do the trick.