Closed: johnarevalo closed this pull request 1 year ago.
Hi John, thanks for opening a pull request.
When I swapped in kmeans2() to replace kmeans(), I also considered using sklearn instead of scipy, weighing the possible performance benefit against the additional dependency.
This thread seems to suggest that the performance difference is negligible after a change to the code was merged in scipy v1.5.0rc1 sometime in April 2020: https://github.com/scipy/scipy/pull/11982
Your figure shows that the performance difference between sklearn.cluster.KMeans and scipy.cluster.vq.kmeans2 can be measured in fractions of a second (approximately 0.5 seconds).
The run_harmony() function has this line:

So, the maximum K when we run kmeans2() will be 100.
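For readers without the source handy, that cap can be sketched as follows (a hypothetical paraphrase: the one-cluster-per-~30-cells heuristic and the function name are assumptions here, not the actual source line):

```python
import numpy as np

def default_nclust(n_cells: int) -> int:
    # Hypothetical sketch of the heuristic described above:
    # roughly one cluster per 30 cells, capped at 100.
    return int(min(np.round(n_cells / 30.0), 100))

print(default_nclust(40000))  # capped at 100 for large datasets
```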
If I recall correctly, my own experiments (data not shown) suggested that scipy.cluster.vq.kmeans2 and sklearn.cluster.KMeans performed similarly, so I didn't see any reason to add the additional dependency.
I was also worried that sklearn.cluster.KMeans would reproduce the same issue we faced with scipy.cluster.vq.kmeans: the result does not always have exactly K clusters, but sometimes some other number of clusters. I couldn't tell from the documentation whether sklearn.cluster.KMeans is guaranteed to produce exactly K clusters.
Considering the lack of significant performance differences and my hesitation about generating exactly K clusters with sklearn, I opted to avoid the additional dependencies required by installing sklearn.
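For reference, the kmeans vs kmeans2 difference I'm describing can be sketched with a toy example (made-up duplicated-point data, not from my original experiments):

```python
import numpy as np
from scipy.cluster.vq import kmeans, kmeans2

# Two distinct points, each duplicated; ask for 3 clusters.
data = np.array([[1.0, 1.0]] * 3 + [[2.0, 2.0]] * 3)

# kmeans may drop empty clusters and return fewer than K centroids.
codebook, distortion = kmeans(data, 3)
print(codebook.shape[0])  # can be less than 3

# kmeans2 always returns exactly K centroids (it warns on empty clusters).
centroids, labels = kmeans2(data, 3, minit="points")
print(centroids.shape)  # (3, 2)
```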
Do you think I made the wrong choice? Should I reconsider? Do you have an example of running run_harmony() with real data that supports the choice to use sklearn instead of scipy? Maybe my experiments did not catch a possible performance gain? What do you think? I'd be interested to hear your thoughts.
Thanks again!
Thanks Kamil for your prompt reply.
I couldn't tell from the documentation if sklearn.cluster.Kmeans is guaranteed to produce exactly K clusters.
In cases where clusters collapse, sklearn raises a warning but still returns K centroids:

import sklearn.cluster

model = sklearn.cluster.KMeans(n_clusters=3)
model.fit([[1, 1], [1, 1], [2, 2], [2, 2], [2, 2]])
# ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (3). Possibly due to duplicate points in X.
model.cluster_centers_.shape
# Out[1]: (3, 2)
... the performance difference between sklearn.cluster.Kmeans and scipy.cluster.vq.kmeans2 can be measured in fractions of a second (approximately 0.5 seconds).
As data size and K increase, the gap becomes non-negligible. I ran another comparison with more data and a bigger K: kmeans2 took ~1 hour vs ~6 minutes for sklearn.cluster.KMeans.
The run_harmony() function has this line, so the maximum K when we run kmeans2() will be 100.
A bit more context about my use case: I am using harmony to remove batch effects in data (not scRNA-seq) where we have more than 2,000 different biological concepts, so I'd expect more than 100 clusters.
In particular, I'm working with ~40,000 samples and ~2,000 features with K=300. Harmony initialization with kmeans2 does not finish after ~6 hours of running; sklearn.cluster.KMeans finishes in about one hour.
Perhaps this is legacy code, but it seems sklearn is already a dependency.
As you can see, I'm in favor of adding sklearn 😄. sklearn is a pretty common library and will hardly conflict with other tools. Also, I would expect most users already include it in their projects for reporting metrics, training models, or as a dependency of other third-party libraries (e.g. scanpy, scVI).
@johnarevalo If it is ok with you, I would like to eventually add you to the list of authors in setup.py, but I am still learning how to do python packaging correctly and it seems setup.py might not support multiple authors? I might try hatch and see if that makes packaging easier (hatch supports multiple authors).
I'm happy to be included as an author! I quickly searched and it seems the pyproject.toml format with poetry could also do the trick.
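For what it's worth, the standard pyproject.toml project table (PEP 621) does support a list of authors; a minimal sketch (names from this thread, emails are placeholders):

```toml
[project]
name = "harmonypy"
version = "0.0.0"  # placeholder
authors = [
  { name = "Kamil Slowikowski", email = "kamil@example.com" },
  { name = "John Arevalo", email = "john@example.com" },
]
```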
This PR replaces the scipy KMeans implementation with the scikit-learn version. scipy.cluster.vq.kmeans2 is slower than sklearn.cluster.KMeans. This has been discussed before in scipy, but it seems the issue persists. Running the same snippet for this toy 500x200 matrix:

Similar behavior has been seen in other libraries: https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html#comparison-of-high-performance-implementations
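The snippet itself is not reproduced above, but a comparison along these lines (the K value and timing setup are assumptions, not the original benchmark) illustrates the kind of measurement:

```python
import time

import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 200))  # toy 500x200 matrix
K = 30  # assumed cluster count for illustration

# Time scipy's kmeans2.
t0 = time.perf_counter()
centroids_scipy, _ = kmeans2(X, K, minit="++")
t_scipy = time.perf_counter() - t0

# Time scikit-learn's KMeans.
t0 = time.perf_counter()
model = KMeans(n_clusters=K, n_init=10).fit(X)
t_sklearn = time.perf_counter() - t0

print(f"scipy kmeans2: {t_scipy:.2f}s, sklearn KMeans: {t_sklearn:.2f}s")
```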