openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Use scib-metrics implementations for batch integration embedding task #694

Open adamgayoso opened 1 year ago

adamgayoso commented 1 year ago

For the batch integration embedding task we have developed this package:

https://scib-metrics.readthedocs.io/en/stable/

with Python-only, JAX-based implementations. This would make fast cLISI/iLISI readily available. We also sped up:

https://scib-metrics.readthedocs.io/en/stable/generated/scib_metrics.nmi_ari_cluster_labels_leiden.html#scib_metrics.nmi_ari_cluster_labels_leiden

by using joblib and a faster Leiden implementation in igraph.

LuckyMD commented 1 year ago

I think it would make sense to compare metric reimplementations to the original metrics and then make a call on picking one set in successive iterations of open problems.

adamgayoso commented 1 year ago

We are testing for metric value equivalence in our test suite.

LuckyMD commented 1 year ago

Okay. I guess the results might still differ when testing across datasets. I think @mumichae discovered some differences on our end... mainly for the clustering-based methods that now use k-means.
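An equivalence check of the kind discussed here can be sketched in plain Python. The example below is not taken from the scib-metrics test suite: it uses the adjusted Rand index (ARI) as a stand-in metric, computes it two independent ways, and reports the largest disagreement across several randomly generated label sets. All function names, the random "datasets", and the tolerance are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def ari_contingency(labels_true, labels_pred):
    """ARI from the contingency-table closed form (needs at least 2 points)."""
    n_pairs = math.comb(len(labels_true), 2)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / n_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:  # both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def ari_pair_counting(labels_true, labels_pred):
    """Same quantity, computed by brute-force counting over all point pairs."""
    tp = fp = fn = tn = 0
    n = len(labels_true)
    for i in range(n):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            if same_true and same_pred:
                tp += 1
            elif same_true:
                fn += 1
            elif same_pred:
                fp += 1
            else:
                tn += 1
    if fn == 0 and fp == 0:  # partitions are identical
        return 1.0
    return 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))


def make_random_datasets(n_datasets=5, n_cells=60, n_clusters=4, seed=0):
    """Hypothetical stand-in for real benchmark datasets: random label pairs."""
    rng = random.Random(seed)
    return [
        (
            [rng.randrange(n_clusters) for _ in range(n_cells)],
            [rng.randrange(n_clusters) for _ in range(n_cells)],
        )
        for _ in range(n_datasets)
    ]


def max_abs_difference(reference, candidate, datasets):
    """Largest absolute disagreement between two implementations."""
    return max(abs(reference(t, p) - candidate(t, p)) for t, p in datasets)


datasets = make_random_datasets()
print(max_abs_difference(ari_contingency, ari_pair_counting, datasets))
```

In a real comparison the two callables would be the original scib metric and its reimplementation, and the datasets would be the actual benchmark inputs. Metrics that depend on a stochastic clustering step would likely need fixed seeds or a looser tolerance.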

adamgayoso commented 1 year ago

Yes, we added a k-means metric using a known K, but we are still working on its speed:

https://scib-metrics.readthedocs.io/en/stable/generated/scib_metrics.nmi_ari_cluster_labels_kmeans.html#scib_metrics.nmi_ari_cluster_labels_kmeans

We also implemented the scib approach with optimized Leiden (instead of Louvain) clustering:

https://scib-metrics.readthedocs.io/en/stable/generated/scib_metrics.nmi_ari_cluster_labels_leiden.html

This one should be significantly faster than scib's implementation because:

  1. Leiden is faster than Louvain
  2. We parallelize with joblib
  3. We use a faster Leiden implementation than the one in scanpy
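The "optimized" clustering idea (try several resolutions, score each clustering against the known labels, keep the best) is naturally parallel across resolutions. The sketch below shows only that structure and makes several assumptions: `toy_cluster` is a hypothetical stand-in for Leiden that simply bins 1-D values, `ThreadPoolExecutor` from the standard library stands in for joblib, and NMI uses arithmetic-mean normalization. None of this is the scib-metrics code.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def nmi(labels_true, labels_pred):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    entropy_true = -sum(c / n * math.log(c / n) for c in count_true.values())
    entropy_pred = -sum(c / n * math.log(c / n) for c in count_pred.values())
    mutual_info = sum(
        c / n * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy_true + entropy_pred
    return 2.0 * mutual_info / denom if denom > 0 else 1.0


def toy_cluster(values, resolution):
    """Hypothetical stand-in for Leiden: bin values in [0, 1) into
    `resolution` equal-width bins, so higher resolution gives more clusters."""
    return [int(v * resolution) for v in values]


def best_resolution(values, labels, resolutions, n_jobs=2):
    """Score every resolution in parallel and return (resolution, NMI) of the best."""

    def score(resolution):
        return nmi(labels, toy_cluster(values, resolution))

    # pool.map preserves input order, so scores[i] belongs to resolutions[i].
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        scores = list(pool.map(score, resolutions))
    best = max(range(len(resolutions)), key=scores.__getitem__)
    return resolutions[best], scores[best]


values = [0.05, 0.15, 0.25, 0.40, 0.50, 0.60, 0.75, 0.85, 0.95]
labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
print(best_resolution(values, labels, [1, 2, 3, 4, 6]))
```

Because each resolution is scored independently, the sweep scales with the number of workers. With a real clustering backend, process-based parallelism (as joblib offers) would usually be preferable to threads for CPU-bound work.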