theislab / scib

Benchmarking analysis of data integration tools

`cell_cycle` returns poor scores on perfect data input #351

Open scottgigante-immunai opened 1 year ago

scottgigante-immunai commented 1 year ago

If we pass the PCA of the unintegrated data as the embedding, we should get a perfect score. We don't.

>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = adata.X
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
1.0
>>> adata.obsm["X_emb"] = sc.tl.pca(adata, n_comps=50, use_highly_variable=False, svd_solver="arpack", copy=True).obsm["X_pca"]
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.8083203810376348

Related: https://github.com/openproblems-bio/openproblems/pull/706

mumichae commented 1 year ago

Hi Scott, this is likely related to the fact that when the PCA is recomputed, it's done per batch. So essentially the PC regression is computed on different principal components than if you were to use the globally computed PCA.

https://github.com/theislab/scib/blob/2fe05c7d5a18e63ca4bc6b6bcca4167d999c8ba6/scib/metrics/cell_cycle.py#L71
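
For context, the PC regression here boils down to regressing each principal component on a covariate (the cell-cycle score) and summing the variance-weighted R² values. A minimal sketch of that idea, not scib's exact implementation (the `pc_regression` helper and its weighting details are assumptions):

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LinearRegression
>>> def pc_regression(X, covariate, n_comps=50):
...     # fit a PCA, then regress each PC on the covariate and weight the
...     # per-PC R^2 by the variance that PC explains
...     pca = PCA(n_components=n_comps).fit(X)
...     pcs = pca.transform(X)
...     x = np.asarray(covariate).reshape(-1, 1)
...     r2 = np.array([LinearRegression().fit(x, pcs[:, i]).score(x, pcs[:, i]) for i in range(n_comps)])
...     return float(np.sum(r2 * pca.explained_variance_ratio_) / np.sum(pca.explained_variance_ratio_))
... 

Since both the PCs and their variance weights depend on which matrix the PCA is fitted to, fitting per batch versus globally gives different regression values, and hence a different final score.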

@LuckyMD I think we decided on per-batch computation of the PCA. Is this still the behaviour we want? If so, it might make sense to make PCA recomputation the default, or to remove the reuse of existing PCs altogether, to avoid confusion.

scottgigante-immunai commented 1 year ago

Oh! That makes sense. It's a bit of a funny result, then, that the "perfect embedding" here is one in which the batches are embedded with PCA separately and then smashed together, while an embedding that keeps the raw data as-is performs relatively poorly... but at least from the openproblems perspective there is a simple solution here.

scottgigante-immunai commented 1 year ago

>>> import numpy as np
>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = np.zeros((adata.shape[0], 50), dtype=float)
>>> for batch in adata.obs["paul15_clusters"].unique():
...     batch_idx = adata.obs["paul15_clusters"] == batch
...     n_comps = min(50, np.sum(batch_idx))
...     solver = "full" if n_comps == np.sum(batch_idx) else "arpack"
...     adata.obsm["X_emb"][batch_idx,:n_comps] = sc.tl.pca(adata[batch_idx], n_comps=n_comps, use_highly_variable=False, svd_solver=solver, copy=True).obsm["X_pca"]
... 
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.9999998942341607
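
To spell out why this comes back as ~1: with identical pre- and post-integration data, every per-batch PC regression is run on exactly the same matrix twice. A toy illustration of the scoring step, assuming the scib-paper formula 1 - |pcr_before - pcr_after| / pcr_before averaged over batches (`cc_conservation` and the PCR values below are made up):

>>> import numpy as np
>>> def cc_conservation(pcr_before, pcr_after):
...     # 1 means the cell-cycle variance contribution is fully conserved
...     return 1 - abs(pcr_before - pcr_after) / pcr_before
... 
>>> # identical pre/post data => pcr_before == pcr_after for every batch
>>> float(np.mean([cc_conservation(p, p) for p in [0.12, 0.34, 0.56]]))
1.0
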
LuckyMD commented 1 year ago

Hah... yes! This still makes sense, I think. We don't want the PCA to capture batch differences in the CC score. I had forgotten this completely! Thanks so much for highlighting it, @mumichae!