scottgigante-immunai opened 1 year ago
Hi Scott, this is likely related to the fact that when the PCA is recomputed, it is done per batch. So essentially, the PC regression will be computed on different principal components than if you were to use the globally computed PCA.
@LuckyMD I think we decided on per-batch computation of the PCA. Is this still the behaviour we want? If so, it might make sense to default to recomputing the PCA, or to remove the reuse of existing PC components altogether, to avoid confusion.
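To make the per-batch vs. global distinction concrete, here is a numpy-only toy sketch (hypothetical data, not the Paul15 dataset): when batches are separated by a shift, the globally computed top PC mostly captures the batch effect, while a per-batch PC sees only within-batch variation, so the two bases hand very different coordinates to any downstream PC regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy data: two batches separated by a constant shift
batch_a = rng.normal(size=(100, 5))
batch_b = rng.normal(size=(100, 5)) + 5.0

def top_pc(x):
    # first principal component: leading right singular vector of the centered matrix
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[0]

global_pc = top_pc(np.vstack([batch_a, batch_b]))  # dominated by the batch shift
per_batch_pc = top_pc(batch_a)                     # sees only within-batch variation

# the absolute cosine between the two directions is well below 1,
# i.e. the per-batch and global PC bases are not aligned
print(abs(float(global_pc @ per_batch_pc)))
```

The global PC here is close to the shift direction `(1, …, 1)/√5`, which is exactly the kind of batch-driven component that per-batch recomputation avoids.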
Oh! That makes sense. It's a bit of a funny result, then, that the "perfect embedding" here is one in which the batches are embedded with PCA separately and then smashed together, while an embedding that keeps the raw data as-is performs relatively poorly... but at least from the openproblems perspective there is a simple solution here.
>>> import numpy as np
>>> import scanpy as sc
>>> adata = sc.datasets.paul15()
>>> adata.obsm["X_emb"] = np.zeros((adata.shape[0], 50), dtype=float)
>>> # embed each batch with its own PCA, zero-padding batches with fewer than 50 cells
>>> for batch in adata.obs["paul15_clusters"].unique():
...     batch_idx = adata.obs["paul15_clusters"] == batch
...     n_comps = min(50, np.sum(batch_idx))
...     # arpack requires n_comps < n_cells, so use the full solver for small batches
...     solver = "full" if n_comps == np.sum(batch_idx) else "arpack"
...     adata.obsm["X_emb"][batch_idx, :n_comps] = sc.tl.pca(adata[batch_idx], n_comps=n_comps, use_highly_variable=False, svd_solver=solver, copy=True).obsm["X_pca"]
...
>>> from scib.metrics import cell_cycle
>>> cell_cycle(adata, adata, "paul15_clusters", embed="X_emb")
0.9999998942341607
Hah... yes! This still makes sense, I think. We don't want the PCA to capture batch differences in the CC score. I had forgotten this completely! Thanks so much for highlighting this @mumichae!
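For context, the `cell_cycle` metric is built on PC regression: how much variance the cell-cycle score explains in the embedding's principal components, compared before and after integration. A minimal numpy sketch of the PC-regression idea (simplified and illustrative, not scib's exact implementation; all names here are made up):

```python
import numpy as np

def pc_regression(X, covariate, n_comps=10):
    # simplified sketch of PC regression: R^2 of each PC regressed on the
    # covariate, weighted by the variance that PC captures
    Xc = X - X.mean(axis=0)
    u, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = u[:, :n_comps] * s[:n_comps]   # PC scores per cell
    var = s[:n_comps] ** 2                  # variance captured by each PC
    r2 = np.array([np.corrcoef(scores[:, j], covariate)[0, 1] ** 2
                   for j in range(n_comps)])
    return float((r2 * var).sum() / var.sum())

rng = np.random.default_rng(0)
cc_score = rng.normal(size=200)             # stand-in for a cell-cycle score
X = rng.normal(size=(200, 20))
X[:, 0] += 3 * cc_score                     # bake the signal into the data

# large when the covariate dominates a PC, near zero for an unrelated covariate
print(pc_regression(X, cc_score))
print(pc_regression(X, rng.normal(size=200)))
```

Because each batch gets its own PCA in the recomputation path, the PC scores fed into this regression differ from the global-PCA case, which is why the two give different conservation scores.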
If we pass the PCA of the unintegrated data as the embedding, we should get a perfect score. We don't.
Related: https://github.com/openproblems-bio/openproblems/pull/706