shaistamadad / GPLVM_Shaista


X_init not available in scvelo datasets #5

Closed shaistamadad closed 3 years ago

shaistamadad commented 3 years ago

X_init=adata.obsm["X_init"])

The pancreas, gastrulation, and most other scvelo datasets don't have an X_init entry in the obsm slot. They contain obsm: 'X_pca', 'X_umap', 'X_tsne', etc.

I have been trying to use X_umap instead of X_init, but I get errors like this: Sizes of tensors must match except in dimension 2. Got 50 and 7 (The offending index is 0)

emdann commented 3 years ago

Here X_init is a dimensionality reduction used to initialize the GPLVM training. In the iPSC dataset we used the PCA dimensions, so you could try rerunning PCA with the same number of factors as you pass to the model.

sc.pp.pca(adata, n_comps=d)
adata.obsm['X_init'] = adata.obsm['X_pca'].copy()

It would also be very interesting to assess how important this initialization is in these datasets: if you train with X_init=None, do you get the same or similar latent factors?

shaistamadad commented 3 years ago

Running PCA with n_comps=d gives this error for all datasets: ValueError: k must be between 1 and min(A.shape), k=4999. I think that's because d is the number of columns in the Y object, which is larger than the number of rows: (n, d), q = Y.shape, 6. I tried using n-1, but that takes a really long time. I also tried smaller values such as 50,

and get the same error as before: Sizes of tensors must match except in dimension 1. Expected size 7 but got size 50 for tensor number 1 in the list.

Also, what's the difference between using sc.pp.pca() and sc.tl.pca()?
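The ValueError above reflects a general limit: a matrix of shape (n, d) has at most min(n, d) singular values, so PCA cannot return more components than that. Here is a minimal sketch of that limit using plain NumPy on a toy matrix. The helper name max_components is hypothetical, and the "minus one" bound is an assumption that applies to ARPACK-style truncated solvers; other solvers may allow min(n, d).

```python
import numpy as np


def max_components(n_obs: int, n_vars: int) -> int:
    """Hypothetical helper: upper bound on n_comps for an ARPACK-style
    truncated SVD (assumed to be min(n_obs, n_vars) - 1)."""
    return min(n_obs, n_vars) - 1


# Toy stand-in for the expression matrix Y: 20 cells, 100 genes.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 100))
n, d = Y.shape

# A centered (n, d) matrix yields exactly min(n, d) singular values.
singular_values = np.linalg.svd(Y - Y.mean(axis=0), compute_uv=False)

print(len(singular_values), max_components(n, d))
```

Any n_comps above this bound is rejected regardless of the dataset, which would explain why the error shows up for all of them.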

emdann commented 3 years ago

Aah, my bad, sorry: the number of latent dimensions for the model is q, not d! In the iPSC dataset, X_init has 22188 rows (= number of cells, adata.n_obs) and 7 columns. I suspect you need to pass the number of dimensions specified in the model (i.e. q in model = GPLVM(n, d, q, n_inducing=64, period_scale=np.pi, X_init=adata.obsm["X_init"])), plus 1 for the periodic kernel/cell cycle latent variable. So use sc.pp.pca(adata, n_comps=q+1).

This should work alright:

import scvelo as scv
import scanpy as sc

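q = 6  # number of latent dimensions passed to the model
adata = scv.datasets.pancreas()
sc.pp.pca(adata, n_comps=q + 1)  # +1 for the periodic (cell cycle) latent variable
adata.obsm['X_init'] = adata.obsm['X_pca'].copy()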

As far as I know there is no difference between sc.pp.pca and sc.tl.pca.
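To make the shape requirement concrete, here is a minimal sketch that builds an initialization with q + 1 columns and checks it before it would be handed to the model. It assumes, as suggested above, that the model expects X_init of shape (n, q + 1). To stay runnable without scanpy it uses a plain NumPy SVD in place of sc.pp.pca and a random toy matrix in place of a real dataset; the function name pca_init is hypothetical.

```python
import numpy as np


def pca_init(Y: np.ndarray, n_latent: int) -> np.ndarray:
    """Hypothetical helper: PCA scores of Y, truncated to n_latent columns."""
    Y_centered = Y - Y.mean(axis=0)
    # Thin SVD: the columns of U * S are the principal component scores.
    U, S, _ = np.linalg.svd(Y_centered, full_matrices=False)
    return U[:, :n_latent] * S[:n_latent]


q = 6  # latent dimensions given to the model
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 50))  # toy data: 200 cells, 50 genes
n = Y.shape[0]

# Assumption from the thread: one extra column for the periodic latent variable.
X_init = pca_init(Y, q + 1)

# Fail early with a readable message instead of a tensor size mismatch.
if X_init.shape != (n, q + 1):
    raise ValueError(f"X_init has shape {X_init.shape}, expected {(n, q + 1)}")
```

Running the check before constructing the model also catches the case where the PCA was computed with a different number of components than the model expects.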

shaistamadad commented 3 years ago

Worked! Actually, my bad too: I was running the PCA after running the model!