scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.81k stars 584 forks source link

Will scanpy implement ICA? #767

Open chris-rands opened 4 years ago

chris-rands commented 4 years ago

Are there plans to include independent component analysis (ICA) in Scanpy? This would be useful for me. I know that it is already implemented in Seurat. Thank you

ivirshup commented 4 years ago

Could you tell us about how you've found it useful, or point us towards some literature on it being used? I think this would be easy enough to implement, but when I tried a naive implementation the results weren't that compelling. It's very possible I should've played around with the parameters more. @flying-sheep, do you have thoughts on this?

Here's a simple implementation:

def ica(adata, n_components, inplace=True, **kwargs): 
    from sklearn.decomposition import FastICA 
    ica_transformer = FastICA(n_components=n_components, **kwargs) 
    x_ica = ica_transformer.fit_transform(adata.X) 
    if inplace:
        adata.obsm["X_ica"] = x_ica 
        adata.varm["ICs"] = ica_transformer.components_.T 
    else:
        return ica_transformer 
Fougere87 commented 4 years ago

Hi ! To answer about how it is useful, we are using ICA in our lab to a dataset of more than 100k cells, with a lot of complexity, and the main advantage of ICA against PCA is that it helps us detecting small populations of cells. As these small populations are not accounting for a lot of variance within the dataset, using a treshold on PCs, we discarded the PCs that would allow the separate them. Another advantage is that we do not make use of a selection of "highly variable genes" anymore, and use all genes expressed in more than 100 cells for the whole analysis... Doing the same and applying PCA gave us quite poor results..

We made use of Seurat implementation.. and I tried fastICA from sklearn once but I couldn't obtain similar results... I have not looked thoroughly into seurat's code tough...

Hope it helps ! Best

chris-rands commented 4 years ago

Thank you both. In terms of literature, see this paper (PMID: 30096299), in particular Figure S2

ivirshup commented 4 years ago

I looked through the R fastica package, and I think one of the main differences was the default tolerance. I've changed that and results seem a bit better. I don't think I have a great reference point to evaluate it though. Any chance you could provide a vignette of ICA being used with single cell data, so we can see if the python version can recapitulate the results?

fidelram commented 4 years ago

@chris-rands Any update on this? Does the code from @ivirshup gives you any meaningful results?

gokceneraslan commented 4 years ago

BTW there is a nice "Faster ICA" algorithm now, called Picard. It's implemented as a python package here:

https://pierreablin.github.io/picard/generated/picard.picard.html#picard.picard

It might be worthwhile to try this too.

ivirshup commented 4 years ago

@gokceneraslan, got around to giving picard a shot. For the pbmc3k dataset, it's a little slower than sklearns FastICA, though sklearn gives non-convergence warnings more often. Have you tried it before?

chris-rands commented 4 years ago

Thanks for the feedback and sorry for the delay. The code snippet using sklearn.decomposition.FastICA gave me quite similar results to PCA for my data in terms of the downstream UMAP visualisations (when I simply embedded 50 components in the neighbourhood graph). One difference is that the ICA was slower to compute. I am not confident to create a vignette, as I'm unclear what the 'correct' results should look like. I have not tried the picard implementation yet.

Fougere87 commented 4 years ago

Hi, I tried the snippet, with fastICA and picard, and with a number of cells higher than 30,000, the whitening step cannot be completed. This seems be due to some Lapack limitations. ValueError: Too large work array required -- computation cannot be performed with standard 32-bit LAPACK. I don't know how to get around this.... Best, Chloé

ivirshup commented 4 years ago

Sorry for the late response, completley forgot to post my response here.

@Fougere87, did that whitening issue occur with picard as well? I saw that with sklearn. I think we could get around that by whitening ourselves with ARPACK.

Picard and sklearn look pretty similar to me in a quick comparison. Below are top 16/30 components (ranked by Geary's C, autocorrelation on the connectivity graph) cell loadings on the pbmc3k dataset. The umap and connectivity matrix here were computed on top of a PCA – which I should maybe do differently. However I think the results are similar enough that it's probably not of consequence.

sklearn FastICA ![sklearn_ica](https://user-images.githubusercontent.com/8238804/68647787-d53a4d80-0572-11ea-8b95-cde9122824f1.png)
picard ICA ![picard_ica2](https://user-images.githubusercontent.com/8238804/68647808-e5eac380-0572-11ea-8485-71a770849cc9.png)