openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License
25 stars 11 forks source link

GEX/FB intersection module #519

Closed Isabelle-b closed 5 months ago

Isabelle-b commented 8 months ago

Hey all,

As discussed here is the ticket to create a module allowing intersection between GEX and FB. Here is the code we were using:

image image image image image image image

Thanks!

Isabelle

rcannood commented 8 months ago

Just to verify, the proposed component will subset a mudata to contain only the intersecting set of cells across modalities, right?

A concern here is that if a lot of high quality data might be removed from one modality if something went wrong with the other modality. Depending on how much of the data is removed, I think the component should produce an error. For example, if more than 80% of the cells in the RNA modality are removed, the component should error. What would you suggest?

rcannood commented 8 months ago

Also, would it be possible to copy paste the code instead of screenshots? This makes it a lot easier for us to work with the code.

Isabelle-b commented 8 months ago

Hey Robrecht,

Thanks! Yes indeed it would be nice to keep the info on what amount of cells are removed from each layer and which cells. And potentially add a warning message in case. It could even be all on same object as we commonly look at both the intersection and non-intersection ones.

As a side note, the filtering on FB data is currently super lenient, so we will likely remove lots of cells for which some antibody info is still there but no good quality RNA is detected.

Here are the chunks I used:

import scanpy as sc
import muon as mu

mdata = mu.read_h5mu(path)
mdata = mdata[(mdata.obs.index.isin(mdata["rna"].obs.index))&(mdata.obs.index.isin(mdata["prot"].obs.index))]
mdata.update()

# remove calculated layers and re-run with new subset
del mdata['rna'].var['highly_variable']
del mdata['rna'].uns['pca_variance']
del mdata['rna'].obsm['X_pca']
del mdata['rna'].obsm['X_umap']
del mdata['rna'].varm['pca_loadings']
del mdata['rna'].obsp['distances']
del mdata['rna'].obsp['connectivities']
del mdata['rna'].uns['neighbors']

# remove calculated layers and re-run with new subset
del mdata['prot'].uns['pca_variance']
del mdata['prot'].obsm['X_pca']
del mdata['prot'].obsm['X_umap']
del mdata['prot'].varm['pca_loadings']
del mdata['prot'].obsp['distances']
del mdata['prot'].obsp['connectivities']
del mdata['prot'].uns['neighbors']

mdata['rna'].uns['pca_variance'] = {}
del mdata['rna'].uns['pca_variance']

mdata['prot'].uns['pca_variance'] = {}
del mdata['prot'].uns['pca_variance']
mdata.update()

mdata['rna'].uns['log1p']['base'] = None

tmp_rna = mdata['rna'].X.copy()
mdata['rna'].X = mdata['rna'].layers['log_normalized']

hvg_object = sc.pp.highly_variable_genes(mdata['rna'],  flavor = "seurat", batch_key = "sample_id", subset=False, inplace=False) # careful, 'layer' option not working!
mdata['rna'].var['highly_variable'] = hvg_object['highly_variable']
sc.tl.pca(mdata['rna'], n_comps = 25, copy = False, use_highly_variable = True)
mdata['rna'].X = tmp_rna

tmp_prot = mdata['prot'].X.copy()
mdata['prot'].X = mdata['prot'].layers['clr']

sc.tl.pca(mdata['prot'], n_comps= 25, copy=False, use_highly_variable = False) 

mdata['prot'].X = tmp_prot

mdata.write(path_v2)

Main library versions: scanpy==1.9.2 anndata==0.9.2 umap==0.5.3 numpy==1.22.4 scipy==1.6.3 pandas==1.5.3 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.10.6 louvain==0.8.1 pynndescent==0.5.10

I have everything in my project folder. I can point it to you too.