scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 602 forks source link

How to extract cells co-expressing certain genes of interest ? #490

Open aditisk opened 5 years ago

aditisk commented 5 years ago

Is there a way to extract cells that are co-expressing my genes of interest ? How can I make a UMAP plot showing these cells similar to the single gene UMAP plots ?

Thanks.

GMaciag commented 4 years ago

I know the question is quite old, but maybe someone else will stumble upon it and in that case, I'd like to give a solution I used with my data.

To check for expression you need to access raw matrix of counts in your data. That is, it can be log transformed and normalised, but shouldn't be scaled or regressed. Following many of the tutorials you should have the matrix in your .raw slot.

gene1 = 'XXX'
gene2 = 'YYY

adata.obs['CoEx'] = (adata.raw[:,'{}'.format(gene1)].X.todense() > 0) &
                    (adata.raw[:,'{}'.format(gene2)].X.todense() > 0)

That will add to your anndata object one more metric, which you can then use to colour your umap plot (i.e. sc.pl.umap(adata, color='CoEx')).

One thing quite annoying with this solution is that you'll end up with a meaningless colorbar on your umap plots. I welcome suggestions on how to improve it.

flying-sheep commented 4 years ago

Thank you! I hope this helps people stumbling upon this.

By now we have https://scanpy.discourse.group which is the better place for questions like that :+1:

About the colorbar thing, you could just deep-dive into the plot object and remove it:

ax = plt.subplot()
sc.pl.xx(..., ax=ax)
ax.images.im[-1].colorbar.remove()
LuckyMD commented 4 years ago

Hi @GMaciag,

This looks like a simple function that people may like to use. Do you want to write a small helper function for this maybe? This might be nice to add to sc.tl.

One way you could make it display nicely in sc.pl.umap() is by turning the values into pd.Categorical. In the end you want to show which cells are co-expressing your genes.

Also, this may be a good use for imputation methods. Otherwise you may struggle with the sparsity of the data.

GMaciag commented 4 years ago

Hi @LuckyMD,

Sure, I'll work on it, as time allows. Before however, I have a couple of questions.

  1. Do you want it as a separate .py file in the tools module (similar to _dendrogram.py)?
  2. I also found interesting to look at exclusive expression of one gene and not the other. Would you be interested in adding a function for that as well and if so, should be a separate one or somehow integrated with coexpression?
  3. Turning values into categorical works, however now I have problem that the True (coexpressing) cells are not always plotted on top. Do you know how to do it in scanpy? I tried by setting pd.Categorical(ordered = True), however, that doesn't help.
  4. Could you elucidate on how you want to implement the imputation methods? I've never used them myself. Is there anything available in scanpy already?

And thanks @flying-sheep for showing how to remove the colourbar. I wanted to do it for some of my other plots, so that really helps.

GMaciag commented 4 years ago

Regarding Q3 from my previous comment, I tried few things and I think it is the easiest to keep the coexpression data as continuous and remove the colorbar afterwards.

I have, however, correction to what what was written before. ax.images.im[-1].colorbar.remove() doesn't work (in the case of umap) since it is a scatter plot. ax.collections[-1].colorbar.remove() needs to be used instead.

LuckyMD commented 4 years ago

Hey! Sorry for the late reply:

  1. Yes, a separate file, please.
  2. Put both in the same function. I wouldn't call it coexpression though. Something along the lines of cell_selection_by_genes() or just cell_selection().
  3. There is a sort_order keyword for plotting which works for continuous covariates. I imagine that should work.
  4. That may be overkill... but it would definitely be interesting. I think MAGIC is in sc.external and DCA is also easily usable in this framework. They are not part of the core package though.
ilcink commented 3 years ago

Hey! This

gene1 = 'XXX'
gene2 = 'YYY

adata.obs['CoEx'] = (adata.raw[:,'{}'.format(gene1)].X.todense() > 0) &
                    (adata.raw[:,'{}'.format(gene2)].X.todense() > 0)

was very helpful, thanks for posting, @GMaciag !

Is there a similar way to split the dataset into gene1/gene2+ and gene1/gene2- cells, so that both of the categories are within 'CoEx' and you could compare them to each other just like leiden clusters for example?

GMaciag commented 3 years ago

Hi @natalkon

If you mean to plot them as categories instead of a continuous scale, then the solution is to turn the values into pd.Categorical like LuckyMD mentioned.

coex = (adata.raw[:,'{}'.format(gene1)].X.todense() > 0) &
             (adata.raw[:,'{}'.format(gene2)].X.todense() > 0)
coex_list = [item for sublist in coex.tolist() for item in sublist]
adata.obs['CoEx'] = pd.Categorical(coex_list, categories=[True, False])

Like I mentioned before, one problem is that the True (coexpressing) cells are not always plotted on top when plotting both categories with umap. A better way of visualising is to make use of the groups parameter:

sc.pl.umap(adata, color='CoEx', groups=[True])

It will then grey out all the False cells and put them in the background.

GMaciag commented 3 years ago

Hi

It's been a long time since the initial talk about writing a function addressing this issue, but with the whole pandemic situation I totally forgot about it. But the recent comments made me remember and so I finished writing it and put it in the PR #1657.

I hope it can be useful to other people and maybe even included in the main package :)

ilcink commented 3 years ago

Thank you very much, @GMaciag !!