scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

gene co-expression networks #72

Open seth-ament opened 6 years ago

seth-ament commented 6 years ago

We are very impressed with the scalability of scanpy. We are interested in performing gene co-expression clustering on large single-cell RNAseq datasets. This typically involves calculating pairwise correlations between genes, then using these correlations as distance metrics for hierarchical and k-means clustering. Does scanpy already support these kinds of analyses?

jorvis commented 6 years ago

I would be very interested in helping to add any of these if they do not currently exist or aren't already in development.

falexwolf commented 6 years ago

Dear both, sorry about the late response... I've become the father of twins in the past weeks... Will respond much more quickly again soon.

Yes, we're working on this and will provide one solution within the next few days. @tcallies could you push what you wrote?

You can then tell me if this does the job for you.

tcallies commented 6 years ago

Dear both,

correlation matrices are available now. Following our usual split into tools and plotting, you can call

sc.tl.correlation_matrix(adata, name_list, n_genes=20, annotation_key=None, method='pearson')

for correlation matrix calculation. I have left out a few parameters because I originally wrote the function to conveniently plot results from DE testing, but the basic functionality is as follows:

adata is the usual AnnData object you are working with.
name_list holds the gene names and must be specified.
n_genes truncates name_list if the number specified is smaller than the length of the list, so set this high enough if you want to work with large data.
annotation_key lets you specify a string that serves as the key in the AnnData object under which results are stored. By default, the key is "Correlation_matrix".

The method basically wraps the pd.DataFrame.corr method, which allows you to specify the correlation method ('pearson', 'spearman', 'kendall').
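
For readers who only need the underlying call, here is a minimal sketch of what pd.DataFrame.corr does on a cells-by-genes table. It is not the scanpy function itself; the toy data and the gene names geneA, geneB and geneC are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy expression table: rows are cells, columns are genes (hypothetical names).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
expr = pd.DataFrame({
    "geneA": base + rng.normal(scale=0.1, size=200),  # follows `base`
    "geneB": base + rng.normal(scale=0.1, size=200),  # follows `base`
    "geneC": rng.normal(size=200),                    # independent noise
})

# DataFrame.corr correlates columns pairwise, so the result is genes x genes.
corr_pearson = expr.corr(method="pearson")
corr_spearman = expr.corr(method="spearman")

print(corr_pearson.round(2))
```

Because the correlation runs over columns, the table has to be oriented with genes as columns before calling corr.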

I use it for smaller data so it has not been optimized for performance (yet), but I tested the method for 3k cells and 600 genes and ended up with a runtime of ~8 seconds. I hope that is conveniently fast enough for you (if not let us know).

After calling the tool, you can plot correlation matrices (using a wrapper for seaborn heatmap) by calling

sc.pl.correlation_matrix(adata, annotation_key=None)

This function basically just looks up the stored result in the AnnData annotation (again, if no key is specified, "Correlation_matrix" is the default).

Hope this does the job!

falexwolf commented 6 years ago

Cool, sounds great! Thank you! I will also play around with this. Why don't you add it to the documentation? Maybe here https://github.com/theislab/scanpy/blob/980aa00adca49f6aa994a6f870ad98c3ad9218af/scanpy/api/__init__.py#L60?

falexwolf commented 6 years ago

Ah! And we should also think about the naming convention here. Maybe gene_gene_correlation? We will have all kinds of correlation matrices floating around scanpy and we should have very specific naming conventions...

falexwolf commented 6 years ago

It will be hard to maintain an overview of what's going on with all the names that were not specific enough and had to be removed but still kept at some place to maintain backward compatibility.

falexwolf commented 6 years ago

@seth-ament @jorvis Having the correlation matrix, you then want to cluster it using hierarchical clustering, right? So, in order to achieve this, shall we add this functionality to clustermap, which currently clusters the expression matrix itself?

tcallies commented 6 years ago

I will certainly update my new code at least once today (probably more often), change the name, and add the documentation. I will let you know as soon as the name has changed.

jorvis commented 6 years ago

That sounds right, yes. Looking forward to this being available.

seth-ament commented 6 years ago

Yes, thanks so much. This looks great. Typically, we cut the hierarchical tree to produce gene clusters, summarize these clusters as the mean expression of the genes within the cluster, then pass the mean expression profile to plotting functions like coloring tSNE plots and violin plots.
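
The workflow described above can be sketched with pandas and SciPy alone. This is an illustration under stated assumptions, not scanpy functionality: the data are simulated, the gene names g0 to g5 are hypothetical, 1 - Pearson correlation is used as the distance, and average linkage with a cut into two clusters is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_cells = 300

# Simulate two hidden expression programs; each drives three genes.
prog1, prog2 = rng.normal(size=(2, n_cells))
cols = {}
for i in range(6):
    driver = prog1 if i < 3 else prog2
    cols[f"g{i}"] = driver + rng.normal(scale=0.3, size=n_cells)
expr = pd.DataFrame(cols)  # cells x genes

# 1) Gene-gene correlation, converted to a distance (assumption: 1 - r).
corr = expr.corr(method="pearson")
dist = 1.0 - corr.to_numpy()
dist = (dist + dist.T) / 2.0   # remove floating-point asymmetry
np.fill_diagonal(dist, 0.0)

# 2) Hierarchical clustering on the condensed distance matrix.
tree = linkage(squareform(dist, checks=False), method="average")

# 3) Cut the tree into a fixed number of gene clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
gene_cluster = pd.Series(labels, index=expr.columns, name="cluster")

# 4) Summarize each cluster as the mean expression of its genes per cell.
module_scores = expr.T.groupby(gene_cluster).mean().T
module_scores.columns = [f"module_{c}" for c in module_scores.columns]
```

Each column of module_scores is one value per cell, so it could be copied into adata.obs and passed as color to sc.pl.tsne or as keys to sc.pl.violin, provided its rows are in the same order as the cells in adata.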

jorvis commented 6 years ago

Any updates here? I'd love to add this to an analysis tool UI I'm working on (and presenting at a conference this weekend). Very happy to promote scanpy there.

wyattmcdonnell commented 6 years ago

Hi all—does anybody have a skeleton snippet they're willing to share here on how to run this in the current version of Scanpy? Thanks!

falexwolf commented 6 years ago

Unfortunately, all of this discussion here was not really further pursued, I have to admit.

In principle, these are very simple things. However, I'm a bit afraid of offering a canonical function as I fear that there are also a lot of bad ways of visualizing gene correlation plots and I don't feel capable of judging this. If no one else wants to make a pull request for that (maybe using what @tcallies already did, but I fear it's not really serving the purpose of the discussion here: here, here) it would be cool if someone sent me an example case, which clearly shows what you want.

Maybe @jorvis, you can send images for the examples you have in mind?

flying-sheep commented 5 years ago

It’s still not in the docs, and by now also broken… #392

hyjforesight commented 1 year ago

Hello @tcallies @falexwolf @flying-sheep Somehow, it looks like sc.tl.correlation_matrix was removed from scanpy?

sc.tl.correlation_matrix(adata_sub2, name_list=['SMARCA4', 'TP53'], n_genes=20, annotation_key=None, method='pearson')
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_3196/1400689712.py in <module>
----> 1 sc.tl.correlation_matrix(adata_sub2, name_list=['SMARCA4', 'TP53'], n_genes=20, annotation_key=None, method='pearson')

AttributeError: module 'scanpy.tools' has no attribute 'correlation_matrix'

mssher07 commented 1 year ago

same error, seconded -- is there an alternative approach built in?
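
One possible alternative that does not depend on the removed function is to compute the matrix directly with NumPy. The helper below, gene_gene_correlation, is a hypothetical name and not part of scanpy. It assumes a dense cells-by-genes array, and the expression values are random placeholders.

```python
import numpy as np
import pandas as pd


def gene_gene_correlation(X, var_names, genes):
    """Pearson correlation between selected genes (hypothetical helper).

    X is a dense cells x genes array; var_names names the columns of X.
    """
    var_names = pd.Index(var_names)
    idx = var_names.get_indexer(genes)
    if (idx < 0).any():
        missing = [g for g, i in zip(genes, idx) if i < 0]
        raise KeyError(f"genes not found: {missing}")
    sub = np.asarray(X)[:, idx]
    # rowvar=False treats columns (genes) as the variables to correlate.
    corr = np.atleast_2d(np.corrcoef(sub, rowvar=False))
    return pd.DataFrame(corr, index=list(genes), columns=list(genes))


# Placeholder data standing in for an expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
var_names = ["SMARCA4", "TP53", "GAPDH", "ACTB"]

# With an AnnData object, X would be adata.X (converted to dense with
# .toarray() if it is sparse) and var_names would be adata.var_names.
result = gene_gene_correlation(X, var_names, ["SMARCA4", "TP53"])
print(result)
```

The returned DataFrame can be handed to any heatmap function or reused for clustering.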

mys721tx commented 1 year ago

Looks like correlation_matrix stopped being exported when _top_genes.py was renamed.

mys721tx commented 1 year ago

A very dodgy workaround would be

from scanpy.tools import _top_genes
from scanpy.plotting import _anndata

_top_genes.correlation_matrix(adata, names, annotation_key=None, method='pearson')

_anndata.correlation_matrix(adata, groupby='leiden')

mhorlacher commented 1 month ago

Any updates on this? Has correlation_matrix() been removed?