scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.82k stars 586 forks source link

sc.tl.rank_genes_groups to rank only specific genes of interest? #748

Open Olivia117 opened 4 years ago

Olivia117 commented 4 years ago

After clustering cells with a restricted gene set, I would like to see the contribution of "specified genes" in subgrouping the cells. "sc.tl.rank_genes_groups" uses all the genes in the background for the statistical calculations. I want to test it for all the Louvain groups against the rest of the data (so, groups='all', reference='rest'). Is there a way, we can specify the gene list? (I tried using the 'use_raw' of sc.tl.rank_genes_groups to subset). I don't find any other options to restrict gene lists here.

`subset_genes = ldata[:, ['Gabrg1', 'Ntrk1', 'Htr1a', 'Plaur', 'Il31ra', 'Gabrg3', 'P2rx3', 'Oprk1', 'P2ry1', 'Cnih3']]

sc.tl.rank_genes_groups(ldata, 'louvain', method='wilcoxon', use_raw= 'subset_genes', n_genes = 100)

sc.pl.rank_genes_groups(ldata, n_genes=15, sharey=False)

LuckyMD commented 4 years ago

Hi,

sc.tl.rank_genes_groups() treats each gene as an independent variable in the test. Thus, the only difference if you were to subset the genes would be that the multiple testing correction would be over fewer genes. You can also do that manually by looking at the adata.uns['rank_genes']['pvals'][CLUSTER_ID] and doing the multiple-testing correction yourself over the gene set you care about. However, the p-values of this test are inflated anyway, and therefore they should be used with caution.

You should be able to extract your test results of interest by doing something along the lines of this:

CLUST_ID = 0
gene_list = ['Gabrg1', 'Ntrk1', 'Htr1a', 'Plaur', 'Il31ra', 'Gabrg3', 'P2rx3', 'Oprk1', 'P2ry1', 'Cnih3']
gene_mask = [gene in gene_list for gene in adata.uns['rank_genes']['names'][CLUST_ID]]
results = adata.uns['rank_genes']['pvals'][CLUST_ID][gene_mask]

Then you need to perform multiple testing correction over those p-values. And that would be the result you would get from a subsetting. However, multiple-testing over only those values, assumes you will not use the other gene results for anything. If you use the other gene results for something else, then you should just use the results of sc.tl.rank_genes_groups() as it is.

Also note that sc.tl.rank_genes_groups() doesn't really tell you the contribution of genes to the clustering, but it just tells you what genes are characteristic of a cluster in the output. Those aren't the same things. For example, one gene could have been responsible for partitioning the data into 2 parts, but then after subclustering those 2 parts it may not show up as a marker gene in the sc.tl.rank_genes_groups results.

Olivia117 commented 4 years ago

Hi, Thank you very much for such a detailed explanation. It really helps. I've two more questions:

1). Can we do this gene subsetting with Logistic regression (where no multiple testing correction is involved)?

2). Since you nicely pointed out sc.tl_rank_genes_groups doesn't tell about the contribution of genes in the clustering- are there tools that can be integrated with ScanPy to do this job? (for example, diffxpy or MAST). I'm really interested in the differential gene testing to predict the markers (from a gene subset used for clustering).

I shall be grateful if you can suggest a method.

LuckyMD commented 4 years ago

Hey!

Logistic regression currently doesn’t output p-values i believe. Either way, it also treats genes as independent variables, so no need for subsetting here either.

As for your second question, you may have misunderstood my answer. sc.tl.rank_genes_groups() gives you marker genes just as MAST or diffxpy do (but with more complex models that can incorporate covariates). I was just commenting on the interpretation of marker genes. They tell you which genes characterize a cluster, but don’t necessary tell you which genes contributed most to the global split of clusters that was generated (which i thought you were asking about). That type of question would require a feature importance metric on a multiclass classification problem. For example training a random forest to predict the clusters and then using gini importance to rank the features. That is not a common question asked of single-cell data though, so there’s no tool i’m familiar with that does this. I hope that is clearer.

susanGhaderi commented 3 years ago

Hi, I would like to use scanpy for differential gene expression between four-time points. So for each time point, I have 250 cells and 20000 genes and the gene expression matrix. I do not know how should I use annData to use scanpy to find differential gene expression between each time point. Could you please help me with this problem?