differential expression analysis and GSEA

Sophon-0 commented 8 months ago

Hello,

I was using decoupleR for pathway analysis in single cell.

I tried a few lines:

msigdb <- decoupleR::get_resource("MSigDB")
msigdb = msigdb[msigdb$collection =='hallmark', ]

# Enrichment with Over Representation Analysis (ORA)
mat <- as.matrix(seurat_object@assays$RNA@data)
result <- run_ora(
  mat=mat,
  network=msigdb,
  .source='geneset',
  .target='genesymbol'
)

Where should I specify the case group and the control group ? Where do we specify the statistics chosen ? Looking at the output, it seems to be considering every single cell as a condition.

PauBadiaM commented 8 months ago

Hi @AgentScientist,

Pathway analysis can be performed at two levels: at the observation (cell) or contrast. In your case, you seem interested in the difference between conditions so first you will have to generate contrast level gene statistics using your favorite differential expression framework (limma, deseq2, edger, etc.). Because you are working with single cell data, it is a good practice to perform DEA at the pseudobulk level to reduce the overinflation of p-values (check this ref if you want to read more about it). Once you have performed DEA on your pseudobulk profiles you can perform any enrichment analysis that you want. In this vignette, we show how to perform pseudobulking, DEA with DESeq2 and different enrichment analyses you can perform. Unfortunately, it is only available in the python version of decoupler, but should be easy to follow/adapt in R too. Hope this is helpful!

Sophon-0 commented 8 months ago

Hello, So I have in total 6 samples from 3 donors. 2 conditions: normal versus treatment.
So for each donor, there are 2 samples, one normal, one treated.
I want to compare cell type X in treatment versus control.

"We know that single cells within a sample are not independent of each other, since they were isolated from the same environment. If we treat cells as samples, we are not testing the variation across a population of samples, rather the variation inside an individual one. "

Would this really be a problem ? I mean, that cell type X in treated group is in a totally different state than in control group, and it's a 3 vs 3. I got biological more sound result using Seurat::FindMarkers + PIANO than using the pseudobulk method. Maybe I'm missing something.

At the step of pseudo bulk generation. At the sample column, is it ok to choose sample, knowing that some samples come from the same donor?

pdata = dc.get_pseudobulk( adata, sample_col='sample', groups_col='cell_type', layer='counts', mode='sum', min_cells=10, min_counts=1000 )

PROGENy works perfectly fine. I'm just having some issues with the MsigDB analysis, which missed some key pathways that I got using Seurat::FindMarkers + PIANO

PauBadiaM commented 8 months ago

Hi @AgentScientist,

I would strongly advice against performing DEA at the single-cell level followed by enrichment analysis. Even if you have a balanced experimental design, still samples might contribute different number of cells which will bias the test, using single-cells as observations breaks the assumption of any DEA test that samples are independent from each other, and also it overinflates the obtained p-values (so you will get many false positives).

Regarding your pseudobulk results, one thing you might try is to be less strict with the gene filtering with dc.plot_filter_by_expr. Are you keeping enough genes after filtering for DEA testing?

Regarding your patient vs sample metadata, you can do two things: i) use the patient id as your sample col (you summarize per patient), ii) you include the patient id as a covariate to the DESeq2 model to correct for that.

Hope this is helpful!

Sophon-0 commented 8 months ago

Hello, it was a geneset issue ! I used the standard MSigDB 1329 pathways, and I got same result as before.

Sophon-0 commented 8 months ago

"Regarding your patient vs sample metadata, you can do two things: i) use the patient id as your sample col (you summarize per patient), ii) you include the patient id as a covariate to the DESeq2 model to correct for that."

1) So I have these 2 columns in my adata: "Donor" and "sample" (for each donor, there are 2 samples, one normal, one treated)

pdata = dc.get_pseudobulk(
      adata,
      sample_col='**Donor**',
      groups_col='cell_type',
      layer='counts',
      mode='sum',
      min_cells=10,
      min_counts=1000 
)

If I do this, in the resulting pdata, for some reason I lose a few columns: the conditions (treated/normal) columns for ex. I keep all the columns only when I do sample_col='sample'

I actually think using "sample" is fine, since we don't want to mix the cell type within donors.

2) Do you also mean to include "Donor" in the DeseqDataSet function ? I don't see any options for this.

I did this:

dds = DeseqDataSet(
    adata=DCs,
    design_factors=['conditions',"**Donor**"],
    ref_level=['conditions','Normal'],
    refit_cooks=True,
    inference=inference 
)

3) For the ANOVA: "In order to have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples"

I see in the example that there is no association between any PC and the condition column ("disease"). Even in the PCA plot, we don't distinguish between COVID-19 and Normal. Shouldn't we see some differences ?

PauBadiaM commented 5 months ago

Hi @Sophon-0,

Sorry I somehow missed this message. It seems like you are using the python version of decoupler, in the future its better to open an issue in its repository: https://github.com/saezlab/decoupler-py

Doing it per sample is fine, and as you did in the DESeq2 code too.

Regarding the PC question, not necessarily. PCA is telling us that the differences between cell types is bigger than between conditions, which is expected, a given cell type is not going to radically change and become a different one in disease (unless when working with cancer data), it is just going to express different molecular programs but it will maintain its cellular identity.

Again sorry for the delay on the reply and hope this was useful!

grst commented 5 months ago

Hi @PauBadiaM,

slighly related question that I also brought up on twitter: When testing a contrast, I could also compute the signatures with decoupler first (e.g. on logTPM) values and compare the signature scores using a linear model -- or first perform differential gene expression and apply decoupler on the fold changes (as you suggested above).

Do you have any opinion on what's better and why?

PauBadiaM commented 4 months ago

Hi @grst,

Thanks for the question! Both are valid, but I prefer performing first the contrast and then the enrichment because:

When performing sample/patient pseudobulking in single cell we use the actual true replicates instead of individual cells, which is more conservative due to the over-inflation of p-values.
Since DEA frameworks make it easy to correct for other covariates, it is something that you don't need to worry about during downstream analyses.
The statistical test to compare activity scores will depend on the distribution obtained from the specific enrichment method used. Should you use a t-test, wilcoxon, etc.? Instead, it is easier to use the ones implemented in DEA frameworks a priori.

Hope this is helpful!

saezlab / decoupleR

differential expression analysis and GSEA #119