Interpreting Results from SCPA Output

jackbibby1 / SCPA

R package for pathway analysis in scRNA-seq data

https://jackbibby1.github.io/SCPA/

GNU General Public License v3.0

62 stars 6 forks source link

Interpreting Results from SCPA Output #18

Closed troks27 closed 1 year ago

troks27 commented 1 year ago

Hello,

I really appreciate your work on this package. I'm trying to get oriented on how to interpret the outputs from SCPA. My experimental setup is as follows: 2 conditions, DIAB and NG, with clusters 0-23 in each condition. I would like to compare DIAB vs NG of each cell type, thus resulting in 24 pairwise comparisons. I have run the analysis both ways, so that DIAB is + FC and NG is + FC. However, in each of the comparisons, some of the same pathways have q-values above 0.

1) Thus, how am I supposed to know in which condition the pathway is actually changing?

2) Also, how do I select a minimum q-value cutoff for pathways that would be interesting?

Thank you so much for the guidance.

Code:

Idents(seurat_integrated_sct) <- "seurat_clusters"

pathways <- c("gobp", "kegg", "reactome")
sets <- msigdbr("Mus musculus") %>%
  filter(grepl(paste(pathways, collapse = "|"), gs_name, ignore.case = T)) %>%
  format_pathways()

seurat_integrated_sct_list <- SplitObject(seurat_integrated_sct, split.by = "sample")

scpa_out <- list()
for (i in 0:23) {

  NG_path <- seurat_extract(seurat_integrated_sct_list$NG, 
                            meta1 = "seurat_clusters", value_meta1 = i)

  DIAB_path <- seurat_extract(seurat_integrated_sct_list$DIAB, 
                              meta1 = "seurat_clusters", value_meta1 = i)

  print(paste("comparing", i))
  scpa_out[[i+1]] <- compare_pathways(list(DIAB_path, NG_path), sets)
  write.csv(scpa_out[[i+1]],paste0(overall_filepath,"/scpa_out_Cluster ",i,".csv"))
}

jackbibby1 commented 1 year ago

Hi,

Thanks -- hope you're finding useful. The qvals you get out of SCPA will always be positive because they're a transformation of the adjusted pval, and I would just interpret it being the larger the qval, the larger the pathway change between conditions.

For the first point -- in the compare_pathways() output, there's a fold change value that tells you if the pathway is changing in a particular direction i.e. if the genes of the pathway are enriched in a particular group. This is calculated as population1 - population2 in the comparison. I would say that you should only use this as a secondary informative value of pathway activity, and typically just use the qval as your statistic though, as SCPA assesses multivariate distributions as a primary statistic, which takes into account enrichment along with other measures. So here you will get a number of pathways that have large qvals, but no fold change, meaning that this pathway gene expression change is independent of overall mean enrichment e.g. high variance in one pathway versus the other. This is best seen in Figure 4D-J of our paper here, where arachodinic acid had a high qval, but no fold change, but is still shown to be important for T cell activation.

For the second -- you can use the adjusted pval in the SCPA output to judge pathway significance as a simple statistical filter (roughly translating to a qval of > 3). In general, I think the most informative approach is to plot the qvals and just consider pathways with the largest qvals as the ones that are most interesting for your conditions (so long as they're statistically significant).

Hope that helps. Just let me know if anything isn't clear

Jack

troks27 commented 1 year ago

Hey Jack,

Thanks for the reply. How would you recommend plotting the q-values? Using the plot_rank function?

Also, above, when you say "interpret it being the larger the qval, the larger the pathway change between conditions." And I know you said that FC is a secondary qualifier for the direction of change. So in general, if the outputs from SCPA are just telling me that a pathway is most likely changing, how can it tell me also if that pathway is altered more in population 1 or population 2?

jackbibby1 commented 1 year ago

Yup, you can use that -- it really just depends on what you're wanting to show. If you're wanting to visualise the comparison of a pathway(s) in a specific cluster/population, then I imagine the plot_rank() will be useful. If you want to visualise all your comparisons, then I would probably use something like the plot_heatmap() function. These were meant as fairly basic but useful visualisations of the overall output, so maybe another custom plot will be better in your case, depending on what you want to show.

I think your second question may just be more of a semantic point and depends on the question you're asking. For example, in a simple comparison of healthy/control vs disease, then by definition the pathway is 'altered' in the disease. If you just have two scenarios with a less obvious control/baseline group e.g. spleen vs lung tissue, then all you need to know is that pathway x in cluster y is different between the tissues... here it doesn't make much sense to say that a pathway is more different in spleen than lung because you're effectively using them both as references to each other (this is the same in any gene/pathway analysis and not unique to SCPA). Alternatively, if your question in the above comparison is asking what happens to a cell after migrating from the spleen into the lung (because you know that is happening), then you can say that pathway x changes when cluster y migrates into the lung.

troks27 commented 1 year ago

Awesome. Thank you again for all of the help. I am now going through and am attempting to analyze the output. I have generated a graph of the Q-Values for all of the clusters, as well as a custom heat map that has all of the Pathways with an Adjpval < 0.05.

I then exported each of the clusters from the heatmap (for example, the top of the Heatmap Cluster 3 output below). Do you have a recommendation on how I would go about ascertaining actionable information from these long lists of pathways?

It seems that there are many odd pathways that appear in the outputs. For example, the cell type in question is a neutrophil, but some of the highest ranked pathways include "GOBP_ISOTYPE_SWITCHING_TO_IGG_ISOTYPES," "GOBP_MYELIN_ASSEMBLY," etc, including many other pathways that have nothing to do with neutrophils. I was wondering what may be introducing all of this noise?

Q-Values.pdf Adjpval<0.05 Heatmap.pdf

jackbibby1 commented 1 year ago

RE the odd pathways -- this could be a few different things, but is just generally an issue with having complex and overlapping systems (with the assumption that the cell identification from the clustering is accurate). It could come from pathways that have some overlapping genes e.g. genes that regulate isotope switching in B cells and also neutrophil degranulation in neutrophils (I normally find this is the most common explanation). It could also be that these sets of genes are just better defined as regulators of certain processes in certain cells and they haven't been studied as much in others. It could be that the pathway is quite small, and then you're more likely to introduce noise from a few genes -- in the default parameters of the compare_pathways() comparison we filter out gene sets that have fewer than 15 genes to try to counteract this. Just looking at the couple of gene sets that you mention, these are on the edge of the cut-off (having 22 and 15 genes in them), so this could be the explanation.

RE interpretation -- this is really just down to your judgement and question you're asking. In Fig 5D and F of our paper here, we wanted to find a signature that was specific to a certain tissue/cell type, so we looked at the pathway that had the highest variance across all conditions (e.g. apply(df, 1, var)). If you just want the most statistically interesting signatures, you can take some classical approach and use the topmost pathways to decide on what makes sense in your system. It looks like you're just using the GOBP gene sets here? I would probably use the Hallmark/KEGG/Reactome pathways -- I think these are typically more helpful to interpret, and are generally less noisy

Ramirkhah commented 1 year ago

Hi Jack, Thanks so much for developing this package. I did comparison between two groups of cells Wild-type and Mutated:

wt_mut <- compare_pathways(samples = list(wt, mut),
                             pathways = hkr_sets)

## Visualizing results
wt_mut.1 <- wt_mut %>%
  mutate(color = case_when(FC > 5 & adjPval < 0.01 ~ '#6dbf88',
                           FC < 5 & FC > -5 & adjPval < 0.01 ~ '#84b0f0',
                           FC < -5 & adjPval < 0.01 ~ 'mediumseagreen',
                           FC < 5 & FC > -5 & adjPval > 0.01 ~ 'black'))

This is the result Rplot How I can interpret this plot, in terms of positive and negative enrichment. Are the pathways in the right site positively enriched in mut group?! How did you decide on the dash line? is everything between two dash lines showing significant pathways with no enrichment? In my case should I increase it to higher value rather 5?

Really appreciate your help.

jackbibby1 commented 1 year ago

Hi,

There's some information on this if you run ?compare_pathways()... "If only two samples are provided, a fold change (FC) enrichment score will also be calculated. The FC statistic is generated from a running sum of mean changes in gene expression from all genes of the pathway. It's calculated from average pathway expression in population1 - population2, so a negative FC means the pathway is higher in population2." So your left side should be enriched in your mutant population here.

The dashed line I included at 5 on the manuscript was somewhat arbitrary, but this translates to a 5 fold difference in pathway expression in the above calculation, which we considered 'enriched'. Yup, you're right -- we only included this to make a point that some pathways are significantly differentially distributed, but not enriched in any direction. I wouldn't get too worried about assigning a fold change cut off, and instead just rank pathways based on the qvals generated from SCPA.

Jack