satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Clusters with different proportions - Data Integration #8797

Closed jcasadogp closed 4 months ago

jcasadogp commented 4 months ago

Hello,

I am running a single-cell analysis and want to compare three conditions: A, B, and C. Condition seems to be the main source of variation, as the UMAP shows three clearly separated islands of cells. When I cluster the data, I obtain clusters that are exclusive to one of these three conditions, and increasing the resolution does not solve it.

[Images: treatment_cluster_resolutions, treatment_cluster_resolutions_barplots]

What can I do to continue the analysis? I want to do DGE and pathway analysis to compare these three conditions.

Thank you in advance!

morallawwithin commented 4 months ago

You need to integrate across conditions and rerun the PCA and clustering on the integrated dataset:

https://satijalab.org/seurat/articles/integration_introduction.html

jcasadogp commented 4 months ago

Thank you very much @morallawwithin, I did not know about this functionality.

Does this mean that having different per-sample proportions in each cluster is a problem prior to DGE and pathway analysis? Should I always integrate across the condition I want to compare before further analysis? Or are cluster proportions "irrelevant" when interpreting DGE results?

For example, in the following image, I want to do a per-cluster analysis comparing EU vs US samples. Would it be correct to do so with these proportions? In an analysis like this, I may find that something (gene X or pathway Y) is upregulated in EU samples in cluster 0. How would that be possible if cluster 0 is more abundant in US samples? Am I missing something?

[Image: cluster_proportions]

Thank you very much,

Julia

morallawwithin commented 4 months ago

You should always do integration for samples with different conditions/backgrounds. You will get new clusters that will be more even in proportion.

Seurat does not have tools to analyse composition (yet). A completely different single-cell package named Cacoa does, however, if you are interested in that.

Still I recommend integration.

jcasadogp commented 4 months ago

Thank you very much, I think it makes sense to do it like that!

Do you have any paper or official tutorial that supports that procedure? I would need to justify my decision.

morallawwithin commented 4 months ago

https://doi.org/10.1038/s41576-023-00586-w for the need and other tools for compositional analysis.

https://www.biorxiv.org/content/10.1101/2022.03.15.484475v1 for Cacoa

mhkowalski commented 4 months ago

Hi,

We like to think of integration as cell type harmonization. We use integration to find the most similar cell types across datasets, and then perform differential expression on clusters identified after integration. This way you are learning the effect of your condition of interest on gene expression across similar cell types (in contrast to bulk RNA-seq, where you may conflate gene expression changes with changes in cell type composition). I would not recommend performing differential expression on clusters that are clearly driven by the condition effect.
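To illustrate why a gene can be upregulated in EU cells of a cluster that contains more US cells, here is a minimal sketch in plain Python (not Seurat's R API). All values, the gene, and the helper names are hypothetical, and the log2 ratio of means is only a stand-in for a real differential expression test. The point is that the comparison is made per cell within the cluster, so the number of cells each condition contributes does not set the direction of the change.

```python
from math import log2
from statistics import mean

# Hypothetical normalized expression of one gene ("gene X") for the cells
# of a single integrated cluster, grouped by condition. Values are made up.
cluster0_expression = {
    "EU": [2.0, 2.4, 1.8, 2.2],                      # 4 cells
    "US": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.2, 0.8],  # 8 cells
}


def cell_counts(groups):
    """Number of cells each condition contributes to the cluster."""
    return {condition: len(values) for condition, values in groups.items()}


def log2_fold_change(groups, numerator, denominator, pseudocount=1.0):
    """Log2 ratio of mean per-cell expression between two conditions.

    The pseudocount is an illustrative choice to avoid division by zero;
    this is not the statistic computed by any particular package.
    """
    mean_num = mean(groups[numerator]) + pseudocount
    mean_den = mean(groups[denominator]) + pseudocount
    return log2(mean_num / mean_den)


counts = cell_counts(cluster0_expression)
lfc_eu_vs_us = log2_fold_change(cluster0_expression, "EU", "US")

print(counts)        # US contributes twice as many cells as EU
print(lfc_eu_vs_us)  # positive: higher per-cell expression in EU
```

In this toy cluster, US cells are more numerous, yet the average EU cell expresses the gene more strongly, so the fold change is positive. Composition (how many cells) and expression (how much per cell) are separate questions.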

It is OK if the clusters aren't 100% balanced, as in the bar graph you show. That might represent true variability in cell type composition that you could analyze with other tools, as @morallawwithin points out. It's still valid to perform differential expression on these clusters, though you should convince yourself that this clustering is not driven by a technical effect.
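As a quick way to inspect how balanced clusters are, the following minimal sketch (plain Python, hypothetical labels and helper name) computes the fraction of each condition's cells that falls into each cluster. Normalizing within each condition is an assumption made here so that conditions with different total cell numbers remain comparable; it is a descriptive summary only, not a statistical test of compositional differences.

```python
from collections import Counter

# Hypothetical (cluster, condition) label for each cell after integration.
cells = (
    [("0", "EU")] * 2 + [("1", "EU")] * 2    # 4 EU cells
    + [("0", "US")] * 6 + [("1", "US")] * 2  # 8 US cells
)


def cluster_proportions(labelled_cells):
    """Fraction of each condition's cells assigned to each cluster."""
    cells_per_condition = Counter(cond for _, cond in labelled_cells)
    cells_per_pair = Counter(labelled_cells)
    return {
        (cluster, cond): n / cells_per_condition[cond]
        for (cluster, cond), n in cells_per_pair.items()
    }


proportions = cluster_proportions(cells)

for (cluster, cond), fraction in sorted(proportions.items()):
    print(f"cluster {cluster}, {cond}: {fraction:.2f}")
```

Here cluster 0 holds half of the EU cells but three quarters of the US cells. A difference like this is the kind of pattern that may reflect real compositional variation or a technical effect, which is what needs to be checked before interpreting it.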