immunogenomics / harmony

Fast, sensitive and accurate integration of single-cell data with Harmony
https://portals.broadinstitute.org/harmony/
Other
513 stars 98 forks source link

Reordering PCs? #245

Closed yesitsjess closed 4 months ago

yesitsjess commented 6 months ago

Hi, this is not an issue but more a question relating to single cell data analysis in Seurat. I think it is just a more general question about integration so I've posted here. Please let me know if I should be asking somewhere else.

After integration with harmony some of my earlier PCs explain less variance than later PCs. elbow_harmony_40dims

I get this by running:

obj <- RunHarmony(obj, group.by.vars=c("samp.no"), reduction="pca", assay.use="SCT", reduction.save="harmony")
ElbowPlot(obj, ndims=40, reduction="harmony")

To select the number of PCs to use for UMAP/TSNE/clustering I take the larger value of: 1 .The point where the principal components only contribute 5% of standard deviation and the principal components cumulatively contribute 90% of the standard deviation.

  1. The point where the percent change in variation between the consecutive PCs is less than 0.1%.

I then use this pcs (in this case 27) for the following:

obj <- RunTSNE(obj, reduction.name="integrated_tsne", reduction="harmony", assay="SCT", dims=1:pcs)
obj <- RunUMAP(obj, reduction.name="integrated_umap", reduction="harmony", assay="SCT", dims=1:pcs)
obj <- FindNeighbors(obj, reduction="harmony", dims=1:pcs, graph.name=c("int_SCT_nn", "int_SCT_snn"))
obj <- FindClusters(obj, graph.name="int_SCT_snn", resolution=seq(0.2, 2, 0.2), cluster.name=paste0("integrated_SCT_snn_res", seq(0.2, 2, 0.2)))

But I'm wondering if I should be reordering my PCs based on explained variance after harmony integration? And then selecting the specific PCs to use (rather than using all of 1 to 27)?

ilyakorsunsky commented 4 months ago

Hi @yesitsjess, this is a really interesting philosophical question. Typically, PCs are ordered by variance and so we "take the top N PCs" for feature selection. However, you don't need to necessarily select consecutive PCs. Some users of Harmony with scATACseq carefully select which non-consecutive PCs to use for Harmony and downstream analysis, carefully investigating which PCs encode biology vs noise. For your purposes, I would suggest using whichever method you like to select the relevant PCs after Harmony and then use those potentially non-consecutive PCs in all your downstream tasks.

pcs_idx = my_selection_method(...)
obj <- RunTSNE(obj, reduction.name="integrated_tsne", reduction="harmony", assay="SCT", dims=pcs_idx)