satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

about NormalizeData, ScaleData, PCA, CCA .. #950

Closed tanasa closed 5 years ago

tanasa commented 5 years ago

Dear Seurat authors and contributors,

I have just started reading the Seurat documentation for scRNA-seq, and I would appreciate your answers and insights on the following:

1 After NormalizeData(), why is ScaleData() needed?

2 Do FindVariableGenes(), RunPCA() and FindClusters() work on Normalized_Data or on Scaled_Data?

3 Is ScaleData() absolutely needed in scRNA-seq analysis?

4 Does RunCCA() work on the Normalized_Data or on the Scaled_Data of each sample?

thanks a lot,

-- bogdan

bwang258 commented 5 years ago

Hi,

I am not part of the team, but I may be able to answer some of the questions based on my experience.

Without running NormalizeData(), running FindVariableGenes() throws an error: Error in seq.int(rx[1L], rx[2L], length.out = nb) : 'to' must be a finite number

Without running ScaleData(), running RunPCA() throws this error: Error in GetAssayData(object, assay.type = assay.type, slot = "scale.data") : Object@scale.data has not been set. Run ScaleData() and then retry.

This second error occurs even if NormalizeData() and FindVariableGenes() have already been run.

So I guess FindVariableGenes() uses normalized data and RunPCA() uses scaled data.

I am sorry that I am not sure about CCA.

tanasa commented 5 years ago

Thank you for your comments; yes, if the authors can advise us on these, it would be great.

sansense commented 5 years ago

1 After NormalizeData(), why is ScaleData() needed?

NormalizeData() only accounts for the sequencing depth of each cell (counts * 10000 divided by the cell's total counts, then log-transformed). ScaleData() zero-centres and scales the data (see ?ScaleData). Scaling (mean/sd) is done to bring the gene expression values into the same range; otherwise, the huge differences in range between genes make it hard to compare expression across genes. Scaling is a routine step before clustering and similar analyses. (You may also like to look at the scale() function in R.)

In the recent versions of Seurat, the ScaleData function is also used to regress out unwanted variables.
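The depth normalization described above can be sketched outside Seurat. This is a minimal illustration in plain Python, not Seurat's own code: the function name `log_normalize` and the two example cells are hypothetical, and the use of log(1 + x) rather than a plain log is an assumption about the default log normalization.

```python
import math

def log_normalize(counts, scale_factor=10000):
    """Divide each count by the cell's total, multiply by scale_factor, then take log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

# Two hypothetical cells with the same composition, sequenced at 10x different depth.
shallow_cell = [1, 3, 6]
deep_cell = [10, 30, 60]

norm_shallow = log_normalize(shallow_cell)
norm_deep = log_normalize(deep_cell)

# After normalization the two cells look alike, because only depth differed.
print(norm_shallow)
print(norm_deep)
```

Note that this step works within each cell. It does nothing about the fact that different genes have very different ranges, which is what scaling addresses.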

2 Do FindVariableGenes(), RunPCA() and FindClusters() work on Normalized_Data or on Scaled_Data?

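RunPCA() works on Scaled_Data, and FindClusters() works on the PCA embedding computed from it. FindVariableGenes() is the exception: it reads the normalized data, as the errors quoted above suggest. As I said, scaling facilitates the comparison across genes, e.g.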

g1  10  20  30  40  50
g2  20  40  60  80  100

Although g2 has double the expression of g1, their pattern of expression is the same, and scaling will "normalize" their expression so that they cluster together.
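The g1/g2 example can be checked by hand. The sketch below is plain Python rather than Seurat code; the helper `zscore` is hypothetical and assumes per-gene centering followed by division by the sample SD (n - 1 denominator), which is how base R's scale() behaves by default.

```python
import statistics

def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    return [(v - mean) / sd for v in values]

g1 = [10, 20, 30, 40, 50]
g2 = [20, 40, 60, 80, 100]

z1 = zscore(g1)
z2 = zscore(g2)

# The two scaled profiles coincide even though g2 is twice g1.
print(z1)
print(z2)
```

Any gene that is a positive multiple of another ends up with the same scaled profile, which is why the two would be grouped together.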

3 Is ScaleData() absolutely needed in scRNA-seq analysis?

Scaling is not specific to scRNA-seq. It is an important step in many machine learning and dimensionality reduction algorithms that compare distances computed from the features. If you don't scale, a feature with a large range of variation might dominate or bias your analysis (because it contributes large distances). Scaling "normalizes" these large differences in variation among the features.
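To make the point about distances concrete, here is a small sketch in plain Python (not Seurat code). The two genes, the four cells, and the helper names are all made up for illustration. It measures how much of the squared Euclidean distance between two cells comes from the high-range gene, before and after z-scoring each gene.

```python
import statistics

def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def high_gene_share(low, high, i, j):
    """Fraction of the squared distance between cells i and j contributed by the `high` gene."""
    d_low = (low[i] - low[j]) ** 2
    d_high = (high[i] - high[j]) ** 2
    return d_high / (d_low + d_high)

# Hypothetical expression of two genes across four cells.
gene_low = [1, 2, 3, 4]            # small range
gene_high = [100, 400, 200, 300]   # large range

raw_share = high_gene_share(gene_low, gene_high, 0, 3)
scaled_share = high_gene_share(zscore(gene_low), zscore(gene_high), 0, 3)

print(round(raw_share, 4), round(scaled_share, 4))
```

On the raw values the high-range gene accounts for almost all of the distance between the first and last cell. After scaling, its share drops to roughly 0.31 and the low-range gene gets a say.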

I think you are confusing normalization with scaling. Normalization "normalizes" within each cell for differences in sequencing depth / mRNA throughput. Scaling "normalizes" across the sample for differences in the range of variation of each gene's expression.

4 Does RunCCA() work on the Normalized_Data or on the Scaled_Data of each sample?

?RunCCA gives you the answer

RunCCA(object, object2, group1, group2, group.by, num.cc = 20, genes.use,
  scale.data = TRUE, rescale.groups = FALSE, ...)

As you can see, scale.data = TRUE by default, so RunCCA() uses the scaled data unless you change it.

tanasa commented 5 years ago

Dear Santosh, thank you. Very helpful to understand the statistical design of the algorithm.

DiracZhu1998 commented 3 years ago

Thanks for your reply. You said that scaling brings gene expression values into the same range so that expression can be compared across genes. That makes sense, but I am confused: why do we usually use the "data" slot rather than the "scale.data" slot to calculate or compare gene expression with VlnPlot and DotPlot?

sansense commented 3 years ago

You said that scaling brings gene expression values into the same range so that expression can be compared across genes. That makes sense, but I am confused: why do we usually use the "data" slot rather than the "scale.data" slot to calculate or compare gene expression with VlnPlot and DotPlot?

This is because with VlnPlot you are interested in how the normalized expression values are distributed. Normalized data gives a better visual representation of the range and of the actual expression values. With scaled data you lose those values and see only how many SDs each value lies from the gene's mean, which is usually of little visual use because the individual values and the overall range are gone.

There is a slight mistake in my earlier answer. Scaling is usually done after centering the data, that is, after subtracting the mean from each data point. In Seurat, there is an option to do neither scaling nor centering (although both are done by default). Scaling per se means dividing the values (original or centered) by the SD (however, Seurat says that it divides by the root mean square if the data is not centered). See ?ScaleData for full details.
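The difference between the centered and uncentered cases can be sketched as follows. This is plain Python that mirrors how base R's scale() is documented to behave. The function `scale_values` and its `center` and `scale` arguments are hypothetical names, and whether ScaleData matches this in every detail is an assumption to check against ?ScaleData.

```python
import math
import statistics

def scale_values(values, center=True, scale=True):
    """Optionally subtract the mean, then optionally divide by the root mean square.

    The root mean square uses an n - 1 denominator, so for centered data
    it is the same number as the sample standard deviation.
    """
    out = [float(v) for v in values]
    if center:
        mean = statistics.mean(out)
        out = [v - mean for v in out]
    if scale:
        rms = math.sqrt(sum(v * v for v in out) / (len(out) - 1))
        out = [v / rms for v in out]
    return out

g1 = [10, 20, 30, 40, 50]

centered_scaled = scale_values(g1)                  # (x - mean) / sd
uncentered_scaled = scale_values(g1, center=False)  # x / root mean square
untouched = scale_values(g1, center=False, scale=False)

print(centered_scaled)
print(uncentered_scaled)
print(untouched)
```

With centering, the result is the usual z-score. Without it, all values stay positive and are only shrunk by a common factor, so the mean of the scaled gene is no longer zero.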