tanasa closed this issue 5 years ago
Hi,
I am not part of the team, but I may be able to answer some of the questions based on my experience.
Without running NormalizeData(), running FindVariableGenes will throw an error:
Error in seq.int(rx[1L], rx[2L], length.out = nb) : 'to' must be a finite number
Without running ScaleData(), running RunPCA will throw this error:
Error in GetAssayData(object, assay.type = assay.type, slot = "scale.data") : Object@scale.data has not been set. Run ScaleData() and then retry.
Now this will also happen even if NormalizeData() and FindVariableGenes are run.
So I guess FindVariableGenes uses normalized data and RunPCA uses scaled data.
I am sorry that I am not sure about CCA.
Thank you for your comments; yes, if the authors can advise us on these, it would be great.
1 after NormalizeData() function, why ScaleData() function is needed ?
NormalizeData() only accounts for the depth of sequencing in each cell (reads * 10000 divided by total reads, then log-transformed). ScaleData() zero-centres and scales the result (see ?ScaleData). Scaling (mean/sd) is done to bring the gene expressions in same range otherwise, the huge difference in ranges of gene-expression will not allow comparing the expression across the genes. Scaling is routinely done before clustering and similar analyses. (You may also like to look at the scale() function in R.)
In the recent versions of Seurat, the ScaleData function is also used to regress out unwanted variables.
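To make the distinction concrete, here is a minimal base-R sketch of the two steps on a made-up count matrix. It is not Seurat's code: the matrix, the variable names, and the use of log1p (natural log of 1 + x) are assumptions for illustration; the factor of 10000 comes from the description above.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
counts <- matrix(
  c(1,  0,  3,
    2,  4,  6,
    7, 16, 21),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2", "cell3"))
)

# Normalization works per cell (column): divide by the cell's total,
# multiply by 10000, then take log(1 + x).
scale_factor <- 10000
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)

# Scaling works per gene (row): subtract the gene's mean across cells
# and divide by its standard deviation.
scaled <- t(scale(t(normalized)))

round(rowMeans(scaled), 10)  # every gene is centred on 0
apply(scaled, 1, sd)         # and has unit standard deviation
```

Normalization leaves each gene with its own mean and spread; only after the second step are genes on a common scale.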
2 is FindVariableGenes() or RunPCA() or FindCluster() working on Normalized_Data or on Scaled_Data ?
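RunPCA() works on the scaled data, and FindClusters() works on the PCA result computed from it. FindVariableGenes() is the exception: as the error reported above suggests, it uses the normalized data. As I said, scaling facilitates the comparison across the genes, e.g.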
g1 10 20 30 40 50
g2 20 40 60 80 100
Although g2 has double the expression of g1, their patterns of expression are the same, and scaling will "normalize" their expression so that they cluster together.
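The g1/g2 example can be checked directly with base R's scale(). This is only a sketch of the arithmetic, not of what Seurat does internally; the variable names z1 and z2 are mine.

```r
g1 <- c(10, 20, 30, 40, 50)
g2 <- c(20, 40, 60, 80, 100)

# scale() subtracts the mean and divides by the standard deviation;
# as.vector() drops the matrix shape and attributes it returns.
z1 <- as.vector(scale(g1))
z2 <- as.vector(scale(g2))

z1
z2
all.equal(z1, z2)  # TRUE: the two genes have identical z-scores
```

Because g2 is just g1 multiplied by a constant, the constant cancels when dividing by the standard deviation, so both genes end up with the same scaled profile.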
3 is ScaleData() absolutely needed in the scRNA-seq analysis ?
Scaling is not specific to scRNA-seq. It is an important step in many machine learning / dimensionality reduction algorithms that rely on distances computed from the features. If you don't scale, features with a large range of variation might dominate or bias your analysis (because they produce large distances). Scaling "normalizes" these large differences in variation among the features.
I think you are confusing normalization with scaling. Normalization "normalizes" within each cell for differences in sequencing depth / mRNA throughput. Scaling "normalizes" across the sample for differences in the range of variation of each gene's expression.
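The point about large-range features dominating distances can be illustrated with a small base-R sketch. The expression values and the names geneA, geneB, raw_d and scaled_d are hypothetical, chosen only to show the effect.

```r
# Rows are cells, columns are genes (made-up values).
expr <- cbind(
  geneA = c(100, 500, 900, 120),  # large range of variation
  geneB = c(1.0, 1.1, 3.0, 3.1)   # small range of variation
)
rownames(expr) <- paste0("cell", 1:4)

# Euclidean distances between cells, before and after z-scoring each gene.
raw_d    <- as.matrix(dist(expr))
scaled_d <- as.matrix(dist(scale(expr)))

# Unscaled: cell1 looks closest to cell4, because only geneA matters.
raw_d["cell1", c("cell2", "cell4")]

# Scaled: geneB now contributes, and cell1 is closer to cell2.
scaled_d["cell1", c("cell2", "cell4")]
```

Without scaling, the large difference between cell1 and cell4 in geneB is invisible next to the geneA values; after scaling, both genes carry comparable weight.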
4 is RunCCA() working on Normalized_Data or on Scaled_Data of each sample ?
?RunCCA gives you the answer
RunCCA(object, object2, group1, group2, group.by, num.cc = 20, genes.use,
scale.data = TRUE, rescale.groups = FALSE, ...)
As you see, scale.data = TRUE, i.e. by default it works on the scaled data.
Dear Santosh, thank you. Very helpful to understand the statistical design of the algorithm.
Thanks for your reply. You said "Scaling (mean/sd) is done to bring the gene expressions in same range otherwise, the huge difference in ranges of gene-expression will not allow comparing the expression across the genes. ". That makes sense, but I am confused: why do we usually use the "data" slot rather than the "scale.data" slot to calculate or compare gene expression with VlnPlot and DotPlot?
This is because with VlnPlot you are interested in how the normalized expression values are distributed. Plotting them gives a better visual representation of the range and of the actual normalized expression values. If you plot the scaled data instead, you lose the actual normalized values and see only how many SDs each value lies from the gene's mean, which is of little visual use because the individual values and the overall range are no longer visible.
There is a slight mistake in my earlier answer. Scaling is usually done after centering the data, that is, after subtracting the mean from each data point. In Seurat, there is an option to skip scaling, centering, or both (although both are done by default). Scaling per se means dividing the values (of the original or centered data) by the SD (however, Seurat says that it divides by the root mean square if the data is not centered). See ?ScaleData for full details.
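The centering and scaling combinations can be tried out with base R's scale(), which documents the same root-mean-square behaviour when centering is switched off. This is a sketch of the arithmetic only; the exact argument names and defaults of Seurat's ScaleData should be checked in ?ScaleData, and the variable names below are mine.

```r
x <- c(10, 20, 30, 40, 50)

# Centering only: subtract the mean.
centered_only <- as.vector(scale(x, center = TRUE, scale = FALSE))

# Centering and scaling: subtract the mean, divide by the SD.
centered_scaled <- as.vector(scale(x, center = TRUE, scale = TRUE))

# Scaling without centering: divide by the root mean square,
# which base R defines as sqrt(sum(x^2) / (n - 1)).
scaled_only <- as.vector(scale(x, center = FALSE, scale = TRUE))
rms <- sqrt(sum(x^2) / (length(x) - 1))

all.equal(centered_only, x - mean(x))
all.equal(centered_scaled, (x - mean(x)) / sd(x))
all.equal(scaled_only, x / rms)
```

Once the data are centered, the root mean square and the SD coincide, which is why the two descriptions agree in the default case.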
Dear Seurat authors and contributors,
As I have just started reading the Seurat documentation for scRNA-seq, I would appreciate your answers and insights on the following:
1 after NormalizeData() function, why ScaleData() function is needed ?
2 is FindVariableGenes() or RunPCA() or FindCluster() working on Normalized_Data or on Scaled_Data ?
3 is ScaleData() absolutely needed in the scRNA-seq analysis ?
4 is RunCCA() working on Normalized_Data or on Scaled_Data of each sample ?
thanks a lot,
-- bogdan