satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

using object@data versus object@scale.data #62

Closed: igordot closed this issue 7 years ago

igordot commented 7 years ago

With RegressOut():

Seurat stores the z-scored residuals of these models in the scale.data slot, and they are used for dimensionality reduction and clustering.

Although PCA() and ICA() use object@scale.data, it looks like RunTSNE() uses object@data. Am I reading the code wrong, or is that really the case? Shouldn't both use the same values?

In general, it seems that a lot of functions use object@data instead of object@scale.data. For example, FindMarkers() and AverageExpression(). Shouldn't most downstream functions use the scaled data?

satijalab commented 7 years ago

In general, we use object@scale.data for functions that identify structure in the data, such as dimensionality reduction, as this will tend to give lowly and highly expressed genes equal weight. Values in object@scale.data can therefore be negative, while values in object@data are >=0.
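
As a quick illustration of that distinction (a sketch; object here is a Seurat v2-era object, matching the slot notation used in this thread):

```r
range(object@scale.data)  # z-scored residuals: spans negative and positive
min(object@data)          # log-normalized expression: always >= 0
```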

For FindMarkers and AverageExpression, we want to either discover DE genes or compute in silico cluster averages, so using object@scale.data would be inappropriate.
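
Accordingly, both of these operate on the non-negative data slot rather than scale.data (a sketch; the cluster identities are hypothetical):

```r
de  <- FindMarkers(object, ident.1 = 1, ident.2 = 2)
avg <- AverageExpression(object)
```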

You are right that RunTSNE should have the option to run on scale.data (in most cases we don't compute tSNE on gene expression values, so this is a moot point). We will fix in an upcoming release.
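
In practice, tSNE is typically run on the PCA embedding rather than on either expression slot (a sketch with v2-style arguments, assuming PCA has already been computed; later versions use dims = instead):

```r
object <- RunTSNE(object, dims.use = 1:10)  # tSNE on the first 10 PCs
```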

lixin4306ren commented 6 years ago

Hi, I don't understand why using object@scale.data for FindMarkers is inappropriate. Two cell groups from different libraries could have very different sequencing depths. Why is object@data the default?

atakanekiz commented 6 years ago

I'd like to hear the answer to lixin4306ren's question as well.

Along the same lines, what is your recommendation on which data type to use when applying quantitative filtering based on gene expression? For instance, I'd like to gate on CD3+CD4+ cells. To do that, I extracted a data.frame of expression values with the FetchData() function and ran boolean tests on raw.data to add new metadata to the appropriate cells (e.g. if CD3 > 0 and CD4 > 0, annotate the cell as "CD4+ T-cell").

Would it be more appropriate here to use scale.data?
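
A minimal sketch of the gating procedure described above (Seurat v3+ syntax; the gene names and gate label are illustrative assumptions, and FetchData() pulls from the data slot by default):

```r
# Gate on CD3+CD4+ cells via FetchData() (gene names are assumptions):
expr <- FetchData(object, vars = c("CD3E", "CD4"))   # `vars.all` in Seurat v2
is_cd4_t <- expr$CD3E > 0 & expr$CD4 > 0
object$gate <- ifelse(is_cd4_t, "CD4+ T-cell", "other")  # new metadata column
```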

Pmaj7 commented 1 year ago

(quoting atakanekiz's question above about gating on CD3+CD4+ cells with FetchData())

Sorry to revive this old thread, but I am doing a similar analysis looking at 2 genes and had the exact same question. It looks like the scale.data slot has the Pearson residuals, which can be positive or negative, while the data slot has the depth-corrected counts, which are only positive. Based on this, I assume using the data slot would be most appropriate, but I noticed the values tend to fall on specific points, essentially looking like they have been binned, which worries me if I want to compare expression for cells within a sample as well as across samples.

Do you remember what you wound up doing and how you decided what was the most accurate way of accomplishing this? Much appreciated.
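
One plausible explanation for the "binned" look, stated here as an assumption rather than something confirmed in this thread: LogNormalize maps integer counts to log1p(count / cell_total * 1e4), so low counts in cells of similar depth can only take a small discrete set of values. A quick way to check (hypothetical object obj; Seurat v5 prefers layer = over slot =):

```r
# Distinct non-zero values of one gene in the data slot:
d <- GetAssayData(obj, slot = "data")["CD4", ]
head(sort(unique(d[d > 0])))  # a handful of repeated values, not a continuum
```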

yinshiyi commented 6 months ago

(quoting the gating exchange between atakanekiz and Pmaj7 above)

I am also trying to learn this area. I think you are using v5; the Pearson residual is a newer method, which the 2018-era v3 didn't have.

scale.data is mainly used for PCA and other structure-finding / dimensionality-reduction steps. Especially after UMAP (a non-linear method), the exact low-dimensional values are not biologically meaningful; only the relationships between cells are useful, for cell type identification and so on.
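
That division of labor corresponds to the standard workflow, sketched below (Seurat v3+ function names; the parameter values are illustrative, and variable features are assumed to have been identified already):

```r
obj <- ScaleData(obj)                   # fills scale.data
obj <- RunPCA(obj)                      # PCA runs on scale.data
obj <- FindNeighbors(obj, dims = 1:10)  # neighbor graph on the PCA embedding
obj <- FindClusters(obj)                # clustering on that graph
obj <- RunUMAP(obj, dims = 1:10)        # non-linear embedding, for visualization
```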

The data slot still carries biological meaning, since it is just log-normalized counts. FindMarkers and AverageExpression use it to stay as close as possible to the true counts.

I think it is a balancing act: the more we transform the data, the further we get from the truth, but the easier the data becomes to cluster. So we use the most heavily transformed values (scale.data) for the clustering, but the data layer to find biological insights.

It does puzzle me why the count data is almost never used; why keep it around in the object at all? Edit: counts are used in this context: in FindMarkers, if test.use is "negbinom", "poisson", or "DESeq2", the slot is set to "counts".
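
For example (a sketch; the identity labels are hypothetical):

```r
# Count-based tests pull from the counts slot rather than data:
markers <- FindMarkers(object, ident.1 = "CD4 T", ident.2 = "CD8 T",
                       test.use = "negbinom")
```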