Closed · igordot closed this issue 7 years ago
In general, we use object@scale.data for functions that identify structure in the data, such as dimensionality reduction, as this tends to give lowly and highly expressed genes equal weight. Values in object@scale.data can therefore be negative, while values in object@data are >= 0.
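The sign difference described above follows directly from the two transformations. A minimal numpy sketch (toy counts, not Seurat's actual implementation; the depth target of 1e4 mirrors Seurat's default scale factor):

```python
import numpy as np

# Toy counts: 3 genes x 4 cells (genes in rows, cells in columns).
counts = np.array([
    [0, 2, 5, 1],
    [10, 0, 3, 7],
    [1, 1, 0, 2],
], dtype=float)

# "data"-style values: normalize each cell to a common depth, then log1p.
# Every result is >= 0 because the inputs are non-negative.
depth = counts.sum(axis=0)
data = np.log1p(counts / depth * 1e4)

# "scale.data"-style values: z-score each gene across cells.
# Centering pushes below-average cells negative, which is what lets
# lowly and highly expressed genes contribute with equal weight.
scaled = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

print(data.min() >= 0)   # True: log-normalized values are non-negative
print(scaled.min() < 0)  # True: scaled values can be negative
```

The z-scoring step is why scale.data suits structure-finding but not averaging: per-gene means are zeroed out, so an "average expression" computed from it would be meaningless.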
For FindMarkers and AverageExpression, we want to either discover DE genes or compute in silico cluster averages, so using object@scale.data would be inappropriate.
You are right that RunTSNE should have the option to run on scale.data (in most cases we don't compute tSNE on gene expression values, so this is a moot point). We will fix this in an upcoming release.
Hi, I don't understand why using object@scale.data for FindMarkers is inappropriate. Two cell groups from different libraries could have very different sequencing depths. Why is object@data the default?
I'd like to hear the answer to lixin4306ren's question as well.
Along the same lines, what is your recommendation on which data type to use when applying quantitative filtering based on gene expression? For instance, I'd like to gate on CD3+CD4+ cells, so I extracted a data.frame of expression values with the FetchData() function. Within this data frame, I ran boolean tests on raw.data to add new metadata to the appropriate cells (e.g. if CD3 > 0 and CD4 > 0, annotate the cell as "CD4+ T-cell"). Would it be more appropriate to use scale.data here?
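The gating step being described can be sketched with pandas. The data frame below is a hypothetical stand-in for FetchData() output (one row per cell, one column per requested gene); the gene names and values are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for a FetchData() result: raw counts per cell.
expr = pd.DataFrame(
    {"CD3E": [0, 3, 1, 0], "CD4": [2, 5, 0, 0]},
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Boolean gate: both markers detected in the same cell.
is_cd4_t = (expr["CD3E"] > 0) & (expr["CD4"] > 0)
labels = is_cd4_t.map({True: "CD4+ T-cell", False: "other"})
print(labels.tolist())  # ['other', 'CD4+ T-cell', 'other', 'other']
```

Note that a "> 0" detection gate only makes sense on counts or normalized data; on scale.data, zero is the per-gene mean, not absence of expression, so the same threshold would select "above-average" rather than "detected" cells.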
Sorry to revive this old thread, but I am doing a similar analysis looking at 2 genes and had the exact same question. It looks like the scale.data slot holds the Pearson residuals, which can be positive and negative, while the data slot holds the depth-corrected counts, which are only positive. Based on this, I assume using the data slot would be most appropriate, but I noticed the counts tend to take specific values, essentially looking like they have been binned, which worries me if I want to compare expression for cells within a sample as well as across samples.
Do you remember what you wound up doing and how you decided what was the most accurate way of accomplishing this? Much appreciated.
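For intuition on why the Pearson residuals in scale.data take both signs, here is a minimal numpy sketch of a negative-binomial Pearson residual under a depth-only expectation model, in the spirit of sctransform. The fixed theta is illustrative only; sctransform actually regularizes a per-gene estimate:

```python
import numpy as np

# Toy counts for one gene across four cells, plus per-cell depths.
counts = np.array([0.0, 2.0, 5.0, 1.0])
depth = np.array([1000.0, 2000.0, 1500.0, 1200.0])

# Expected counts under a depth-only model: mu_c = p * depth_c,
# where p is the gene's pooled expression fraction.
p = counts.sum() / depth.sum()
mu = p * depth

# Negative-binomial Pearson residual: (x - mu) / sqrt(mu + mu^2/theta).
# theta = 100 is an arbitrary illustrative overdispersion value.
theta = 100.0
residual = (counts - mu) / np.sqrt(mu + mu**2 / theta)

print(residual)  # mixes negative (below-expected) and positive values
```

A residual near zero means the cell expressed the gene about as much as its depth predicts; the sign encodes below- vs above-expected expression, which is why averaging residuals across a cluster is not a measure of expression level.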
I am also trying to learn this area. I think you are using v5; the Pearson residual is a newer method, and v3 in 2018 didn't have it.
scale.data is mainly used for PCA, structure finding, and dimensionality reduction. Especially after UMAP (a non-linear method), the exact low-dimensional values are not biologically meaningful; only the relationships between cells are useful, for cell type identification and so on.
data still carries biological meaning since it is just log-normalized counts. FindMarkers and AverageExpression use it to stay as close as possible to the true counts.
I think it is a balancing act: the more we transform the data, the further we get from the truth, but the easier it is to cluster. So we use the most heavily manipulated scale.data for clustering, but use the data layer to find biological insights.
It does puzzle me why the counts data is almost never used; why keep it around in the object?
Edit: counts is used in this context, during FindMarkers:
if test.use is "negbinom", "poisson", or "DESeq2", slot will be set to "counts"
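The slot-selection rule quoted above can be sketched as a simple lookup. This is a paraphrase of the documented behavior, not Seurat's actual code; count-based statistical models need raw counts, while the default tests run on log-normalized data:

```python
# Tests whose statistical models are defined on raw counts.
COUNT_BASED_TESTS = {"negbinom", "poisson", "DESeq2"}

def slot_for_test(test_use: str) -> str:
    """Return which expression slot a FindMarkers-style test reads."""
    return "counts" if test_use in COUNT_BASED_TESTS else "data"

print(slot_for_test("wilcox"))    # data
print(slot_for_test("negbinom"))  # counts
```

This also answers why counts stays in the object: it is the only slot from which these model-based tests (and any re-normalization) can be run.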
With RegressOut(): Although PCA() and ICA() use object@scale.data, it looks like RunTSNE() uses object@data. Am I reading the code wrong, or should that really be the case? Shouldn't both use the same values?
In general, it seems that a lot of functions use object@data instead of object@scale.data. For example, FindMarkers() and AverageExpression(). Shouldn't most downstream functions use the scaled data?