Closed duocang closed 3 years ago
Two cells are not sufficient to do differential expression (DE) testing. I suggest you familiarize yourself with DE test methods for RNA-seq in general (DESeq2 is highly recommended), and with methods for single-cell data in particular (here is a useful review/benchmarking study; and here a description of a method that is part of the sctransform package).
@ChristophH
Thank you! I will check DESeq2 for DE test.
What about the values of two genes in one cell (the 500 vs 100 question)? `seu[["SCT"]]@scale.data` is used in Seurat to plot heatmaps, for example, so I assume its values follow the same negative binomial distribution.
The values in `scale.data` after running SCTransform are the Pearson residuals. They state how many standard deviations the observed count is above or below what one would expect if that gene were expressed at a constant level across all cells.
But again, you cannot draw conclusions by comparing these values across just two cells. You have to either smooth your data or otherwise aggregate information across multiple cells, because for any given cell you sample only a small subset of all molecules. As a result, each cell represents a different, incomplete view of its transcriptome. Whether a gene is detected, and with what exact count, will depend on the actual expression level of the gene and on the sequencing depth (and most likely some other technical factors).
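To make "how many standard deviations above or below expectation" concrete, here is a minimal sketch of a Pearson residual under a negative binomial model. It is plain Python, not sctransform code: the function name, the expected count `mu`, and the dispersion `theta` are made-up values for illustration, whereas the real package estimates them per gene from the data.

```python
import math

def pearson_residual(count, mu, theta):
    """Pearson residual of an observed count under a negative binomial model.

    mu    : expected count for this gene in this cell (hypothetical value)
    theta : NB dispersion parameter, so that variance = mu + mu**2 / theta
    """
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)

# Made-up numbers for illustration only.
expected = 4.0   # expected count given the cell's sequencing depth
theta = 10.0     # assumed dispersion

for observed in (0, 4, 12):
    residual = pearson_residual(observed, expected, theta)
    print(f"observed={observed:2d}  residual={residual:+.2f}")
```

A residual near zero means the count matches expectation; the sign and size tell you the direction and strength of the deviation. A single residual from a single cell is still one noisy draw, which is why the advice above is to aggregate over many cells.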
@ChristophH Hi.
Thank you for the patient explanation. I have read up on scaling and normalization. Basically, this is what I understand `SCTransform` is doing.
(Maybe a tiny help for others.) Correct me if I am wrong.
Normalization removes cell-level (sample-level) effects, e.g. sequencing depth. For this, the old Seurat workflow used `NormalizeData` with the `LogNormalize` method.
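As a rough illustration of what log-normalization does, the sketch below divides each gene's count by the cell's total count, multiplies by a scale factor (10,000 is assumed here), and applies log1p. It is plain Python with invented gene names and counts, not Seurat code.

```python
import math

def log_normalize(cell_counts, scale_factor=10_000):
    """Depth-normalize one cell's counts, then log1p-transform them."""
    total = sum(cell_counts.values())
    if total == 0:
        raise ValueError("cell has no counts")
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }

# Two hypothetical cells with the same composition but 10x different depth.
shallow = {"GeneA": 5, "GeneB": 45, "GeneC": 50}      # 100 molecules sampled
deep = {"GeneA": 50, "GeneB": 450, "GeneC": 500}      # 1000 molecules sampled

norm_shallow = log_normalize(shallow)
norm_deep = log_normalize(deep)
for gene in shallow:
    print(gene, round(norm_shallow[gene], 3), round(norm_deep[gene], 3))
```

After normalization the two cells get the same values, because the difference between them was purely one of sequencing depth.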
Scaling removes the gene effect (not really a good word). Some genes have extremely high expression and would dominate a heatmap, for example, while what we care about is how a gene varies across cells rather than one specific expression value. For this, the old Seurat workflow used `ScaleData` to compute z-scores, which bring expression values into a reasonably comparable range.
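The z-score idea can be sketched as follows: each gene is centered and divided by its own standard deviation across cells. This is plain Python with invented values, not Seurat code; the upper clipping value of 10 is an assumption added for illustration.

```python
import statistics

def zscore_gene(values, clip=10.0):
    """Center and scale one gene's expression across cells (z-score)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    if sd == 0:
        return [0.0] * len(values)         # gene does not vary across cells
    return [min((v - mean) / sd, clip) for v in values]

# Hypothetical expression of two genes across the same three cells.
high_gene = [100, 200, 300]   # highly expressed gene
low_gene = [1, 2, 3]          # lowly expressed gene

print(zscore_gene(high_gene))
print(zscore_gene(low_gene))
```

Both genes end up with the same scaled values even though their absolute expression differs a hundredfold. That is what makes scaled values useful for heatmaps, and also why they are relative rather than absolute.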
So far, I realize that different genes (Gene 1 vs Gene 2 vs Gene 3 ...) do not follow the same statistical distribution, such as a single normal distribution (single-cell data apparently does not follow a normal distribution). Thus it does not make sense to calculate the variance of the expression of (thousands of) genes within one cell and compare it with the variance in other cells.
Do you have some ideas or methods to rank genes by their expression across cells?
Again, you have to aggregate expression across multiple cells, either by averaging within a cluster or within some local neighborhood. As expression I would use the normalized counts (`counts` slot of the SCTransform output) or the log1p-transformed normalized counts (`data` slot of the SCTransform output). Both are on an absolute scale and not relative (like `scale.data`).
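A minimal sketch of the aggregation step, assuming you already have normalized expression values per cell and a cluster label per cell. It is plain Python with invented cells, genes, and values; the helper names are hypothetical and this is not Seurat code.

```python
from collections import defaultdict
from statistics import mean

def mean_expression_by_cluster(expr, clusters):
    """Average each gene within each cluster.

    expr     : {cell: {gene: normalized expression}}
    clusters : {cell: cluster label}
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for cell, genes in expr.items():
        for gene, value in genes.items():
            grouped[clusters[cell]][gene].append(value)
    return {
        cluster: {gene: mean(values) for gene, values in genes.items()}
        for cluster, genes in grouped.items()
    }

def rank_genes(gene_means):
    """Return gene names ordered from highest to lowest mean expression."""
    return sorted(gene_means, key=gene_means.get, reverse=True)

# Made-up normalized values for four cells in two clusters.
expr = {
    "c1": {"Gene1": 0, "Gene2": 5, "Gene3": 9},
    "c2": {"Gene1": 2, "Gene2": 3, "Gene3": 11},
    "c3": {"Gene1": 7, "Gene2": 0, "Gene3": 1},
    "c4": {"Gene1": 9, "Gene2": 2, "Gene3": 0},
}
clusters = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}

means = mean_expression_by_cluster(expr, clusters)
for cluster, gene_means in means.items():
    print(cluster, rank_genes(gene_means))
```

Note that cell c1 has a zero for Gene1 and c3 has a zero for Gene2. Comparing single cells would treat those dropouts as real differences, while the cluster averages smooth them out before ranking.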
Hi.
I am a bit confused about the purpose/meaning of `SCTransform` because I lack statistical knowledge. I am trying to compare two cells to see which one is more active/variable. To do this, I am trying to get the variance of genes in each cell.
For `Gene2`, `500 vs 300` is for sure meaningful because it concerns a single gene. I am confident to say `Gene2` is more expressed in `Cell 1`.
`500 vs 1000`? As far as I understand, `Gene 2` and `Gene 3` should follow the same statistical distribution after `SCTransform` (the negative binomial distribution). If it were the normal distribution, we would say `Gene 2` and `Gene 3` have the same mean and variance after normalization.
Is `V1 vs V2` meaningful? Or aren't they comparable at all (statistically or biologically)? If they follow the same negative binomial distribution, I think it is meaningful.
Thank you! DuoCang