satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
203 stars 33 forks source link

Can I use the value processed by SCTransform to calcuate variance of genes in one cell? #95

Closed duocang closed 3 years ago

duocang commented 3 years ago

Hi.

I was a bit confused about the purpose/meaning of SCTransform because of the lack of statistic knowledge.

I try to compare two cells, which cell is more active/variant. I try to get the variance of genes in each cell.

Gene Cell 1 Cell 2
Gene 1 1 2
Gene 2 500 300
Gene 3 1000 8
Variance v1 (8.8) v2 (3.3)

The numbers in the table above are the results of simulated SCTransform calculations

For Gene2, 500 vs 300 is for sure meaning because of one gene. I am confident to say Gene2 is more expressed in Cell 1.

  1. What about 500 vs 1000? As far as I understand, Gene 2 and Gene 3 should follow the same statistic distribution after SCTransfomr (the negative binomial distribution). If it was the normal distribution, we say Gene 2 and Gene 3 have the same mean and variance after normalization.
  2. Is V1 vs V2 meaningful? Or aren't they comparable at all (statistically or biologically)? If they follow the same negative binomial distribution, I think it is meaningful.

Thank you! DuoCang

ChristophH commented 3 years ago

Two cells are not sufficient to do differential expression (DE) testing. I suggest you familiarize yourself with DE test methods for RNA-seq (DESeq2 is highly reccommended) in general, and methods for single-cell data in particular (here is a useful review/benchmarking study; and here a description of a method that is part of the sctransform package)

duocang commented 3 years ago

@ChristophH

Thank you! I will check DESeq2 for DE test.

How is the value of two genes in one cell (the 500 vs 100 question)?
seu[["SCT"]]@scale.data is used in Seurat to plot heatmap for exampe, I assume this indicates the same negative binomial distribution.

ChristophH commented 3 years ago

The values in scale.data after running SCTransform are the Pearson residuals. They state how many standard deviations the observed count is above or below what one would expect if that gene was expressed at constant level across all cells. But again, you cannot draw conclusions by comparing these values across just two cells. You have to either smooth your data or otherwise aggregate information across multiple cells, because for any given cell you sample only a small subset of all molecules. As a result each cell represents a different incomplete view its transcriptome. Whether a gene is detected and with what exact count will depend on the actual expression level of the gene and on the sequencing depth (and most likely some other technical factors).

duocang commented 3 years ago

The values in scale.data after running SCTransform are the Pearson residuals. They state how many standard deviations the observed count is above or below what one would expect if that gene was expressed at constant level across all cells. But again, you cannot draw conclusions by comparing these values across just two cells. You have to either smooth your data or otherwise aggregate information across multiple cells, because for any given cell you sample only a small subset of all molecules. As a result each cell represents a different incomplete view its transcriptome. Whether a gene is detected and with what exact count will depend on the actual expression level of the gene and on the sequencing depth (and most likely some other technical factors).

@ChristophH Hi.

Thank you for the patient explanation. I have checked scaling and normalization. Basically, these are what I understand SCTransform is doing.

(Maybe very tiny help for others) Correct me if I am wrong. Normalization is to remove the effect of sample/cells, e.g sequencing depth. So old Seurat used NormalizeData by LogNormalize. Scaling is to remove the gene effect (not really a good word). Some genes can have an extremely high expression which will be the domination in a heatmap for example. While we care about the variance of genes across cells, instead of one specific expression value. So old Seurat used ScaleDatato get z-score to scale expression values to a relatively reasonable range.

So far, I realize that genes between genes, such as Gene 1 vs Gene 2 vs Gene 3 ..... shall not follow the same statistic distribution, like normal distribution (apparently sc data does not follow a normal distribution). Thus it does not make sense to calculate a variance of (thousands) genes' expression in one cell, to compare with other cells' variance.

Do you have some ideas or methods to rank genes by their expression across cells?

ChristophH commented 3 years ago

Again, you have to aggregate expression across multiple cells, either by averaging within a cluster, or some local neighborhood. As expression I would use the normalized counts (counts slot of SCTransform output) or the log1p-transformed normalized counts (data slot of SCTransform output). Both are on an absolute scale and not relative (like scale.data).