hms-dbmi / scde

R package for analyzing single-cell RNA-seq data
http://pklab.med.harvard.edu/scde
Other
170 stars 64 forks source link

Genes with 0 expression in one group are not DE when same groups are used for error models #23

Closed Carldeboer closed 8 years ago

Carldeboer commented 8 years ago

Hi I noticed that when comparing cell types where one group does not express a gene at all (0 counts) and the other expresses it, this gene often fails to come up as differentially expressed in spite of its (often large) difference in expression. I noticed this when one of the most DE genes in my set was not coming up as significant (Igj), so I made this example where I blank out one half for each of two very highly expressed genes (Rplp0 and Actb). In all three cases, when the same grouping is used for the error models as the DE test, the significance is much smaller than when the grouping is not provided to the error model. In the case where the error model knows about the groups, these "binary" genes are not even near the top of the DE gene list. In fact, for Igj, it is not DE at all (FC=0, Z=0) despite it being highly expressed in one group (variably so) and 0 in the other).

##SCDE
#add controls
combinedOMsMat["Rplp0",names(combinedIsWeirdo)[combinedIsWeirdo=="TRUE"]]=0;
combinedOMsMat["Actb",names(combinedIsWeirdo)[combinedIsWeirdo=="FALSE"]]=0;
storage.mode(combinedOMsMat) <- "integer";

#initial grouping for error models
combinedErrModel = scde.error.models(counts = combinedOMsMat, groups = combinedIsWeirdo, n.cores = 2,  save.crossfit.plots = FALSE, save.model.plots = FALSE, verbose = 1); #threshold.segmentation = TRUE,
combinedPrior <- scde.expression.prior(models = combinedErrModel, counts = combinedOMsMat, length.out = 400, show.plot = FALSE)
combinedDE <- scde.expression.difference(combinedErrModel, combinedOMsMat, combinedPrior, groups  =  combinedIsWeirdo, n.randomizations  =  1000, n.cores  =  2, verbose  =  1)

################ NO INITIAL GROUPING
combinedErrModelNoG = scde.error.models(combinedOMsMat, n.cores = 2)
combinedPriorNoG <- scde.expression.prior(models = combinedErrModelNoG, counts = combinedOMsMat)
combinedDENoG <- scde.expression.difference(combinedErrModelNoG, combinedOMsMat, combinedPriorNoG, groups  =  combinedIsWeirdo, n.randomizations  =  1000, n.cores  =  2, verbose  =  1)

message("DE results for controls when an initial grouping is used");
combinedDE[c("Igj","Actb","Rplp0"),];

message("DE results for controls when NO initial grouping is used");
combinedDENoG[c("Igj","Actb","Rplp0"),];

Here is the output:

cross-fitting cells.
number of pairs:  946
number of pairs:  351
total number of pairs:  1297
cross-fitting 1297 pairs:
building individual error models.
adjusting library size based on 2000 entries
fitting FALSE models:
fitting TRUE models:
comparing groups:

FALSE  TRUE
   44    27
calculating difference posterior
summarizing differences
comparing groups:

FALSE  TRUE
   44    27
calculating difference posterior
summarizing differences
DE results for controls when an initial grouping is used
               lb       mle        ub        ce         Z         cZ
Igj   -3.74336039  0.000000  0.000000  0.000000  0.000000  0.0000000
Actb  -5.79914027 -2.485346 -0.767082 -0.767082 -2.667879 -1.2791274
Rplp0 -0.03068328  0.000000  6.627589  0.000000  1.740882  0.4774509
DE results for controls when NO initial grouping is used
             lb       mle        ub        ce         Z        cZ
Igj   -12.22303 -12.22303 -11.88690 -11.88690 -7.160809 -6.379707
Actb  -12.22303 -12.22303 -11.30630 -11.30630 -7.160847 -6.379707
Rplp0  11.61188  12.22303  12.22303  11.61188  7.160813  6.379707
JEFworks commented 8 years ago

Hi Carl,

It's difficult for me to get a sense of the problem without the data. But have you looked at the cell model fits when you provide the group labels versus when you omit the group labels? You can do this by setting save.model.plots = TRUE.

When groups are provided, error models are built using only cells in the same group, thereby only making pairwise comparisons between cells of that group to derive the expected gene expression magnitudes and drop-out rates. Otherwise, all cells are pooled together. So my best guess without seeing your data is that the derived error models are substantially different for some reason if you're seeing downstream differences in the differential expression analysis.

Also, have you tried looking at your Igj gene in more detail via: scde.test.gene.expression.difference("Igj", models = o.ifm, counts = cd, prior = o.prior) for either error models?

Best, Jean

Carldeboer commented 8 years ago

My data are here: http://www.broadinstitute.org/~cgdeboer/20151210_plasma_cell_DE_comparison_justReq.RData

You should be able to load that and run the code above directly.

I have not looked at the resulting error models. I agree that this is likely where the issue lies.

JEFworks commented 8 years ago

Hi Carl,

I've been able to replicate your problem and determine the cause.

The source of the problem was that the group labels you provided in the first part of your code is not in the same order as the error models. Note the difference in the 3 following ways of running scde.expression.difference:

# groups = NULL so taken from error model                                                                                                  
combinedDE <- scde.expression.difference(combinedErrModel, combinedOMsMat, combinedPrior, n.cores  =  10, verbose  =  1)         
# groups in the wrong order                                                                                                      
combinedDE2 <- scde.expression.difference(combinedErrModel, combinedOMsMat, combinedPrior, groups = combinedIsWeirdo, n.cores  =  10, verbose  =  1)                                                                                                             
# groups in right order                                                                                                          
combinedDE3 <- scde.expression.difference(combinedErrModel, combinedOMsMat, combinedPrior, groups = combinedIsWeirdo[rownames(combinedErrModel)], n.cores  =  10, verbose  =  1)                                                                                 

Note that the results for combinedDE and combinedDE3 are quite similar to your combinedDENoG as expected.

> combinedDE[c("Igj","Actb","Rplp0"),];                                                                                         
              lb       mle        ub        ce         Z        cZ                                                               
 Igj   -12.27471 -12.27471 -11.96784 -11.96784 -7.160847 -6.447212                                                               
 Actb  -12.27471 -12.27471 -11.75304 -11.75304 -7.160809 -6.447212                                                               
 Rplp0  11.59960  12.27471  12.27471  11.59960  7.160813  6.447212                                                               
 > combinedDE2[c("Igj","Actb","Rplp0"),];                                                                                        
                lb       mle         ub         ce         Z         cZ                                                          
 Igj   -1.22747122  0.000000  0.0000000  0.0000000  0.000000  0.0000000                                                          
 Actb  -6.38285035 -2.025328 -0.7364827 -0.7364827 -2.556040 -1.2923053                                                          
 Rplp0 -0.03068678  0.000000  6.4135371  0.0000000  1.942183  0.7596449                                                          
 > combinedDE3[c("Igj","Actb","Rplp0"),];                                                                                        
              lb       mle        ub        ce         Z        cZ                                                               
 Igj   -12.27471 -12.27471 -11.96784 -11.96784 -7.160847 -6.447212                                                               
 Actb  -12.27471 -12.27471 -11.75304 -11.75304 -7.160809 -6.447212                                                               
 Rplp0  11.59960  12.27471  12.27471  11.59960  7.160813  6.447212  

I will try to update the script to automatically orient the group labels in the same order as the error models when possible to prevent such errors in the future.

Best, Jean