kdzimm / PseudoreplicationPaper

Code used to carry out parameter estimation, correlation estimation, type 1 error analysis, and power analysis for our "Pseudoreplication in Single-Cell" study
MIT License
19 stars 5 forks source link

simulate() error when running Power.Rmd script #2

Closed jgamache014 closed 3 years ago

jgamache014 commented 3 years ago

Hello @kdzimm et al.,

Thank you for providing this excellent resource for incorporating a random effect for individual when running MAST for single-cell differential expression analyses. After loading all of the necessary packages, I started building the simulation as in 'Power.Rmd' but got an error at line 27. Here is my full script in R version 4.0.3:

library(ggplot2) 
library(fitdistrplus) 
library(MASS) 
library(tidyr) 
library(gdata)
library(Seurat)
library(data.table)
library(EnvStats)
library(purrr)
library(dplyr)
library(sn)
library(matrixStats)
library(fmsb)

ngenes <- 100
n.per.group <- 5
ncontrols <- n.per.group
ncases <- n.per.group
mean.number.cells.per.person <- 150
ncells.per.control <- rpois(n=ncontrols,lambda=mean.number.cells.per.person)
ncells.per.case <- rpois(n=ncases,lambda=mean.number.cells.per.person)
ncells <- sum(ncells.per.case) + sum(ncells.per.control)
allgenes <- as.data.frame(replicate(ngenes,simulate(foldchange=2)))

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no applicable method for 'simulate' applied to an object of class "c('double', 'numeric')"

It seems that the format of the simulate() argument is not correct, and I am having trouble determining what was intended here.

Thanks for your help!

kdzimm commented 3 years ago

@jgamache014

Thanks for taking a look at things! Happy to help. It looks like you have not built the "simulate" function in earlier steps. The steps in the "Simulation Example" under "Parameter Estimation" must be run first before running the "Power" script. Please, let me know if you encounter any additional errors. What are your goals for proceeding? Are you just trying to simulate data or compute power curves? If so, try running using the hierarchicell package (https://github.com/kdzimm/hierarchicell - recently released and is more recently updated than this code). I'd recommend following these steps to get to simulated data:

library(hierarchicell) dat0 <- filter_counts() data_summaries <- compute_data_summaries(dat0) simulated_counts <- simulate_hierarchicell(data_summaries)

jgamache014 commented 3 years ago

Hi @kdzimm, sorry for the late response. Running the simulation example first worked - thanks! I looked at this to get a sense of how you set up the objects for running the zlmCond analysis in the Power.rmd script. I have single-cell RNA-seq data from 24 samples that I want to analyze with MAST, incorporating a random effect for sample ID. I've been using Seurat for data processing, but the Seurat implementation of MAST in the FindMarkers() function for differential expression does not enable random effect modeling.

When I ran your script with my data, setting up my sca object exactly as you did, I got several errors and many warnings:

Micro1_for_DE_test <- readRDS("~/Documents/10X_data/Micro1_for_DE_test.rds")
allgenes_JG <- as.data.frame(t(Micro1_for_DE_test@assays$SCT@scale.data))
allgenes_JG$orig.ident <- Micro1_for_DE_test@meta.data$orig.ident
allgenes_JG$sampID <- Micro1_for_DE_test@meta.data$sampID
allgenes_JG$wellKey <- paste(Micro1_for_DE_test@meta.data$sampID, rownames(Micro1_for_DE_test@meta.data), sep = "_")
rownames(allgenes_JG) <- allgenes_JG$wellKey
allgenes_JG <- allgenes_JG[,c(21018,21017,21016,1:21015)]
genecounts_JG <- as.matrix(t(allgenes_JG[,c(-1,-2,-3)]))
coldata_JG <- allgenes_JG[,1:3]
coldata_JG$orig.ident <- as.factor(coldata_JG$orig.ident)
genecounts_JG <- genecounts_JG[which(apply(genecounts_JG, 1, mean) > 0), ] #removed 12,415 genes for 8600 total
genecounts_JG <- genecounts_JG[,rownames(coldata_JG)]

fData_JG <- data.frame(primerid=rownames(genecounts_JG))
sca_JG <- suppressMessages(MAST::FromMatrix(exprsArray=genecounts_JG, cData=coldata_JG, fData=fData_JG)) 
cdr2_JG <- colSums(SummarizedExperiment::assay(sca_JG)>0)

SummarizedExperiment::colData(sca_JG)$ngeneson <- scale(cdr2_JG) 
SummarizedExperiment::colData(sca_JG)$orig.ident <- factor(SummarizedExperiment::colData(sca_JG)$orig.ident) 
SummarizedExperiment::colData(sca_JG)$sampID <- factor(SummarizedExperiment::colData(sca_JG)$sampID)

zlmCond_JG <- suppressMessages(MAST::zlm(~ ngeneson + orig.ident + (1 | sampID), sca_JG, method='glmer',ebayes = F, strictConvergence = FALSE)) 
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GHrule(0L), compDev = compDev,  : 
  pwrssUpdate did not converge in (maxit) iterations
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “list” is not valid for @‘optimMsg’ in an object of class “LMERlike”; is(value, "character") is not TRUE
In addition: There were 21 warnings (use warnings() to see them)
Error in (function (cl, name, valueClass)  : 
  assignment of an object of class “list” is not valid for @‘optimMsg’ in an object of class “LMERlike”; is(value, "character") is not TRUE
In addition: There were 50 or more warnings (use warnings() to see the first 50)
There were 13 warnings (use warnings() to see them)

Looking at a recent post about the first error (lme4 issue #573), I'm concerned the negative values in my matrix are causing problems. In Seurat, I used SCTransform to normalize data from each sample - this function generates Pearson residuals from a regularized negative binomial regression, where cellular sequencing depth is utilized as a covariate in a generalized linear model. The Seurat team recommended using these residuals (stored in the scale.data slot) for differential expression analysis - see this vignette under "Where are normalized values stored for sctransform?". Another option would be to use corrected log-normalized counts in the data slot of the SCTransform assay in the Seurat object - I plan to try this next.

Do you have any insight for how to address these errors or would you recommend contacting the MAST and/or Seurat teams directly?

I appreciate your help!

kdzimm commented 3 years ago

@jgamache014

No need to apologize for the delay! This is not an error I have seen before, but I think you may be right that the error is being caused by the negative values from the SCTransfom normalization. In my experience with MAST, I am usually providing log2(x + 1) transformed values (either raw counts or TPMs) directly to MAST without doing normalization (just transformation) first. It is my understanding the "ngeneson" parameter in the model will account for "cell size" and act as a sort of normalization. I also noticed you are filtering based on mean > 0 here too. After a normalization that provides negative values, I am not sure this is doing the same thing. I included this in my code to remove genes that were zero in all cells. Personally, I would try using the data that are not normalized with SCTransform and see if that helps. If that doesn't do it, then it is either a lme4 or MAST issue (or a combination of the two) in which case you would probably be better off asking the MAST team about first (and then it may lead you to needing help from Ben Bolker with lme4). The MAST group has been very helpful to me in the past. I would start there if my first suggestion does not work.

Keep in mind that you are very likely to always get errors when fitting mixed models with these highly zero-inflated data. Finak discusses this very clearly here: https://github.com/RGLab/MAST/issues/148

I'll be honest, I am less familiar with Seurat. I have only used it to do cell clustering and make tSNEs. I don't know how it incorporates MAST into its software - but it appears the wrapper for it does not allow for random effects. I noticed @esrasefik was having some issues with this here: https://github.com/satijalab/seurat/issues/3712 . it appears you have already had some interaction with her. Do you know if she was able to solve her problems?

Finally, I will note the following (from our manuscript):

"While we recommend computing differential expression analysis using MAST with RE, alternative methods include using MAST with fixed-effects for individual, a tweedie GLMM, or permutation testing. Accounting for the within-sample correlation with a fixed-effect term for individual, will have a slight difference in interpretation, but should be considered an alternative option to random effects — particularly when the number of independent experimental units is modest. To not violate the exchangeability assumption, permutation methods must randomize at the independent experimental unit (e.g., individual) and properly account for covariates (i.e., conditional permutation). The tweedie GLMM method, which was selected for its distributional flexibility that can account for zero inflation, could be implemented using the “glmmTMB” R-package34, but other mixed models could also be applied using a more appropriate distribution"

In addition, I will note that if you have reasonable balance across samples (that is the number of cells within your cell type of interest is reasonably balanced across individuals) then a mean-of-means approach (pseudo-bulk - where you take the mean of expression in all cells within an individual) will perform comparably. This is not a bad option if MAST isn't working for you and you can't figure out any other methods.

I hope this helps!!

jgamache014 commented 3 years ago

@kdzimm, thank you so much for your help and advice!

I changed a few things according to your suggestions and it seems to be working now. I took the corrected count data from SCTransform and then log2(x + 1) transformed those values. I still got an error (grouping factors must have > 1 sampled level), so I filtered out genes expressed in <10% of cells, following recommendations here and here. I also got convergence warnings so I added fitArgsD = list(nAGQ = 0) to the zlm() function, as was described in the github post you linked above. Here is what I ran:

Micro1_for_DE_test <- readRDS("~/Documents/10X_data/Micro1_for_DE_test.rds")
allgenes_JG <- t(as.data.frame(Micro1_for_DE_test@assays$SCT@counts))
allgenes_JG <- as.data.frame(allgenes_JG)
ngenes <- ncol(allgenes_JG)
allgenes_JG$orig.ident <- Micro1_for_DE_test@meta.data$orig.ident
allgenes_JG$sampID <- Micro1_for_DE_test@meta.data$sampID
allgenes_JG$wellKey <- paste(Micro1_for_DE_test@meta.data$sampID, rownames(Micro1_for_DE_test@meta.data), sep = "_")
rownames(allgenes_JG) <- allgenes_JG$wellKey
allgenes_JG <- allgenes_JG[,c(ngenes+3,ngenes+2,ngenes+1,1:ngenes)]
genecounts_JG <- as.matrix(t(allgenes_JG[,c(-1,-2,-3)]))
genecounts_JG <- log2(genecounts_JG + 1)
coldata_JG <- allgenes_JG[,1:3]
coldata_JG$orig.ident <- as.factor(coldata_JG$orig.ident)
nrow(genecounts_JG) #31052
genecounts_JG <- genecounts_JG[rowSums(genecounts_JG>0)/(ncol(genecounts_JG))>=0.1,] #filter out genes expressed in <10% of cells
nrow(genecounts_JG) #6409
genecounts_JG <- genecounts_JG[,rownames(coldata_JG)]

fData_JG <- data.frame(primerid=rownames(genecounts_JG))
sca_JG <- suppressMessages(MAST::FromMatrix(exprsArray=genecounts_JG, cData=coldata_JG, fData=fData_JG))
cdr2_JG <- colSums(SummarizedExperiment::assay(sca_JG)>0)

SummarizedExperiment::colData(sca_JG)$ngeneson <- scale(cdr2_JG) 
SummarizedExperiment::colData(sca_JG)$orig.ident <- factor(SummarizedExperiment::colData(sca_JG)$orig.ident) 
SummarizedExperiment::colData(sca_JG)$sampID <- factor(SummarizedExperiment::colData(sca_JG)$sampID)

zlmCond_JG <- suppressMessages(MAST::zlm(~ orig.ident + ngeneson + (1 | sampID), sca_JG, method='glmer',ebayes = F, fitArgsD = list(nAGQ = 0), strictConvergence = FALSE))
#Warning message:
#  In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
#                 Model failed to converge with max|grad| = 0.00353329 (tol = 0.002, component 1)

summaryCond_JG <- suppressMessages(MAST::summary(zlmCond_JG,
                                              doLRT='orig.identNormal'))
summaryDt_JG <- summaryCond_JG$datatable
fcHurdle_JG <- merge(summaryDt_JG[contrast=='orig.identNormal' & component=='H',.(primerid, `Pr(>Chisq)`)], 
                      summaryDt_JG[contrast=='orig.identNormal' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], 
                      by='primerid')
fcHurdle_JG[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
fcHurdle_JG <- stats::na.omit(as.data.frame(fcHurdle_JG)) #2 genes omitted
head(fcHurdle_JG)
#  primerid Pr(>Chisq)        coef      ci.hi       ci.lo       fdr
#1      A2M 0.01240386 0.403803934 0.68463098  0.12297688 0.6306383
#2    AAGAB 0.67676338 0.018051752 0.06926577 -0.03316227 0.9831146
#3     AAK1 0.78425090 0.056680721 0.24825262 -0.13489118 0.9831146
#4    AAMDC 0.95044328 0.006356309 0.04978334 -0.03707072 0.9908932
#5     AARS 0.64261859 0.021193588 0.06544220 -0.02305502 0.9831146
#6    AASDH 0.83929780 0.005427686 0.04351511 -0.03265974 0.9831146

I still got a convergence warning, but I'm guessing that is due to the 2 genes that were filtered out using na.omit.

For interpreting the results in fcHurdle, are these definitions of the variables correct? (I was not able to find definitions in the MAST vignette.) Pr(>Chisq): raw p-value coef: logFC fdr: adjusted p-value ci.hi: ? ci.low: ?

Lastly, would positive logFC values indicate higher expression in the orig.ident Normal group? I did not set a reference level for this variable before running zlm() so I assume it used the non-Normal group by default.

Thanks again!

kdzimm commented 3 years ago

Looks good! Great job figuring it out! What you did is not easy. Hopefully, as single-cell continues to develop this will get easier for users to run mixed models with single-cell data. Yes, it is likely you will always have a set of genes that do not converge - particularly with sparse single-cell data and with small sample sizes.

ci.hi is the high end of the 95% confidence interval for your logFC. ci.low is the low end. I believe the Normal group is your reference level based on your doLRT='origin.identNormal' command, but I always like to double check this on one or two genes by hand with the raw data after the fact - just to be certain I am interpreting direction correctly.

Take good care! I am going to close this for now, but please let me know if you have any other questions!

jgamache014 commented 3 years ago

Hi @kdzimm,

I have 2 follow-up questions.

1) What do you think about adding cell type proportion data for each donor (for a given cell type, the number of cells for a donor divided by that donor's total cell count) as a covariate to the model? Say I'm comparing case vs. control in microglia only. There is sample-to-sample variation in cell type composition, which is something we want to control for. Donor 1 may have a larger proportion of microglia than Donor 2. Does it make sense to add microglia proportion for each donor as a covariate? Running the test this way in a trial run worked, and actually slightly increased the number of significant genes. An example of the proportion data is shown below (from colData_JG): image

2) We are also working with single-nucleus ATAC-seq data and have been running differential accessibility tests in Seurat using logistic regression with total number of fragments as a latent variable, as in this vignette. Do you think we could use your same framework to incorporate a random effect for individual for the ATAC data?

Thanks!

kdzimm commented 3 years ago

@jgamache014 ,

These are really good questions.

1) I think this is a good idea and I would say it makes complete sense to do! You may also want to consider computing a principal component analysis on your cell type proportions and adding a few of those principal components into your model as covariates instead. This may be better than just the raw proportion of your cell type of interest, because it is capturing even more information about the composition at large. I will note: One thing I still learning about myself is compositional data. It may be something you want to look into here, because there will be a negative correlation between the proportions of cells in each sample (i.e., a larger proportion of astrocytes in one sample will shrink the proportion of microglia in that same sample). You can read a bit more about that here: http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf

2) Absolutely! We ourselves have been hoping for a while now to expand this framework to scATAC-seq (as well as other single-cell omics). I don't have any experience with ATAC-seq data and I don't know what the distributional assumptions etc. are there, but the concept of using a random effect for individual will most likely hold. In our concluding paragraph of our manuscript we wrote the following, which I believe is relevant:

"Although our focus here is on hypothesis testing for finding differentially expressed genes within a cell type across conditions, the concept is applicable when comparing expression patterns between cell types and is broadly appropriate for all single-cell sequencing technologies such as proteomics, metabolomics, and epigenetics."

Thanks for thinking through your analysis carefully and asking meaningful questions! I believe your research will be much better for it.

jgamache014 commented 3 years ago

Thank you so much for your thorough responses and for pointing me to these resources - I really appreciate your time and help!

jgamache014 commented 3 years ago

Hi @kdzimm! I've been using this pipeline on other clusters, and sometimes I get the following error after running zlm():

Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, : pwrssUpdate did not converge in (maxit) iterations

It will still run though, and then I'll get a similar error when generating a results summary with LRT:

Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, : pwrssUpdate did not converge in (maxit) iterations

This will also still run and I still get results, but I'm not sure if I can trust them. I'm just wondering if you've come across this error before? One of the few places I found the error referenced is lme4 issue #573. Do you recommend trying the workaround posted by bbolker?