SirKuikka commented 2 years ago

Hi,

I would very much appreciate if you could show me how you simulated the data in the original paper. The manuscript explains some details, but seeing the R script would make repeating the simulation much easier.

lhe17 commented 2 years ago

Hi,

Sure. Please see the following example code for a simulation.

“

library(nebula)

number of subjects

ng=50

average number of cells per subject

fse=100

subject-level overdispersion parameter sigma, a larger value for more overdispersion. Typical value from 0.001 to 2.

sig2=0.05

cell-level overdispersion parameter, a larger value for more overdispersion. Typical value from 0.0001 to 100. My experience is that Smart-seq2 data often have much larger cell-level overdispersion than 10x data, probably because Smart-seq2 does not use UMI and thus introduces a lot of noises due to PCR duplicates.

sig3=0.2

average expression log(CPM)

lambda_s = -2

count = c()

simulate cell numbers across the subjects. The balanced case using a poisson distribution. The unbalanced case using a negative binomial distribution

fs = rpois(ng,fse)

fs = rnbinom(ng,size=3,mu=fse)

fs[which(fs==0)] = 1

total number of cells

n = sum(fs)

generate a data frame with n cells

df = data.frame(num=1:n)

simulate some cell-level variable of interest from a normal distribution, e.g., percentage of mitochondrial gene expression

df$x = rnorm(n)

simulate some subject-level variable of interest, e.g., a treatment variable from a Bernoulli distribution

zt = rbinom(ng,1,0.5)

df$z = as.numeric(unlist(sapply(1:ng,function(x) rep(zt[x],fs[x]))))

generate subject ID and cell ID

df$id = as.numeric(unlist(sapply(1:ng,function(x) rep(x,fs[x]))))

df$cell = c(1:nrow(df))

simulate subject-specific random effects from a gamma distribution, based on the subject-level overdispersion

indre = rgamma(ng,shape=1/(exp(sig2)-1),rate=1/(exp(sig2/2)*(exp(sig2)-1)))

df$ref = as.numeric(unlist(sapply(1:ng,function(x) rep(indre[x],fs[x]))))

simulate cell-specific random effects from a gamma distribution, based on the cell-level overdispersion

df$wincell = rgamma(n,shape=(1/sig3),rate=(1/sig3))

simulate the library size

df$offset = 1

generate the raw counts based on the variables, library size and random effects

effcell = 0 ## logFC of the cell-level variable

effsub = 0 ## logFC of the subject-level variable

df$y = rpois(n,exp(lambda_s+effcelldf$x+effsubdf$z+log(df$ref)+log(df$wincell)+log(df$offset)))

run nebula

pred = cbind(1,df$x,df$z)

re=nebula(df$y, df$id,pred=pred,offset= df$offset,model='NBGMM',method='LN',kappa = 800, cpc=1e-10,verbose=FALSE)

“

Best regards,

Liang

From: SirKuikka @.> Sent: Monday, September 6, 2021 8:00 AM To: lhe17/nebula @.> Cc: Subscribed @.***> Subject: [lhe17/nebula] Data simulation in the original paper (#3)

Hi,

I would very much appreciate if you could show me how you simulated the data in the original paper. The manuscript explains some details, but seeing the R script would make repeating the simulation much easier.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lhe17/nebula/issues/3 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDISUR2XG43KGZUYEMSHOLUASUL5ANCNFSM5DQJTIWA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub . https://github.com/notifications/beacon/AGDISUSTJ24BXPX3ZD7FWSTUASUL5A5CNFSM5DQJTIWKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4OXUNRYA.gif

SirKuikka commented 2 years ago

Hi,

Thanks, this helps. I was wondering how exactly did you estimate the type I error for this simulation? The manuscript says: "We then assessed the impact of such biases on controlling the type I error rate by testing the fixed-effects predictors under a wide range of settings." Did you randomly sort the subject or cells into some groups/predictors (e.g. case vs control), and then estimated the error (proportion of false positives) under the assumption that there should be no positive findings in such comparisons? Is there some way to estimate type II error or other metrics for the simulation?

lhe17 commented 2 years ago

Hi,

To get the empirical type I error rate, we run the example code a large number of times (e.g., K times) under the null hypothesis effsub=0 or effcell=0, depending on which variable you want to test. We then count how many p-values of a variable are smaller than e.g., 0.05. This number divided by K will give you the estimated empirical Type I error rate under the 0.05 level. You can tune the model parameters (ng, fse, size, sig2, …, etc) to simulate under different situations.

Regarding your general statistical questions, please refer to other statistical books or materials. I have no sufficient expertise to answer those questions.

Best regards,

Liang

From: SirKuikka @.> Sent: Monday, September 6, 2021 12:57 PM To: lhe17/nebula @.> Cc: lhe17 @.>; Comment @.> Subject: Re: [lhe17/nebula] Data simulation in the original paper (#3)

Hi,

Thanks, this helps. I was wondering how exactly did you estimate the type I error for this simulation? The manuscript says: "We then assessed the impact of such biases on controlling the type I error rate by testing the fixed-effects predictors under a wide range of settings." Did you randomly sort the subject or cells into some groups/predictors (e.g. case vs control), and then estimated the error (proportion of false positives) under the assumption that there should be no positive findings in such comparisons? Is there some way to estimate type II error or other metrics for the simulation?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lhe17/nebula/issues/3#issuecomment-913771084 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDISUSZNETBQA7WOKPRAHDUATXFXANCNFSM5DQJTIWA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub . https://github.com/notifications/beacon/AGDISUROGUQ45A6I6VEIG6LUATXFXA5CNFSM5DQJTIWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOGZ3QUTA.gif

SirKuikka commented 2 years ago

Hi,

So basically effsub=0 means the gene is not DE and if I want DE genes, then I set effsub!=0. That's what I was getting at. Thanks for the help.

lhe17 commented 2 years ago

Yes, that’s true.

Best regards,

Liang

From: SirKuikka @.> Sent: Wednesday, September 8, 2021 6:53 AM To: lhe17/nebula @.> Cc: lhe17 @.>; Comment @.> Subject: Re: [lhe17/nebula] Data simulation in the original paper (#3)

Hi,

So basically effsub=0 means the gene is not DE and if I want DE genes, then I set effsub!=0. That's what I was getting at. Thanks for the help.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lhe17/nebula/issues/3#issuecomment-915131069 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDISUUROCGLHMK5R35QTTLUA46BZANCNFSM5DQJTIWA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub . https://github.com/notifications/beacon/AGDISUUZSS6UVI7BBUA4KBTUA46BZA5CNFSM5DQJTIWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOG2F4VPI.gif

lhe17 / nebula

Data simulation in the original paper #3

number of subjects

average number of cells per subject

subject-level overdispersion parameter sigma, a larger value for more overdispersion. Typical value from 0.001 to 2.

average expression log(CPM)

simulate cell numbers across the subjects. The balanced case using a poisson distribution. The unbalanced case using a negative binomial distribution

fs = rnbinom(ng,size=3,mu=fse)

total number of cells

generate a data frame with n cells

simulate some cell-level variable of interest from a normal distribution, e.g., percentage of mitochondrial gene expression

simulate some subject-level variable of interest, e.g., a treatment variable from a Bernoulli distribution

generate subject ID and cell ID

simulate subject-specific random effects from a gamma distribution, based on the subject-level overdispersion

simulate cell-specific random effects from a gamma distribution, based on the cell-level overdispersion

simulate the library size

generate the raw counts based on the variables, library size and random effects

run nebula