kdkorthauer / scDD

R package to identify genes with differential distributions in single-cell RNA-seq

testZeroes issue #7

Closed szimmerman92 closed 7 years ago

szimmerman92 commented 7 years ago

Hi,

I am using scDD version 0.99.0. When I ran the scDD function with the option testZeroes=TRUE, I got the error below:

RES_scdd_m1.m2 <- scDD(m1.m2.eset, prior_param=prior_param,testZeroes=TRUE)

Notice: There exist genes that are all (or almost all) zero. For genes with 0 or 1 nonzero measurements per condition, only testing for DZ
Notice: There exist genes with constant nonzero values. These genes will only be considered for the DZ pattern.
Clustering observed expression data for each gene
Setting up parallel back-end using 8 cores
Notice! Number of permutations is set to zero; using Kolmogorov-Smirnov to test for differences in distributions instead of the Bayes Factor permutation test
Classifying significant genes into patterns
Error in model.frame.default(formula = y > 0 ~ detection, drop.unused.levels = TRUE) :
  variable lengths differ (found for 'detection')

However, when I run the scDD function with testZeroes set to FALSE, the function runs to completion. I traced the error to the "testZeroes" function in the R file classify.dd.R, specifically line 268:

M0 <- arm::bayesglm(y>0 ~ detection, family=binomial(link="logit"))

When I changed the code slightly to convert y>0 from a matrix to a logical vector, as shown below,

M0 <- arm::bayesglm(as.vector(y>0) ~ detection, family=binomial(link="logit"))

the code worked. I am not sure why my data caused this error, but I thought I would let you know in case this is a bug. If you need any more information or files from me to reproduce it, I am happy to provide them.
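One possible mechanism, not confirmed anywhere in this thread, is that y reaches the model call as a one-row matrix. model.frame() counts the rows of a matrix response as the number of observations, so a 1 x n matrix looks like a single observation next to an n-element covariate. The sketch below reproduces the same error message in base R with made-up data; the names y_vec, y_mat and the toy values are hypothetical, and stats::model.frame stands in for the model-frame step that arm::bayesglm performs internally.

```r
# Hypothetical toy data: one gene measured in 10 cells.
set.seed(1)
detection <- runif(10)                     # stand-in for a per-cell detection rate
y_vec <- c(0, 3, 0, 5, 2, 0, 0, 4, 1, 0)   # expression as a plain numeric vector
y_mat <- matrix(y_vec, nrow = 1)           # the same values as a 1 x 10 matrix

# A one-row matrix response is treated as ONE observation,
# so it no longer lines up with the 10-element covariate.
err <- tryCatch(
  model.frame(y_mat > 0 ~ detection),
  error = function(e) conditionMessage(e)
)
print(err)

# Flattening the comparison to a logical vector restores 10 observations.
mf <- model.frame(as.vector(y_mat > 0) ~ detection)
print(nrow(mf))
```

If this is what happens inside testZeroes, as.vector() is a reasonable guard, though it would not explain why the error depends on the dataset.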

Best, Sam Zimmerman

kdkorthauer commented 7 years ago

Hi Sam,

Thanks for opening this issue. I'm guessing this could be due to the presence of some genes in your dataset that are not expressed in any cells (all zero). If that is the case, I think rather than forcing the test of differential zeroes on those genes, it would be better for me to modify the code to skip those genes.

Could you test this theory by first subsetting your m1.m2.eset to include only the genes that have nonzero expression in at least one cell? Or let me know if there aren't any genes that satisfy this criterion. In that case, I'll have to do some more digging to figure out what is causing this.
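A minimal sketch of that filter in base R, using a hypothetical count matrix (genes in rows, cells in columns). For an ExpressionSet such as m1.m2.eset, the matrix would presumably be obtained with Biobase::exprs() and the same logical index applied to the object's rows; that is an assumption about how the data are stored, not something shown in this thread.

```r
# Hypothetical 4-gene x 6-cell count matrix; gene1 and gene4 are all zero.
counts <- matrix(
  c(0, 0, 0, 0, 0, 0,
    1, 0, 2, 0, 0, 3,
    0, 0, 0, 0, 5, 0,
    0, 0, 0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Keep genes with nonzero expression in at least one cell.
keep <- rowSums(counts > 0) > 0
filtered <- counts[keep, , drop = FALSE]   # drop = FALSE keeps the matrix shape

print(rownames(filtered))
```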

Best, Keegan

szimmerman92 commented 7 years ago

Hi Keegan,

Thank you for the quick response. I subsetted the data to remove any genes that are not expressed in any sample and reran the analysis. Unfortunately, I am still getting the same error.

Best, Sam

kdkorthauer commented 7 years ago

Hi Sam,

I appreciate you checking on that for me. Given that information, I am not sure what is causing the issue with creating the model matrix. Although it seems you've already found a workaround using as.vector, I would like to find the root cause of this unexpected result to avoid getting spurious results due to some unforeseen issue.

Would it be possible to send me a portion of your dataset so that I can replicate the error and pinpoint the specific cases where this happens? My email is keegan@jimmy.harvard.edu.

Best, Keegan

kdkorthauer commented 7 years ago

Hi Sam,

Just checking in regarding this issue. I am very interested in determining the root cause, so I would greatly appreciate it if you could send me a subset of your dataset so that I can reproduce the error.

Best, Keegan

szimmerman92 commented 7 years ago

Hi Keegan,

Sorry to get back to you so late, but I have to confirm with my PI and collaborator to make sure this is okay with them. I have a meeting with both today, so I will send you the data at the latest by tomorrow if both parties approve. Thank you.

Best, Sam

kdkorthauer commented 7 years ago

Hi Sam,

I was unable to reproduce the error message with the subsetted data you provided. However, I wanted to note that the data you provided has an extremely low number of cells per condition (7). Ideally to use the scDD model, you'd want to have enough cells in each condition such that there are at a minimum 20 cells with expression for many genes. With just a few cells in each condition, it is not recommended to try to infer subgroups or test for a differential proportion of zeroes.
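A rough way to check a dataset against that guideline is to count, for each gene, the cells with nonzero expression in each condition. The sketch below uses simulated counts with 7 cells per condition; the variable names and the min_cells threshold of 20 are illustrative choices taken from the comment above, not scDD parameters.

```r
# Hypothetical data: 5 genes x 14 cells, 7 cells in each of two conditions.
set.seed(42)
condition <- rep(c(1, 2), each = 7)
counts <- matrix(rpois(5 * 14, lambda = 1), nrow = 5)

min_cells <- 20   # illustrative threshold from the discussion, not an scDD argument

# Number of cells with nonzero expression, per gene (rows) and condition (columns).
nonzero_by_cond <- sapply(
  split(seq_along(condition), condition),
  function(idx) rowSums(counts[, idx, drop = FALSE] > 0)
)

# A gene passes only if every condition has at least min_cells expressing cells.
genes_ok <- rowSums(nonzero_by_cond >= min_cells) == ncol(nonzero_by_cond)
print(sum(genes_ok))
```

With only 7 cells per condition, no gene can reach the threshold, which is the situation described here.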

If you have only provided a subset of cells in the example you sent, please let me know. In that case, it might still be appropriate to apply scDD, but I'd need to take a look at a complete set of cells for a subset of genes to decide.

Best, Keegan

szimmerman92 commented 7 years ago

Hi Keegan,

Unfortunately I only have 7 cells per condition. Thank you for letting me know that 20 cells per condition is the recommended amount. That is very helpful as it will help guide other studies that I do. If you would like I can also send you the exact code that I ran to help replicate the error. Thank you.

Best, Sam

On Fri, Apr 14, 2017 at 2:38 PM, Keegan Korthauer notifications@github.com wrote:

Closed #7 https://github.com/kdkorthauer/scDD/issues/7.
