heiniglab / scPower

Experimental design framework for scRNAseq population studies (eQTL and DE)
44 stars 5 forks source link

non ideal gamma.mixed.fits crash #6

Closed daniel-spies closed 2 years ago

daniel-spies commented 2 years ago

hi there,

brief description of the issue:

I'm using a public datasets consisting of healthy cells from two timepoints, I'm considering the full data though used the two timepoints separately to generate gamma fits for each timepoint and the full dataset in order to generate cell-type specific gamma.mixed.fits.

When I then try to use the mixed.gamma.fits with the optimize.constant.budget function I get the following error:

Error in if (fdr.optimization(x = lowerBound, fdr = sign.threshold, m0 = m0,  : 
  missing value where TRUE/FALSE needed

Debugging the functions I figured out that in the calculate.probabilities function the qgamma distribution is partially creating NA values and therefore also NA dispersions. This in turn leads to NAs in the expression probabilities which in turn create NAs in the summed up exp.probs. These create final NA in lowerBound and m0 to then throw the error in fdr.optimization

exp.genes<-round(sum(sim.genes$exp.probs))
    lowerBound<-sign.threshold/exp.genes
    m0<-exp.genes-round(sum(foundSignGenes$exp.probs))

I guess the easiest would be to omit NAs in the sum function. For me it's hard to judge as for sure I'm not dealing with ideal data and most likely should not use these fits. though overall 1000 genes out of the 21k sampled from the qgamma are affected.

Would you suggest to rather use an overal gamma.fit for each celltype instead and run the power functions in a loop to get the different combinations instead?

many thanks Daniel

KatharinaSchmid commented 2 years ago

Hi Daniel,

thanks for reporting the issue. I agree, ignoring the NA expression probabilities in the sum is the easiest fix for your problem, which I could implement immediately. But it would be better to check where the NA values come from and rather solve this. If you say that you get NA dispersion values, this would suggest that the negative binomial fit is not working for some of the genes. Have you checked which genes are affected, for example the very lowly expressed ones? If something like this is the case, it might to be the best option to set these genes to zero before running the gamma fit.

You said that you work with public available data. If you want and it is possible, you could also share your data set and code with me, then I can have a look at the problem myself. I didn't get such an error so far.

Regards, Katharina

daniel-spies commented 2 years ago

I started using the filtered count matrix of the authors. Just using the gamma.fit for the whole data and the power.same.ReadDepth.restrictDoublets function works perfectly fine, just creating my own gamma.mixed.fits is causing the errors.

As I started with counts I did not do any sub-sampling using reads but splitting the matrix by phenotype. As it's two different time points in development my guess now would be that the mixed.gamma.fits for each phenotype + the whole data then encounters problems with genes specific for a certain phenotype.

This is most likely the case as for each cell-type I'm selecting the most highly expressed genes from the healthy reference scRNA-seq and combine it with the fold-changes of a bulk RNA-seq WT vs. KO to simulate the power of a potential scRNA-seq KO dataset.

I will try to set lowly expressed genes to 0 in the subsets as you suggested. Alternatively instead of subsetting I was thinking to combine the gamma.fits of several datasets instead and see if that's then more accurate.

I'll keep you updated

thanks Daniel

daniel-spies commented 2 years ago

Dear Katharina,

replacing the lowly expressed gene counts with 0 did the trick for the gamma.mixed.fits and allowed me to run the optimize.constant.budget function. Though I also noticed with the new gamma.fits filtering lowly expressed genes the overall power detection is reduced comparing the outputs of power.sameReadDepth.restrictedDoublets. My guess is that the removal of the lowly expressed genes will increase the correlation of cells as they seem to be very similar for the higher expressed genes or that many of those are actually among the top DEGs in the bulk-seq. Thus, for me highlighting also the importance of the ratio of DEGs to non-DEGs that are simulated and overlap with reference DEGs.

replacing the lowly expressed genes only for the gamma fitting step though worked like a charm, as you suggested.

many thanks for the fast replies and feedback Daniel