Skewed p-value distribution

kimkh415 commented 1 year ago

Hello DAESC team,

I have a question about interpreting the result using the DAESC-MIX model. I have tested ~10000 genes for allelic imbalance among three conditions. Then I plotted a histogram of p-values.

Looking at the p-value distribution, it is very skewed, identifying almost all genes as significant. Can you share your insights on this?

The dataset used here is 10X Visium.

Thank you!

gqi commented 1 year ago

DAESC-Mix is developed for datasets with many individuals (e.g. >20). It can lead to overfitting when the number of individuals is small, which is likely to have caused the skewed p-values. I recall you have 2 conditions and 2 individuals per condition. For this application I would suggest using DAESC-BB.

seanken commented 1 year ago

Hey! Following up (I am working with Kwanho on this, excited to play more with DAESC, seems really cool!) we've tested in a few settings (ranging from a few samples per condition to a couple dozen) and had the same issue (both an issue with the bb and mix methods). Digging into it looks like the issue only happens when we have >2 conditions (so when the input model matrix, x, has >1 column). Is that expected behavior? Looking at the code from DAESC I am guessing it is built to assume there is only one column in the model matrix (think both setting param=param.init[-2] when fitting the null model and using df=1 in the p-value calculation are built on this assumption I think, assuming I am following along with the code correctly?), though not 100% sure? Thanks!

gqi commented 1 year ago

Hi Sean, Thank you for pointing out the issue. You were right that the null model fitting assumed one covariate. I have updated the code to allow an arbitrary number of covariates. This is achieved through a new argument xnull, which is the design matrix under the null hypothesis. Users can specify any null model through xnull. Please see the README page for the updated example.

In addition, the package now no longer appends 1's (intercepts) to the design matrix. It instead uses the original matrix that users specify. If users would like to include an intercept, they should specify it in argument x. I think this addresses Kwanho's earlier question on design matrix specification.

Guanghao

seanken commented 1 year ago

Hey,

Awesome, thanks! Excited to give it a try!

Best wishes,

Sean

seanken commented 1 year ago

One more thought: looks like the code still uses df=1 for the LRT test (the pchisq command), but if I remember right I think that doesn't work if the null hypothesis and the full model differ by more than 1 variable (will lead to overinflated p-values in other cases I think)?

gqi commented 1 year ago

Sorry for missing this. I updated the df to be ncol(X)-ncol(XNull). Thanks for looking at the code!

seanken commented 1 year ago

Thanks!

kimkh415 commented 1 year ago

Thanks again Guanghao! Will close this issue now.

gqi / DAESC

Skewed p-value distribution #2