rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
182 stars 53 forks source link

Need to set --rare-mac to a large value to get robust interaction results when interactor is rare #417

Open Wainberg opened 1 year ago

Wainberg commented 1 year ago

I'm doing an interaction test in the UK Biobank between a particular candidate gene's rare-variant burden (100 carriers out of 500,000) and each other gene's rare-variant burden. (I specified the candidate gene's burden as a column in the covariate file and used --interaction for convenience.)

I got hugely inflated results for some genes (like p < 1e-50), even though the overall inflation was minimal. All of the top "hit" genes I checked had only two carriers of rare variants in both the candidate gene and the "hit" gene, but because the number of rare-variant carriers in the "hit" gene was > 1000 (which is the --rare-mac threshold), it uses the sandwich estimator rather than the more robust heteroskedastic linear model that's used for genes below the --rare-mac threshold.

I solved the problem by setting --rare-mac 999999 - all of the "hit" genes went away and there were no true positive hits. What do you think of scrapping the sandwich estimator and using the heteroskedastic linear model for all variants regardless of frequency? You could set --rare-mac to infinity by default, in other words.

joellembatchou commented 1 year ago

Hi,

We will be testing an alternate strategy which test with the sandwich estimator for all variants and switch to HLM for tests with p-value below nominal significance (e.g. 0.05). We wanted to keep using robust SE method as a baseline as it robust to general model misspecification. Thanks for pointing out this case.

Cheers, Joelle