Open Wainberg opened 1 year ago
Hi,
We will be testing an alternate strategy which test with the sandwich estimator for all variants and switch to HLM for tests with p-value below nominal significance (e.g. 0.05). We wanted to keep using robust SE method as a baseline as it robust to general model misspecification. Thanks for pointing out this case.
Cheers, Joelle
I'm doing an interaction test in the UK Biobank between a particular candidate gene's rare-variant burden (100 carriers out of 500,000) and each other gene's rare-variant burden. (I specified the candidate gene's burden as a column in the covariate file and used
--interaction
for convenience.)I got hugely inflated results for some genes (like p < 1e-50), even though the overall inflation was minimal. All of the top "hit" genes I checked had only two carriers of rare variants in both the candidate gene and the "hit" gene, but because the number of rare-variant carriers in the "hit" gene was > 1000 (which is the
--rare-mac
threshold), it uses the sandwich estimator rather than the more robust heteroskedastic linear model that's used for genes below the--rare-mac
threshold.I solved the problem by setting
--rare-mac 999999
- all of the "hit" genes went away and there were no true positive hits. What do you think of scrapping the sandwich estimator and using the heteroskedastic linear model for all variants regardless of frequency? You could set --rare-mac to infinity by default, in other words.