rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
181 stars 52 forks source link

Inflation in interaction p-values #263

Open came203 opened 2 years ago

came203 commented 2 years ago

Hi Joelle,

I was testing out some of the new interaction analyses and found something that I can't really explain:

I'm looking at a quantitative imaging phenotype in UKB Europeans (N=33,000) and want to see interaction of the imputed UKB SNPs with a rare environmental binary exposure (present only in 75/33,000 IDs) that I put in the cat_covar list in step2. I understand that the power to detect something here is miniscule, especially for low-frequency variants, but I nevertheless ran the analysis in PLINK2 and now in regenie v3. I'm only looking at the term "ADD-INT_SNPxVAR=1" which corresponds to the PLINK term "ADDxVAR"

It seems that for MAF > 10%, everything works out nicely, the resulting p-values from PLINK2 and REGENIE seem to correlate quite well (Spearman's r=0.9) with some outliers as expected. In general, p-values in REGENIE are more conservative than PLINK. NOTCH3_WMH_MAF01

QQ-Plot REGENIE - seem to be fine and not too inflated grafik

QQ-PLOT PLINK - a bit inflated grafik

However, once I go to MAF < 10%, things start getting weird. P-values are heavily inflated in the REGENIE results, up to -log(P) of 150

WMH_MAFlower01

QQ-Plot REGENIE: Here you can see the extreme inflation grafik

In PLINK, the p-values are still quite well-behaved although a bit inflated QQ-Plot PLINK QQ_PLINK

I'm trying to understand what happens here, I doesn't seem to happen with the binary outcome phenotypes I tested (same environmental exposure). I haven't tested a quantitative outcome phenotype that's available in the whole dataset yet.

Rainer

joellembatchou commented 2 years ago

Hi Rainer,

We did not consider this extreme scenario where the risk factor is this rare so the current approach of using robust SE for variants with MAC>1000 (by default) may not be optimal here; similarly to how rare variants tested in GxE with robust SE can give inflation, the exposure being very rare can make it worse even if the variant tested is not so rare.

The heteroskedastic linear model (HLM) model does not suffer from this limitation of the robust SE so can you re-do the run but this time enforcing that HLM model be used for all variants (ie set high MAC threshold for HLM;eg --rare-mac 1000000000). How do the results look when you only use HLM strategy for GxE test?

Cheers, Joelle

Shicheng-Guo commented 2 years ago

Agree, robust SE will be very sensitive to rare variants and the association will be significant inflated. I am wondering what's the best cut-off for MAC and frequency of exposure? Hope to see some benchmarking test. Thanks. Shicheng

came203 commented 2 years ago

Dear all,

I re-ran the same analysis using the--rare-mac 1000000000. It seems like the p-values are now heavily deflated:

grafik

Note that these are all variants regardless of MAF.

Here are the variants <10% MAF that caused the problems before: grafik

Again, deflation of p-values happens here.

Best wishes Rainer

joellembatchou commented 2 years ago

Hi Rainer,

Thank you for including the results from only using HLM approach. As mentioned before, this is not a case we have considered when developing this strategy (ie very rare risk factor used in interaction) but we will do further investigation.

Cheers, Joelle