rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Extreme p-values with stratified subcohorts #412

Open irisjansen opened 1 year ago

irisjansen commented 1 year ago

Hi!

I am trying to run regenie in UKB on subcohorts that are stratified based on a SNP (the original p-value in the total cohort is 1.06e-26). At a general level, I do the following:

Strangely, the locus that I stratified on now shows an extreme association signal (e.g. for non-carriers, see the LocusZoom plot):

strange-signal

The sample size for this non-carrier group is 117152 individuals. I would expect the signal to be gone for the non-carriers. I am unsure why the signal is still there and so strong. Perhaps it has something to do with the fact that missing phenotypes are imputed during step 1, and removed during step 2?

I did a similar stratification analysis for a binary phenotype, and then the signal is gone, as you would expect.

Thank you in advance!

Best wishes, Iris

joellembatchou commented 1 year ago

Hi Iris,

Are you running those two phenotypes (one for carriers and the other for non-carriers) together within the same REGENIE run (i.e. using the multi-trait feature of REGENIE)?

For such an analysis, you need to run these separately, i.e. one run for the phenotype containing the carriers and another run for the other one. Note that you don't need to generate two phenotype files; you can just use --phenoCol to specify which phenotype to use in each run.
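
As a rough illustration of the "one run per phenotype" advice, the sketch below builds one regenie step 2 command per phenotype column from a single shared phenotype file. All file names and column names here are hypothetical placeholders, the block size is an arbitrary example value, and the use of a separate prediction list per phenotype assumes step 1 was also run once per phenotype, which the thread does not spell out. Only the `--phenoCol` usage comes from the comment above.

```python
import shlex


def regenie_step2_commands(pheno_cols, pheno_file, covar_file, bgen_file,
                           pred_lists, out_prefix):
    """Build one regenie step 2 argument list per phenotype column.

    pred_lists maps each phenotype column to the step 1 prediction list
    produced for that phenotype (assumed to come from separate step 1 runs).
    """
    commands = []
    for col in pheno_cols:
        commands.append([
            "regenie",
            "--step", "2",
            "--bgen", bgen_file,
            "--phenoFile", pheno_file,   # same phenotype file for every run
            "--phenoCol", col,           # only this column is analyzed
            "--covarFile", covar_file,
            "--pred", pred_lists[col],
            "--bsize", "400",            # arbitrary example block size
            "--out", f"{out_prefix}_{col}",
        ])
    return commands


# Hypothetical column and file names, for illustration only.
cmds = regenie_step2_commands(
    pheno_cols=["pheno_carriers", "pheno_noncarriers"],
    pheno_file="phenotypes.txt",
    covar_file="covariates.txt",
    bgen_file="genotypes.bgen",
    pred_lists={
        "pheno_carriers": "step1_carriers_pred.list",
        "pheno_noncarriers": "step1_noncarriers_pred.list",
    },
    out_prefix="step2",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

Each printed line is a standalone command, so the two phenotypes never share a run and missingness in one cannot affect the other.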

Alternatively, you could use the interaction testing feature of REGENIE, specifying dominant coding for the interacting SNP (the one you used to stratify into carriers/non-carriers). See the documentation here: https://rgcgithub.github.io/regenie/overview/#step-2-interaction-testing

Cheers, Joelle

rlongchamps commented 1 year ago

Hi Joelle,

I was looking for some clarification on when I am able to use Regenie's multitrait capabilities versus running phenotypes separately.

When running two quantitative traits through the pipeline together, I found that one of them was highly inflated. However, when I ran the inflated trait through the pipeline on its own, the inflation disappeared.

My cohort isn't set up the same way as Iris's as I simply have 20K individuals with two different normally distributed phenotypes.

Any clarification as to why I am unable to run both phenotypes through Step2 at the same time would be greatly appreciated.

Thank you! -Ryan

Chr 21 Manhattan plot when the 2 phenos are run together: image

Chr 21 Manhattan plot when the pheno is run separately for Step 2: image

joellembatchou commented 1 year ago

Hi Ryan,

Is there any missingness in your data for the phenotypes?

Cheers, Joelle

rlongchamps commented 1 year ago

Hi Joelle,

Yes, I hadn't thought about that when setting up this testing phenotype. Here is an excerpt from the log file:

   -number of phenotyped individuals  = 18774
 * covariates       : [merged_covars_testrun.txt] n_cov = 50
   -number of individuals with covariate data = 378521
 * number of individuals used in analysis = 15214
 * number of observations for each trait:
   - 'LF_Ejection_Fraction': 7591 observations
   - 'NTproBNP': 7817 observations
 * LOCO predictions : [quantitative_testrun_1_pred.list]
   -file [quantitative_testrun_1_1.loco] for phenotype 'LF_Ejection_Fraction'
    + 114 individuals with missing LOCO predictions will be ignored for the trait
   -file [quantitative_testrun_1_2.loco] for phenotype 'NTproBNP'
    + 127 individuals with missing LOCO predictions will be ignored for the trait
   -residualizing and scaling phenotypes...done (19ms) 

Why would only one of the phenotypes show inflation, despite having a missingness pattern similar to the one that did not?

joellembatchou commented 1 year ago

Hi Ryan,

So the number of analyzed samples is ~15k, but for each phenotype the number of missing samples is ~7.5k, so each phenotype would have ~50% of its values set to missing in step 1. I don't think the missingness patterns are similar across your two traits: REGENIE automatically removes only the samples that are missing for all of the analyzed phenotypes, so each of the 15,214 samples has either no missingness or exactly one of the two phenotypes missing, but not both. We recommend having at most ~15% missingness, so it would be best to analyze these phenotypes separately.

Cheers, Joelle
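
The missingness figures in the reply above can be checked directly from the log excerpt. The short sketch below is only an arithmetic check: it takes the counts from the log, computes the per-trait missing fraction among the analyzed samples, and compares it with the ~15% guideline. The overlap estimate assumes every analyzed sample has at least one of the two traits observed, as described in the reply; the variable names are illustrative.

```python
# Counts copied from the log excerpt in this thread.
n_analyzed = 15214
observed = {"LF_Ejection_Fraction": 7591, "NTproBNP": 7817}

# Guideline quoted in the reply: at most ~15% missingness per trait.
MAX_RECOMMENDED_MISSING = 0.15


def missing_fraction(n_total, n_observed):
    """Fraction of analyzed samples with no observation for a trait."""
    return (n_total - n_observed) / n_total


fractions = {trait: missing_fraction(n_analyzed, n)
             for trait, n in observed.items()}

# Inclusion-exclusion, assuming each analyzed sample has at least one trait:
# samples with both traits = sum of per-trait counts - analyzed samples.
n_both = sum(observed.values()) - n_analyzed

for trait, frac in fractions.items():
    flag = "above" if frac > MAX_RECOMMENDED_MISSING else "within"
    print(f"{trait}: {frac:.1%} missing ({flag} the ~15% guideline)")
print(f"samples with both traits observed: {n_both}")
```

Under that assumption only a few hundred samples carry both traits, so the two phenotypes are measured on almost disjoint sets of individuals, which is consistent with the advice to analyze them separately.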