rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

SNP has low variance. #542

Open Isabella678 opened 1 month ago

Isabella678 commented 1 month ago
Hi, I have the same problem.

I want to perform rare-variant (MAF < 1%) analysis in regenie, and I have tried many times to follow your earlier advice to others, but I still encounter the same error: Uh-oh, SNP chr1_1337985_T_C;chr1_1337983_TCTG_T;chr1_1337985_T_G has low variance (=0.000000).
I can show you my code:

plink2 --pgen "$PGEN_FILE" \
       --pvar "$PVAR_FILE" \
       --psam "$PSAM_FILE" \
       --maf 0.01 \
       --mac 100 \
       --write-snplist allow-dups \
       --out out_step1

plink2 --pgen "$PGEN_FILE" \
       --pvar "$PVAR_FILE" \
       --psam "$PSAM_FILE" \
       --max-maf 0.01 \
       --mac 100 \
       --write-snplist allow-dups \
       --out out_step2

And then I run regenie:

regenie --pgen "$PGEN_FILE" \
        --phenoFile "$PHENO_FILE" \
        --extract "$SNPLIST" \
        --out "$OUTPUT_PREFIX" \
        --threads 10 \
        --step 1 \
        --lowmem \
        --lowmem-prefix lwmem_tmp \
        --bsize 100 \
        --pThresh 0.05 \
        --force-step1 \
        --force-qt

But at step 1, I still get the same error... I am very frustrated...

[screenshot of the error]

Could you please give me some advice? Thanks!

Isabella

Originally posted by @Isabella678 in https://github.com/rgcgithub/regenie/issues/33#issuecomment-2257630717

dvg-p4 commented 1 month ago

Are those variants so rare that they're not present at all? "Variance = 0.0" sounds like they might have AAC = 0. (Or AAF = 1.0)

Isabella678 commented 1 month ago

Are those variants so rare that they're not present at all? "Variance = 0.0" sounds like they might have AAC = 0. (Or AAF = 1.0)

But my research is focused on rare variants, so I have already filtered on MAF, retaining only variants with MAF < 0.01. How should I modify my parameters to use regenie for the analysis of rare variants?

Thanks.

Ojami commented 3 weeks ago

In step 1, you should use common variants of high quality (e.g. directly genotyped or high coverage from WGS/WES). Please see here for more details:

We included in step 1 of REGENIE (that is, prediction of individual trait values based on the genetic data) array variants with MAF > 1%, <10% missingness, Hardy–Weinberg equilibrium test P > 10^-15 and linkage disequilibrium (LD) pruning (1,000 variant windows, 100 variant sliding windows and r^2 < 0.9). We excluded from step 1 any SNPs with high inter-chromosomal LD, in the major histocompatibility (MHC) region, or in regions of low complexity.

The whole purpose of step 1 is to find the best genetic predictors of your phenotype, to account for cryptic relatedness/population stratification. This is ideally achieved with independent common variants. Once you have generated your pred file, you can move on to step 2 (unless you want to skip step 1, which is OK as long as you are sure about the homogeneity of your cohort). For rare-variant analysis, instead of testing each rare variant separately, look into gene-/region-based tests.
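The filters quoted above can be sketched as a simple per-variant predicate (a minimal illustration with hypothetical column names; in practice plink2 applies the same thresholds directly, e.g. --maf 0.01 --geno 0.1 --hwe 1e-15, with LD pruning via --indep-pairwise 1000 100 0.9):

```python
# Illustrative step 1 variant QC, mirroring the filters quoted above:
# MAF > 1%, missingness < 10%, HWE test p > 1e-15.
# The dict keys below are hypothetical, for illustration only.

def passes_step1_qc(variant):
    """Return True if a variant meets the common-variant filters for step 1."""
    return (variant["maf"] > 0.01
            and variant["missing_rate"] < 0.10
            and variant["hwe_p"] > 1e-15)

variants = [
    {"id": "rs1", "maf": 0.25, "missing_rate": 0.01, "hwe_p": 0.4},   # keep
    {"id": "rs2", "maf": 0.002, "missing_rate": 0.01, "hwe_p": 0.4},  # too rare
    {"id": "rs3", "maf": 0.30, "missing_rate": 0.20, "hwe_p": 0.4},   # too much missingness
]

keep = [v["id"] for v in variants if passes_step1_qc(v)]
print(keep)  # ['rs1']
```

LD pruning is deliberately left out of the sketch: it is a pairwise operation over a window of variants, not a per-variant filter, which is why it is delegated to plink2.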

Cheers Oveis

Isabella678 commented 1 week ago

Thank you very much for your response. Following your advice, I first included only the variants with MAF greater than 0.01 in step 1 to generate the prediction file for step 2. The code for step 1 is as follows:

regenie --pgen "$PGEN_FILE" \
        --phenoFile "$PHENO_FILE" \
        --out "$OUTPUT_PREFIX" \
        --threads 10 \
        --step 1 \
        --lowmem \
        --force-qt \
        --bsize 100

Then, for the second step, I used the following code:

regenie --step 2 \
       --pgen "$PGEN_FILE" \
       --covarFile "$COV_FILE" \
       --phenoFile "$PHENO_FILE" \
       --bt \
       --firth --approx \
       --pred out_step1_pred.list \
       --anno-file "$ANNO_FILE" \
       --set-list "$SET_LIST" \
       --mask-def "$MASK_FILE" \
       --write-mask \
       --bsize 200 \
       --aaf-bins 0.01 \
       --minMAC 10 \
       --check-burden-files \
       --vc-maxAAF 0.01 \
       --out "$OUTPUT_PREFIX"

After that, I obtained the .regenie file. Is this the correct approach? Thanks.
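For a quick sanity check, the space-delimited .regenie output can be loaded and the strongest associations inspected (a minimal sketch with toy data; the ID and LOG10P column names follow regenie's step 2 output, but real files contain more columns and the exact layout depends on version and options):

```python
import csv, io

# Toy stand-in for a .regenie step 2 results file (space-delimited, one
# header line). Mask IDs and column layout here are illustrative only.
toy = """ID TEST LOG10P
GENE1.Mask1.0.01 ADD 5.2
GENE2.Mask1.0.01 ADD 0.8
GENE3.Mask1.0.01 ADD 2.9
"""

rows = list(csv.DictReader(io.StringIO(toy), delimiter=" "))
# Sort gene sets by strength of association (larger LOG10P = smaller p).
rows.sort(key=lambda r: float(r["LOG10P"]), reverse=True)
print([r["ID"] for r in rows[:2]])  # ['GENE1.Mask1.0.01', 'GENE3.Mask1.0.01']
```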

Ojami commented 1 week ago

What's the reason for --force-qt in step 1? Also, you forgot to include covarFile in step 1. Other than that, it seems OK.

Cheers/Oveis

Isabella678 commented 6 days ago

Dear Oveis,

I would like to express my gratitude for your prompt response. The reason I used the --force-qt flag is that my phenotype is a binary variable. Without this parameter, I encountered the following error:

ERROR: phenotype 'neuro_sz' has very few unique values (=2). If you really want to analyze it as a QT, use --force-qt.

Regarding the covariate file, I appreciate your suggestion, and I have already included it in step 1.

Additionally, I have a question about gene-based analysis. I have defined my mask file into three Masks as follows:

##MASKS=<Mask1="frameshift,stopgain,startloss,stoploss"; Mask2="nonsynonymous"; Mask3="nonframeshift">

Upon obtaining the .regenie file, I noticed that some results underwent Firth correction (2,403 of 17,824 gene sets). I then used the obtained LOG10P values for multiple-testing correction, and the corrected P-values were almost all 1. Is the Firth correction necessary in step 2? And if I were to use FDR or Bonferroni correction, would it be sufficient to correct within each Mask rather than correcting all results together (for example, with n = 17,824)?

I look forward to your insights on this matter.

Best regards, Isabella

Ojami commented 6 days ago

It helps if you post the full log (you can also use --debug for more detailed log).

The reason I used the --force-qt flag is that my phenotype is a binary variable. Without this parameter, I encountered the following error:

ERROR: phenotype 'neuro_sz' has very few unique values (=2). If you really want to analyze it as a QT, use --force-qt.

Then you should treat your phenotype as a binary trait (--bt flag). Treating a binary trait as a quantitative one is a clear example of model misspecification.

Upon obtaining the .regenie file, I noticed that some results underwent Firth correction (out of 17,824 gene sets, 2,403 gene sets underwent Firth correction). Subsequently, I used the obtained LOG10P values for multiple testing correction, and the P-values were almost all 1. I am curious if the Firth correction is necessary in step 2.

The necessity of Firth's penalty depends on numerous factors, among them case/control imbalance and the frequency of the genetic predictors. In the context of gene-/region-based tests, the Firth penalty is usually recommended (see here). By default, REGENIE falls back to Firth correction when the nominal p-value is below --pThresh (default 0.05), so in your case 2,403 gene sets had a nominal p < 0.05.

If I were to use FDR or Bonferroni correction, would it be sufficient to correct within each Mask rather than correcting all results together (for example, with n=17,824)?

This has also been covered here and is already implemented in REGENIE: you can use the joint GENE_P values, which account for all mask/AAF combinations. You can then apply your multiple-testing correction method (FDR, FWER, etc.) to GENE_P (or to the other joint p-values that REGENIE offers).
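Note that regenie reports LOG10P = -log10(p), so the values must be converted back to p-values before any correction. A minimal numpy sketch with made-up LOG10P values (Benjamini-Hochberg is written out by hand here to stay self-contained; statsmodels' multipletests offers the same adjustments):

```python
import numpy as np

# regenie reports LOG10P = -log10(p), so p = 10 ** (-LOG10P).
# Hypothetical joint GENE_P scores for a few genes:
log10p = np.array([6.2, 3.1, 1.4, 0.3])
p = 10.0 ** (-log10p)

n_tests = p.size  # in practice, the number of gene sets tested (e.g. ~18,000)

# Bonferroni: multiply by the number of tests, cap at 1.
p_bonf = np.minimum(p * n_tests, 1.0)

# Benjamini-Hochberg FDR: rank-based step-up adjustment.
order = np.argsort(p)
ranked = p[order] * n_tests / np.arange(1, n_tests + 1)
p_bh = np.empty_like(p)
p_bh[order] = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)

print(p_bonf)
print(p_bh)
```

With a weak signal (LOG10P = 0.3, i.e. p ≈ 0.5), the Bonferroni-adjusted value is capped at 1, which is why most corrected p-values in a genome-wide scan end up at or near 1.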

Last but not least, always check the inflation of your test statistics (whether you end up using the joint tests or the individual masks). Hope this helps.
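One common inflation check is the genomic-control lambda, sketched below (a minimal illustration using scipy; values near 1 suggest well-calibrated tests, while values well above 1 suggest inflation from stratification or model misspecification):

```python
import numpy as np
from scipy.stats import chi2

def lambda_gc(pvals):
    """Genomic-control lambda: median observed 1-df chi-square statistic
    (recovered from the p-values) over the chi-square(1) median."""
    observed = chi2.isf(np.asarray(pvals), df=1)  # p -> 1-df chi-square
    return np.median(observed) / chi2.ppf(0.5, df=1)

# Under the null, p-values are uniform and lambda should be close to 1.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=20000)
print(round(lambda_gc(null_p), 2))
```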

Good luck! Oveis