rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie
Other
188 stars 55 forks source link

Excluded SNP throwing low variance error in Step 1 #33

Closed Apb58 closed 4 years ago

Apb58 commented 4 years ago

Hello,

I have a set of ~400 patients with about 20000 phenotypes (gene expression markers), and ~50,000 variants, of which I am only using ~2400 using the --exclude option in the step 1 command. However, when I run the step 1, I see that one of the variants that is included in the excluded variant file is throwing a low variance error:

Start time: Mon Sep 14 14:18:59 2020

              |==============================|
              |        REGENIE v1.0.5.6      |
              |==============================|

Copyright (c) 2020 Joelle Mbatchou and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : GTEx_S1_bin_out.log

Options in effect:
  --step 1 \
  --pgen GTEx_test \
  --phenoFile GTEx_Exp_Pheno.2.txt \
  --bsize 500 \
  --lowmem \
  --lowmem-prefix lm_rg \
  --threads 2 \
  --out GTEx_S1_bin_out \
  --exclude problem_vars.txt

Fitting null model
 * pvar             : [GTEx_test.pvar] n_snps = 59247
   -removing variants specified in [problem_vars.txt]
     +number of variants remaining in the analysis = 2403
 * psam             : [GTEx_test.psam] n_samples = 404
 * pgen             : [GTEx_test.pgen]
 * phenotypes       : [GTEx_Exp_Pheno.2.txt] n_pheno = 20519
   -keeping and mean-imputing missing observations (done for each trait)
   -number of phenotyped individuals = 309
 * number of individuals used in analysis = 309
   -residualizing and scaling phenotypes...done (79ms)
 * # threads        : [2]
 * block size       : [500]
 * # blocks         : [7]
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 571MB
 * writing level 0 predictions to disk
   -temporary files will have prefix [lm_rg_l0_Y]
   -approximate disk space needed : 3GB
 * setting memory...done

Chromosome 1
 block [1] : 500 snps  (1ms)
   -residualizing and scaling genotypes...!! Uh-oh, SNP 3466 has low variance (=0).

... And here is the variant in the excluded variant file:

[luser]$ grep "^3466$" problem_vars.txt
3466

Is there an issue where the SNP exclusion file is not being read?

joellembatchou commented 4 years ago

There was a bug in older versions when printing the ID of the variant with low variance.

This was fixed in version 1.0.5.9

Apb58 commented 4 years ago

Hi Joelle,

Thank you; I have updated and tried running step 1 again. It appears that the correct variant is being excluded, but now it appears as though EVERY variant that is passed in is tagged as low variance. I attempted to expand the input variants by removing the exclusion set, to see if it was somehow an issue with the subset of inputs I was using before, but that does not seem to be the case:

              |==============================|
              |        REGENIE v1.0.6.0      |
              |==============================|

Copyright (c) 2020 Joelle Mbatchou and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : GTEx_S1_bin_out.log

Options in effect:
  --step 1 \
  --pgen GTEx_test \
  --bsize 100 \
  --lowmem \
  --lowmem-prefix lwmem_tmp \
  --out GTEx_S1_bin_out \
  --phenoFile GTEx_Exp_Pheno.2.txt \
  --exclude problem_vars.2.txt

Fitting null model
 * pvar             : [GTEx_test.pvar] n_snps = 59247
   -removing variants specified in [problem_vars.2.txt]
     +number of variants remaining in the analysis = 59246
 * psam             : [GTEx_test.psam] n_samples = 404
 * pgen             : [GTEx_test.pgen] 
 * phenotypes       : [GTEx_Exp_Pheno.2.txt] n_pheno = 20519
   -keeping and mean-imputing missing observations (done for each trait)
   -number of phenotyped individuals = 309
 * number of individuals used in analysis = 309
   -residualizing and scaling phenotypes...done (136ms) 
 * # threads        : [96]
 * block size       : [100]
 * # blocks         : [604]
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 570MB
 * writing level 0 predictions to disk
   -temporary files will have prefix [lwmem_tmp_l0_Y]
   -approximate disk space needed : 187GB
 * setting memory...done

Chromosome 1
 block [1] : 100 snps  (0ms) 
   -residualizing and scaling genotypes...!! Uh-oh, SNP 4440 has low variance (=0).

My analysis of the input variants indicates that none of the variant IDs identified have variance == 0 across the patient set, so I am a bit bewildered.

I have attached the input files, if you'd like to take a look.

GTEx_test.zip

Thanks!

joellembatchou commented 4 years ago

Hi,

I am not seeing the phenotype file you used as an input in the zip file you attached. Could you pass me the file or the list of sample IDs that were in the phenotype file?

Apb58 commented 4 years ago

Ah, my apologies; looks like the file was too big to attach (>100MB). I've attached a version that contains a subset of the columns (phenotypes) but keeps all the original rows (sample IDs). GTEx_Exp_Pheno.sub.txt.gz

joellembatchou commented 4 years ago

The link you provided is not working for me (when I click it re-directs to this page)

Apb58 commented 4 years ago

How about now? I have re-added it to the comment.

joellembatchou commented 4 years ago

So I ran PLINK2 to get the alternative allele counts from the PGEN file you used as an example

plink2 --pfile GTEx_test --exclude  problem_vars.2.txt  --keep GTEx_Exp_Pheno.2.txt  --out GTEx_test --freq 'counts' 

PLINK v2.00a3LM AVX2 Intel (25 Feb 2020)       www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GTEx_test.log.
Options in effect:
  --exclude problem_vars.2.txt
  --freq counts
  --keep GTEx_Exp_Pheno.2.txt
  --out GTEx_test
  --pfile GTEx_test

Start time: Tue Sep 29 12:41:38 2020
3711 MiB RAM detected; reserving 1855 MiB for main workspace.
Using up to 2 compute threads.
404 samples (0 females, 0 males, 404 ambiguous; 404 founders) loaded from
GTEx_test.psam.
59247 variants loaded from GTEx_test.pvar.
Note: No phenotype data present.
--exclude: 59246 variants remaining.
--keep: 309 samples remaining.
309 samples (0 females, 0 males, 309 ambiguous; 309 founders) remaining after
main filters.
Calculating allele frequencies... done.
--freq counts: Allele counts (founders only) written to GTEx_test.acount .

When I look into the resulting file:

head -n 20 GTEx_test.acount

#CHROM  ID  REF ALT ALT_CTS OBS_CT
1   3479    1   2   1   618
1   4241    1   2   1   618
1   4331    1   2   1   618
1   4360    1   2   1   618
1   4384    1   2   1   618
1   4391    1   2   1   618
1   4436    1   2   1   618
1   4438    1   2   1   618
1   4440    1   2   0   618
1   4477    1   2   0   618
1   4479    1   2   1   618
1   4487    1   2   0   618
1   4509    1   2   1   618
1   4511    1   2   1   618
1   4524    1   2   1   618
1   4553    1   2   0   618
1   4558    1   2   0   618
1   4587    1   2   1   618
1   4598    1   2   1   618

I do see variant 4440 has alternative allele count of 0 (a.k.a is monomorphic in this data set), and there are many others as well

cat GTEx_test.acount | awk '{if($5==0) {print}}' | wc -l
15449

If I use this to filter out variants with 0 ALT alleles:

cat GTEx_test.acount | awk '{if($5==0) {print $2}}' > problematic_variant_IDs.txt
cat problem_vars.2.txt >> problematic_variant_IDs.txt

and then re-run Regenie (I did it here on a single trait 'ENSG00000223972.4' for simplicity), it works fine

Start time: Tue Sep 29 12:57:35 2020

              |==============================|
              |      REGENIE v1.0.6.2.gz     |
              |==============================|

Copyright (c) 2020 Joelle Mbatchou and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.

Log of output saved in file : GTEx_S1_bin_out.log

Options in effect:
  --step 1 \
  --pgen GTEx_test \
  --bsize 100 \
  --lowmem \
  --lowmem-prefix lwmem_tmp \
  --out GTEx_S1_bin_out \
  --phenoFile GTEx_Exp_Pheno.2.txt \
  --phenoCol ENSG00000223972.4 \
  --exclude problematic_variant_IDs.txt

Fitting null model
 * pvar             : [GTEx_test.pvar] n_snps = 59247
   -removing variants specified in [problematic_variant_IDs.txt]
     +number of variants remaining in the analysis = 43797
 * psam             : [GTEx_test.psam] n_samples = 404
 * pgen             : [GTEx_test.pgen] 
 * phenotypes       : [GTEx_Exp_Pheno.2.txt] n_pheno = 1
   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals = 309
   -number of individuals remaining with non-missing phenotypes = 309
 * number of individuals used in analysis = 309
   -residualizing and scaling phenotypes...done (0ms) 
 * # threads        : [2]
 * block size       : [100]
 * # blocks         : [448]
 * # CV folds       : [5]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 7MB
 * writing level 0 predictions to disk
   -temporary files will have prefix [lwmem_tmp_l0_Y]
   -approximate disk space needed : 7MB
 * setting memory...done

Chromosome 1
 block [1] : 100 snps  (0ms) 
   -residualizing and scaling genotypes...done (0ms) 
   -calc working matrices...done (1ms) 
   -calc level 0 ridge...done (34ms) 
.
.
.
 block [447] : 100 snps  (0ms) 
   -residualizing and scaling genotypes...done (0ms) 
   -calc working matrices...done (1ms) 
   -calc level 0 ridge...done (22ms) 
 block [448] : 4 snps  (0ms) 
   -residualizing and scaling genotypes...done (0ms) 
   -calc working matrices...done (0ms) 
   -calc level 0 ridge...done (19ms) 

 Level 1 ridge...
   -on phenotype 1 (ENSG00000223972.4)...done (63483ms) 

Output
------
phenotype 1 (ENSG00000223972.4) : 
  0.01  : Rsq = 0.00329685, MSE = 0.995305<- min value
  0.25  : Rsq = 0.000264437, MSE = 1.00116
  0.5   : Rsq = 0.00140678, MSE = 1.01629
  0.75  : Rsq = 0.00622035, MSE = 1.03245
  0.99  : Rsq = 0.0118419, MSE = 1.04832
  * making predictions...writing LOCO predictions...done (51ms) 

List of blup files written to: [GTEx_S1_bin_out_pred.list] 

Elapsed time : 73.7608s
End time: Tue Sep 29 12:58:48 2020

Cheers, Joelle

evatosco commented 12 months ago

Hi! Same problem here. But I have my own cohort, so I can't share the whole thing for privacy reasons. I can tell you that I have an entire exome of approx 820 samples, and I'm still in step 1.

I tried this:

plink2 \
  --bfile ${input_vcf_basemodel}_pheno_sex_covars_recodID \
  --mac 100 \
  --write-snplist allow-dups \
  --out mac_snp_pass

to then use --extract mac_snp_pass.snplist for Step1. However, the mentioned error shows up: -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP 1:16862454:T:G has low variance (=0.000000). When I deleted that specific SNP to see if it was a one-time thing, it showed another one: -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP 1:16862461:T:A has low variance (=0.000000).

I saw this specific issue when investigating, so I created the count table and I searched for the SNP:

CHROM ID REF ALT PROVISIONAL_REF? ALT_CTS OBS_CT

1 1:16862454:T:G T G Y 811 1622 1 1:16862461:T:A T A Y 811 1622

Exactly the same. "Low variance" with AC-alt=811? Since they are contiguous, I tried once again deleting these SNPs from the extract file, but the error switches to the next variant similar to those, so it's definitely not a one-time thing.

So I tried the workaround @joellembatchou showed here (slightly different, I'm using --extract, not --exclude):

plink2 --bfile ${input_vcf_basemodel}_pheno_sex_covars_recodID \
--out search_test \
--freq 'counts'
cat search_test.acount | awk -F'\t' '{ if($5!=0) {print $2}}' > pass_snps_counts.snplist
cat pass_snps_counts.snplist >> mac_snp_pass.snplist

Now the error stays, but the variant is different: -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP 1:880040:C:T has low variance (=0.000000). Check-up in count table:

1 1:880040:C:T C T Y 1 1596

That variant does not pass the plink --mac filter, but it is added as a pass variant when I used the awk. I don't know what I'm doing wrong... I have no idea about what is happening. I tried a few filters to see if I could make regenie work somehow, and this formula I tried ended up working:

plink2 --bfile ${input_vcf_basemodel}_pheno_sex_covars_recodID \
--out search_test \
--freq 'counts'
cat search_test.acount | awk -F'\t' '{if($6 > 2 && $6 <= 800 ) {print $2}}' > mac_snp_pass.snplist

I am kind of burned out here, so I ended up making it work but I seek some professional and logical reasoning behind this, so I can look for a way to include as much variants as possible in step 1.

Any suggestions are welcome. Maybe I should use --extract mac_snp_pass.snplist from --mac 100 and then use --exclude with a file including IDs with AC > 800?

Thanks in advance (sorry for the long message).

Best, Eva

joellembatchou commented 12 months ago

Hi Eva,

What covariates are you passing in your run? The variance is computed on the genotype residuals so I am assuming these variants are highly correlated with some subset of the covariates (note that as a sanity check, if you run REGENIE with no covariates you should not be getting errors as the variants are not monomorphic).

Cheers, Joelle

evatosco commented 12 months ago

Hi Joelle! Thank you so much for answering so fast.

Here's the code I have been using to test the 1st step:

regenie \
  --step 1 \
  --bed ${input_vcf_basemodel}_pheno_sex_covars_recodID \
  --bt -1 --write-samples --no-split --verbose  \
  --bsize 1000 \
  --extract ${input_dir}/mac_snp_pass.snplist \
  --lowmem-prefix tmp_rg \
  --check-burden-files \
  --phenoFile ${input_dir}/VALID_phenofile_spaces.txt \
  --phenoCol ARDS \
  --out "${input_vcf_basemodel}"_regenieS1

The plink BED (+BIM/FAM) files contain the covariates but I do not indicate a single covariate in this code for now. I have planned to do it in the future but I'm not there yet.

Any suggestions are completely welcome, as usual!

joellembatchou commented 12 months ago

I see, when running the PLINK command to filter by MAC, can you pass the list of samples with non-missing phenotypes that will be analyzed in REGENIE for phenoCol "ARDS" (using --keep)? I am guessing once you subset to these samples, the variants that gave errors in REGENIE have MAC 0.

evatosco commented 11 months ago

Hi again! Sorry for the delay.

So I tried a few things: I turned everything back to only creating a mac pass list with mac > 100 with plink and then adding --keep to regenie step 1: it still happens with the same SNP, even though it captured the total N of non-missing phenotype values (822):

   -dropping observations with missing values at any of the phenotypes
   -number of phenotyped individuals with no missing data = 822
 * number of individuals used in analysis = 822
 * case-control counts for each trait:
   - 'ARDS': 272 cases and 550 controls
   -fitting null logistic regression on binary phenotypes...done (0ms) 
   -residualizing and scaling phenotypes...done (0ms) 
   -WARNING: Sample size is less than 5,000 so using LOOCV instead of 5-fold CV.
 * # threads        : [19]
 * block size       : [1000]
 * # blocks         : [97] for 85982 variants
 * # CV folds       : [822]
 * ridge data_l0    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * ridge data_l1    : [5 : 0.01 0.25 0.5 0.75 0.99 ]
 * approximate memory usage : 980MB
 * writing level 0 predictions to disk
   -temporary files will have prefix [tmp_rg_l0_Y]
   -approximate disk space needed : 4MB
 * setting memory...done

Chromosome 1
 block [1] : 1000 snps  (8ms) 
   -residualizing and scaling genotypes...done (26ms) 
   -calc working matrices...done (212ms) 
   -calc level 0 ridge...1 chunks...done (11ms) 
 block [2] : 1000 snps  (6ms) 
   -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP 1:16862454:T:G has low variance (=0.000000).

So I did as you said and I did the exact same as before but also adding the --keep to the plink mac pass command, then the same thing happened: same output as before, exactly in the same SNP.

Next, I tried the plink mac pass command (with --keep) and then created the .acount file, to do cat search_test.acount | awk -F'\t' '{ if($6!=0) {print $2}}' >> mac_snp_pass.snplist. Then I ran the step 1:

Chromosome 1
 block [1] : 1000 snps  (7ms) 
   -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP 1:880040:C:T has low variance (=0.000000).

So... apparently that did not seem to change a lot. I wish I knew what's happening. I really appreciate your help on this! I'm totally lost.

Best regards, Eva

joellembatchou commented 11 months ago

Hi Eva,

Can you include the PLINK2 command you are using with --keep as well as the log? And can you include the full REGENIE log as well?

Also, for that variant "1:880040:C:T", can you compute allele count in PLINK as in --extract <(echo 1:880040:C:T) --freq counts and show what the count look like from PLINK command? (still using --keep)

Isabella678 commented 3 months ago

Hi, i have the same problem. I want to perform rare variants (MAF < 1%) analysis in Regenie, and have tried many times to follow your earlier advice to others, but still encounter the same error: Uh-oh, SNP chr1_1337985_T_C;chr1_1337983_TCTG_T;chr1_1337985_T_G has low variance (=0.000000).
I can show your my code: plink2--pgen "$PGEN_FILE" \ --pvar "$PVAR_FILE" \ --psam "$PSAM_FILE" \ --maf 0.01 \ --mac 100 \ --write-snplist allow-dups \ --out out_step1 plink2 --pgen "$PGEN_FILE" \ --pvar "$PVAR_FILE" \ --psam "$PSAM_FILE" \ --max-maf 0.01 \ --mac 100 \ --write-snplist allow-dups \ --out out_step2 And then, i run the regenie: regenie --pgen "$PGEN_FILE" \ --phenoFile "$PHENO_FILE" \ --extract "$SNPLIST" \ --out "$OUTPUT_PREFIX" \ --threads 10 \ --step 1 \ --lowmem \ --lowmem-prefix lwmem_tmp \ --bsize 100 \ --pThresh 0.05 \ --force-step1 \ --force-qt but at step1, i still got error with the same problem...i am very frustrated... 1722323160626 Could you please gave me some advices? Thanks! Isabella

joellembatchou commented 1 month ago

Hi Isabella,

Please use --keep when running the PLINK command to ensure the same set of samples analyzed in REGENIE are used to compute MAC/MAF. If the error still occurs, it could be due to one of the covariates being highly correlated with the variant (you could check with the first variant that gives you the error).

Cheers, Joelle