Open SoCadiou opened 3 weeks ago
Hi, Thank you for reporting this. There is indeed an issue in fastGWA when dealing with very rare SNPs (MAC == 1 or 0). Other users have reported it as well: the "N" reported for these ultra-rare SNPs is wrong. We will have it fixed.
However, this should not lead to wrong MAFs. When running fastGWA, the genotype rate/MAF is calculated on the subset of individuals that remains after removing those with missing phenotypes/covariates, so the actual sample size here will probably not be 9217. I am guessing this is the reason. Can you paste your log file here for us to look into?
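To illustrate the point about the analysed subset, here is a minimal Python sketch. It is not GCTA's implementation: the function name `subset_stats`, the 0/1/2 genotype coding, and the toy data are assumptions made for this example. It shows how a variant with MAC = 1 in the full sample can drop to MAC = 0 once the only carrier is excluded.

```python
from typing import Optional, Sequence


def subset_stats(genotypes: Sequence[Optional[int]], keep: Sequence[bool]) -> dict:
    """Recompute N, MAC, MAF and missing rate on the kept individuals only.

    genotypes: alternate-allele counts (0/1/2), None for a missing call.
    keep: True for individuals retained after phenotype/covariate filtering.
    """
    kept = [g for g, k in zip(genotypes, keep) if k]
    called = [g for g in kept if g is not None]
    if not called:
        return {"n": 0, "mac": 0, "maf": 0.0, "missing_rate": 1.0}
    alt = sum(called)
    total = 2 * len(called)        # two alleles per diploid individual
    mac = min(alt, total - alt)    # minor allele count
    return {
        "n": len(called),
        "mac": mac,
        "maf": mac / total,
        "missing_rate": 1 - len(called) / len(kept),
    }


# Toy data (hypothetical): one heterozygous carrier, one missing call.
genotypes = [0, 1, 0, None, 0, 0]
keep_all = [True] * 6
keep_subset = [True, False, True, True, True, True]  # the carrier is dropped

full = subset_stats(genotypes, keep_all)
sub = subset_stats(genotypes, keep_subset)
print(full)
print(sub)
```

In the toy data, the variant would pass a MAF filter when computed on everyone, but becomes monomorphic in the analysed subset.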
Hi,
Thank you very much for your reply and for the explanation about the behaviour for super rare variants.
Actually, our total sample size is 9219, and we know that 2 individuals have one missing covariate, so the actual sample size is 9217 (I think this appears in the log).
Here is our log:
Analysis started at 14:05:27 CEST on Wed Apr 03 2024.
Hostname: cnode01
Options:
--pfile /center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS
--maf 0.0001
--geno 0.1
--grm-sparse /exchange/healthds/pQTL/BELIEVE/Genetic_QC_files/GRM/newsparsegrm_0.025
--fastGWA-mlm
--pheno /localscratch/10037223.solene.cadiou/phen_seq.10000.28_res
--thread-num 10
--out /exchange/healthds/pQTL/results/BELIEVE/gcta/geno_assoc_phen_GCTAdefaultfilterseq.10000.28_res
The program will be running with up to 10 threads.
Reading PLINK sample file from [/center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS.psam]...
9219 individuals to be included from the sample file.
Reading phenotype data from [/localscratch/10037223.solene.cadiou/phen_seq.10000.28_res]...
9217 overlapping individuals with non-missing data to be included from the phenotype file.
9217 individuals to be included. 0 males, 0 females, 9217 unknown.
Reading PLINK2 PVAR file from [/center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS.pvar]...
16084709 SNPs to be included from PVAR file(s).
Threshold to filter variants: MAF > 0.000100, missingness rate < 0.100000.
Reading the sparse GRM file from [/exchange/healthds/pQTL/BELIEVE/Genetic_QC_files/GRM/newsparsegrm_0.025]...
After matching all the files, 9217 individuals to be included in the analysis.
Estimating the genetic variance (Vg) by fastGWA-REML (grid search)...
Iteration 1, step size: 0.0157227, logL: -4530.81. Vg: 0.0943365, searching range: 0.0786137 to 0.110059
Iteration 2, step size: 0.00209637, logL: -4530.81. Vg: 0.0890956, searching range: 0.0869992 to 0.0911919
Iteration 3, step size: 0.000279515, logL: -4530.81. Vg: 0.0900739, searching range: 0.0897944 to 0.0903534
Iteration 4, step size: 3.72687e-05, logL: -4530.81. Vg: 0.0899807, searching range: 0.0899434 to 0.090018
Iteration 5, step size: 4.96916e-06, logL: -4530.81. Vg: 0.0899732, searching range: 0.0899683 to 0.0899782
Iteration 6, step size: 6.62555e-07, logL: -4530.81. Vg: 0.0899742, searching range: 0.0899736 to 0.0899749
Iteration 7, step size: 8.83407e-08, logL: -4530.81. Vg: 0.0899738, searching range: 0.0899738 to 0.0899739
Iteration 8, step size: 1.17788e-08, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
Iteration 9, step size: 1.5705e-09, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
Iteration 10, step size: 2.094e-10, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
Iteration 11, step size: 2.792e-11, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
Iteration 12, step size: 3.72267e-12, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
Iteration 13, step size: 4.96356e-13, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739
fastGWA-REML converged.
logL: -4530.81
Sampling variance/covariance of the estimates of Vg and Ve:
0.00348121 -0.00345104
-0.00345104 0.0036305
Source Variance SE
Vg 0.0899739 0.0590018
Ve 0.892698 0.0602537
Vp 0.982672
Heritability = 0.0915605 (Pval = 0.127275)
fastGWA-REML runtime: 0.269095 sec.
Warning: the estimate of Vg is not statistically significant (i.e., p > 0.05). This is likely because the number of closely related individuals in the sample is not large enough.
In this case, the program will use linear regression for association test.
Performing fastGWA linear regression analysis...
fastGWA results will be saved in text format to [/exchange/healthds/pQTL/results/BELIEVE/gcta/geno_assoc_phen_GCTAdefaultfilterseq.10000.28_res.fastGWA].
29.0% Estimated time remaining 12.3 min
59.1% Estimated time remaining 6.9 min
89.0% Estimated time remaining 1.9 min
100% finished in 1019.1 sec
16084709 SNPs have been processed.
Saved 16082749 SNPs.
Analysis finished at 14:22:45 CEST on Wed Apr 03 2024
Overall computational time: 17 minutes 18 sec.
Thanks again for your help!
Hi, Thank you for sharing the log file. The log seems okay without obvious issues/bugs. May we ask you to check/do a few more things? This will help us better identify where the incorrect AF problem came from.
As a side note, since you are performing GWAS for rare/ultra-rare variants, you may want to consider gene-based tests such as GCTA-fastGWA-BB (https://yanglab.westlake.edu.cn/software/gcta/#fastGWA-BB), SAIGE-GENE, or REGENIE's gene-based tests.
You also mentioned that you filtered the SNPs using Plink with --geno 0.1, --mind 0.1 and --mac 20.
However, in the log file you shared, the MAF flag was --maf 0.0001, which is equivalent to MAC = MAF * sample_size * 2 = 0.0001 * 9217 * 2 = ~2. So if we want GCTA to use approximately the same MAC >= 20 cutoff, we need to use --maf 0.001 instead.
Alternatively, you can provide GCTA with a SNP list using the --extract filtered_snp_list.txt flag. The SNP list can be the one you obtained from your previous Plink filtering step.
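The MAC/MAF conversion used above can be checked with a few lines of Python. This is only a sketch: the helper names `maf_to_mac` and `mac_to_maf` are made up for this example, and it assumes diploid genotypes with no missing calls, so the allele count is 2 × N.

```python
def maf_to_mac(maf: float, n_samples: int) -> float:
    """Minor allele count implied by a MAF in n_samples diploid individuals."""
    return maf * 2 * n_samples


def mac_to_maf(mac: float, n_samples: int) -> float:
    """MAF implied by a minor allele count in n_samples diploid individuals."""
    return mac / (2 * n_samples)


N = 9217  # sample size reported in the log above

print(f"--maf 0.0001 -> MAC of about {maf_to_mac(0.0001, N):.2f}")
print(f"--maf 0.001  -> MAC of about {maf_to_mac(0.001, N):.2f}")
print(f"MAC 20       -> MAF of about {mac_to_maf(20, N):.6f}")
```

With N = 9217, --maf 0.001 corresponds to a MAC of roughly 18 rather than exactly 20, so the two filters will not keep exactly the same SNPs. Passing the Plink-filtered SNP list through --extract avoids that mismatch.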
Hi developers,
While performing GWAS using fastGWA within GCTA on our pgen genetic data, some SNPs are removed (1960 out of 16084709, with sample size = 9217). When we force GCTA not to filter them using the --nofilter flag, all SNPs are kept. Looking at the "--nofilter" results for the SNPs that were removed by default, we can see that they were removed by the MAF filter, and the MAF seems to be computed wrongly. Here are some examples:
CHR SNP POS A1 A2 N AF1 BETA SE P