Wrong filtering due to internal allele frequencies computation when using fastGWA

SoCadiou commented 3 weeks ago

Hi developers,

While performing GWAS using fastGWA within GCTA on our pgen genetic data, some snps are removed (1960 out of 16084709, with sample size=9217). While forcing GCTA not to filter them using the -nofilter flag, all snps are kept. Looking at the “-nofilter” results for the snps which were removed by default, we can see that they were removed because of MAF filtering, which seems to be computed wrongly. Here are some examples

CHR SNP POS A1 A2 N AF1 BETA SE P

1: 1 chr1:729962:G:C 729962 C G 0 0.00000e+00 NA NA NA 2: 1 chr1:818732:G:A 818732 A G 3486006 2.02698e-41 NA NA NA 3: 1 chr1:856373:C:T 856373 T C 9217 5.42476e-05 0.14892 0.991404 0.880598 4: 1 chr1:861785:G:A 861785 A G 959852855 0.00000e+00 NA NA NA In example 1 (row1) , sample size is wrong (0 instead of 9217 ) and indeed AF is wrong In example 2 (row2) , sample size is wrong (3486006 instead of 9217 ) and indeed AF is wrong In example 3 (row 3), sample size is right but AF is still wrong. Our pgen file has been QCed using among others filters -geno 0.1 -mind 0.1 and -mac 20 in PLINK. We double-checked and for the snps shown above the allele frequencies were recomputed through plink and they are indeed above 0.0001 which is the default filtering threshold in GCTA according to the documentation, so in theory should not be removed. Do you know what could be the cause of this strange computation of allele frequencies and how to solve this? Many thanks!

longmanz commented 3 weeks ago

Hi, Thank you for reporting this. There is indeed an issue in fastGWA when dealing with very rare SNPs (MAC == 1 or 0). This has been reported by other users as well, where the "N" reported for these ultra rare SNPs are wrong. We will have it fixed.

However, this should be lead to wrong MAFs. When running fastGWA, the genotype rate/MAF will be calculated based on the subset of individuals after removing those with missing phenotypes/covariates, so the actual sample size here will probably not be 9217. I am guessing this is the reason here. Can you paste your log file here for us to look into?

SoCadiou commented 2 weeks ago

Hi,

Thank you very much for your reply and for the explanation about the behaviour for super rare variants.

Actually, our total sample size is 9219 and we know that 2 individuals have one missing covariate, making the actual sample size being 9217 (I think that appears in the log.)

Here is our log:

Genome-wide Complex Trait Analysis (GCTA)
version v1.94.1 Linux
Built at Nov 15 2022 21:14:25, by GCC 8.5
(C) 2010-present, Yang Lab, Westlake University
Please report bugs to Jian Yang jian.yang@westlake.edu.cn

Analysis started at 14:05:27 CEST on Wed Apr 03 2024. Hostname: cnode01

Options:

--pfile /center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS --maf 0.0001 --geno 0.1 --grm-sparse /exchange/healthds/pQTL/BELIEVE/Genetic_QC_files/GRM/newsparsegrm_0.025 --fastGWA-mlm --pheno /localscratch/10037223.solene.cadiou/phen_seq.10000.28_res --thread-num 10 --out /exchange/healthds/pQTL/results/BELIEVE/gcta/geno_assoc_phen_GCTAdefaultfilterseq.10000.28_res

The program will be running with up to 10 threads. Reading PLINK sample file from [/center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS.psam]... 9219 individuals to be included from the sample file. Reading phenotype data from [/localscratch/10037223.solene.cadiou/phen_seq.10000.28_res]... 9217 overlapping individuals with non-missing data to be included from the phenotype file. 9217 individuals to be included. 0 males, 0 females, 9217 unknown. Reading PLINK2 PVAR file from [/center/healthds/pQTL/BELIEVE/Genetic_QC_files/HDS_BELIEVE_final_HDS.pvar]... 16084709 SNPs to be included from PVAR file(s). Threshold to filter variants: MAF > 0.000100, missingness rate < 0.100000. Reading the sparse GRM file from [/exchange/healthds/pQTL/BELIEVE/Genetic_QC_files/GRM/newsparsegrm_0.025]... After matching all the files, 9217 individuals to be included in the analysis. Estimating the genetic variance (Vg) by fastGWA-REML (grid search)... Iteration 1, step size: 0.0157227, logL: -4530.81. Vg: 0.0943365, searching range: 0.0786137 to 0.110059 Iteration 2, step size: 0.00209637, logL: -4530.81. Vg: 0.0890956, searching range: 0.0869992 to 0.0911919 Iteration 3, step size: 0.000279515, logL: -4530.81. Vg: 0.0900739, searching range: 0.0897944 to 0.0903534 Iteration 4, step size: 3.72687e-05, logL: -4530.81. Vg: 0.0899807, searching range: 0.0899434 to 0.090018 Iteration 5, step size: 4.96916e-06, logL: -4530.81. Vg: 0.0899732, searching range: 0.0899683 to 0.0899782 Iteration 6, step size: 6.62555e-07, logL: -4530.81. Vg: 0.0899742, searching range: 0.0899736 to 0.0899749 Iteration 7, step size: 8.83407e-08, logL: -4530.81. Vg: 0.0899738, searching range: 0.0899738 to 0.0899739 Iteration 8, step size: 1.17788e-08, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 Iteration 9, step size: 1.5705e-09, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 Iteration 10, step size: 2.094e-10, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 Iteration 11, step size: 2.792e-11, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 Iteration 12, step size: 3.72267e-12, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 Iteration 13, step size: 4.96356e-13, logL: -4530.81. Vg: 0.0899739, searching range: 0.0899739 to 0.0899739 fastGWA-REML converged. logL: -4530.81 Sampling variance/covariance of the estimates of Vg and Ve: 0.00348121 -0.00345104 -0.00345104 0.0036305

Source Variance SE Vg 0.0899739 0.0590018 Ve 0.892698 0.0602537 Vp 0.982672

Heritability = 0.0915605 (Pval = 0.127275) fastGWA-REML runtime: 0.269095 sec. Warning: the estimate of Vg is not statistically significant (i.e., p > 0.05). This is likely because the number of closely related individuals in the sample is not large enough. In this case, the program will use linear regression for association test.

Performing fastGWA linear regression analysis... fastGWA results will be saved in text format to [/exchange/healthds/pQTL/results/BELIEVE/gcta/geno_assoc_phen_GCTAdefaultfilterseq.10000.28_res.fastGWA]. 29.0% Estimated time remaining 12.3 min 59.1% Estimated time remaining 6.9 min 89.0% Estimated time remaining 1.9 min 100% finished in 1019.1 sec 16084709 SNPs have been processed. Saved 16082749 SNPs.

Analysis finished at 14:22:45 CEST on Wed Apr 03 2024 Overall computational time: 17 minutes 18 sec.

Thanks again for your help!

longmanz commented 2 weeks ago

Hi, Thank you for sharing the log file. The log seems okay without obvious issues/bugs. May we ask you to check/do a few more things? This will help us better identify where the incorrect AF problem came from.

You mentioned that you have also calculated the AF for the SNPs using Plink. What are the AF reported by Plink for the 4 SNPs you listed in your first thread? Could you share them with us?

As a side note, since you are performing GWAS for rare/ultra-rare variants, you may consider running gene-based tests like GCTA-fastGWA-BB (https://yanglab.westlake.edu.cn/software/gcta/#fastGWA-BB), SAIGE-Gene, Regenie's gene-based test, etc..

You also mentioned that you filtered the SNPs using Plink with -geno 0.1 -mind 0.1 and -mac 20. However, in the log file you shared, the flag for MAF was MAF >= 0.0001, which is equivalent to MAC = MAFsample_size2 = 0.000192172 = ~2. So if we want GCTA to use the same MAC >= 20 cutoff, we need to use --maf 0.001 instead.
Alternatively, you can provide GCTA with a SNP list with the --extract filtered_snp_list.txt flag. The SNP list can be the one you obtained from your previous Plink filtering step.

jianyangqt / gcta

Wrong filtering due to internal allele frequencies computation when using fastGWA #83