jianyangqt / gcta

GCTA software
GNU General Public License v3.0
73 stars, 23 forks

Runtimes are grossly out of line with the article #64

Closed labxpub closed 6 months ago

labxpub commented 6 months ago

Hello there, I am running a GWAS with GCTA fastGWA (--fastGWA-mlm), and my log is below. My sample size is just over 300,000 and there are only 1,037,533 SNPs, but the job has been running for two hours and is still stuck at the last log line shown below, while memory and CPU usage stay high. This is nowhere near the runtime described in your article. I don't know what I am doing wrong. Could you help?

Options:

--bfile chr2_QC --grm-sparse /ukb/qqpcmr_system/gwas/chr2_sp_grm --fastGWA-mlm --pheno /ukb/qqpcmr_system/gwas/calcium.txt --thread-num 72 --out /ukb/qqpcmr_system/gwas_output/geno_calcium_chr2

The program will be running with up to 72 threads.
Reading PLINK FAM file from [chr2_QC.fam]...
337056 individuals to be included from FAM file.
Reading phenotype data from [/ukb/qqpcmr_system/gwas/calcium.txt]...
337056 overlapping individuals with non-missing data to be included from the phenotype file.
337056 individuals to be included. 156073 males, 180983 females, 0 unknown.
Reading PLINK BIM file from [chr2_QC.bim]...
1037533 SNPs to be included from BIM file(s).
Reading the sparse GRM file from [/ukb/qqpcmr_system/gwas/chr2_sp_grm]...
After matching all the files, 337056 individuals to be included in the analysis.
Estimating the genetic variance (Vg) by fastGWA-REML (grid search)...

longmanz commented 6 months ago

You need to make sure the sparse GRM was calculated correctly. Check the difference in row counts between your .grm.sp and .grm.id files; it indicates the number of highly related pairs in your dataset. If this number is several times larger than your sample size, something is usually wrong with the sparse GRM. For UKB (N = ~400k), the number is ~150k to 200k.
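A minimal sketch of that row-count check, in Python. It assumes the two files share a common prefix (the prefix passed in is hypothetical) and that each non-empty line is one record. The function name `sparse_grm_summary` is made up for illustration and is not part of GCTA.

```python
def count_records(path):
    """Count non-empty lines in a text file."""
    with open(path, "r") as fh:
        return sum(1 for line in fh if line.strip())


def sparse_grm_summary(prefix):
    """Compare row counts of <prefix>.grm.sp and <prefix>.grm.id.

    Following the advice above, the difference between the two row
    counts is taken as the number of highly related pairs.
    """
    n_samples = count_records(prefix + ".grm.id")
    n_sp_rows = count_records(prefix + ".grm.sp")
    n_pairs = n_sp_rows - n_samples
    ratio = n_pairs / n_samples if n_samples else float("nan")
    return {
        "samples": n_samples,
        "sp_rows": n_sp_rows,
        "related_pairs": n_pairs,
        "pairs_per_sample": ratio,
    }
```

Calling `sparse_grm_summary("/path/to/your_sp_grm")` returns the counts. By the rule of thumb above, a `pairs_per_sample` value of several would suggest a problem with the sparse GRM, while the UKB figures quoted correspond to roughly 0.4 to 0.5.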

When calculating the sparse GRM, make sure you use only SNPs with MAF >= 0.01 that have passed the other standard QC steps. In addition, make sure the individuals are from the same ancestry background. For example, you can use the information in field 22006 to select Caucasians, or you can predict genetic ancestry yourself using 1000 Genomes data. Do not use the self-reported ancestry information from UKB, since it is inaccurate.
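As a rough illustration of the field 22006 suggestion, the sketch below writes a two-column ID list from a UKB phenotype export. Everything about the file layout is an assumption: a CSV with an `eid` column and a `22006-0.0` column, the value `1` marking the genetically Caucasian group, and FID equal to IID in the genotype files. Check these against your own data before relying on it.

```python
import csv


def write_keep_file(phenotype_csv, keep_path,
                    id_col="eid", field_col="22006-0.0", keep_value="1"):
    """Write an 'FID IID' list of individuals whose field 22006 matches.

    Column names and the coding value are assumptions about the export
    format, not something GCTA defines. FID is set equal to IID here.
    """
    kept = 0
    with open(phenotype_csv, newline="") as src, open(keep_path, "w") as dst:
        for row in csv.DictReader(src):
            value = (row.get(field_col) or "").strip()
            if value == keep_value:
                dst.write(f"{row[id_col]}\t{row[id_col]}\n")
                kept += 1
    return kept
```

The resulting file could then be supplied through GCTA's `--keep` option when building the sparse GRM and when running the association step.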

labxpub commented 6 months ago

Hi, I tried to test this as you suggested, but because the sample is so large, calculating the GRM for all samples takes a long time, so for now I have calculated the GRM using only 1,000 samples. The resulting .grm.sp file has 1,178 lines when I use 0.1% as the MAF threshold in QC, and 1,319 lines when I use your recommended 1% MAF threshold. This seems somewhat inconsistent with what you describe. Also, with 1,000 samples fastGWA runs fine even with the 0.1% MAF threshold, but it does not work with 300,000 samples. I am not sure which threshold to use going forward and would appreciate your opinion.