jianyangqt / gcta

GCTA software
GNU General Public License v3.0

GRM matrix has too many rows for fastGWA-GLMM to run. #67

Closed: labxpub closed this issue 6 months ago

labxpub commented 6 months ago

Hello, I'm using your fastGWA-GLMM. It's a great tool and helps me a lot, but I'm currently running into some problems:

My sample size is just over 300,000 and the number of SNPs is only 1,037,533, but the run has been stuck at the screen below for two hours while my memory and CPUs stay busy. This is nothing like the runtime described in your article. I don't know what I'm doing wrong. Could you help?

Options:

--bfile chr2_QC
--grm-sparse /ukb/qqpcmr_system/gwas/chr2_sp_grm
--fastGWA-mlm
--pheno /ukb/qqpcmr_system/gwas/calcium.txt
--thread-num 72
--out /ukb/qqpcmr_system/gwas_output/geno_calcium_chr2

The program will be running with up to 72 threads.
Reading PLINK FAM file from [chr2_QC.fam]...
337056 individuals to be included from FAM file.
Reading phenotype data from [/ukb/qqpcmr_system/gwas/calcium.txt]...
337056 overlapping individuals with non-missing data to be included from the phenotype file.
337056 individuals to be included. 156073 males, 180983 females, 0 unknown.
Reading PLINK BIM file from [chr2_QC.bim]...
1037533 SNPs to be included from BIM file(s).
Reading the sparse GRM file from [/ukb/qqpcmr_system/gwas/chr2_sp_grm]...
After matching all the files, 337056 individuals to be included in the analysis.
Estimating the genetic variance (Vg) by fastGWA-REML (grid search)...

I checked and found that the GRM has too many rows (about 10 million), i.e. too many related pairs of individuals. However, I strictly selected the white (Caucasian) individuals using UKB field 22006 and removed the individuals flagged as "Ten or more third-degree relatives identified" (excess relatives) in UKB field 22021, and the problem still arises. In your original fastGWA article, about 450,000 individuals seem to have been used directly without trouble, giving more than 200,000 individuals plus about 170,000 GRM rows. In theory our individuals should be a subset of yours, so why is there such a big difference? I hope you can give us some advice.
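
The row count described above can be checked with a short script. The following is a minimal sketch, assuming the sparse GRM is stored as a plain-text file with three whitespace-separated columns per row (index of individual 1, index of individual 2, relatedness value) and that diagonal entries are included. The function name `summarize_sparse_grm` is hypothetical and not part of GCTA.

```python
from typing import Dict, Iterable, Optional


def summarize_sparse_grm(lines: Iterable[str]) -> Dict[str, Optional[float]]:
    """Count diagonal and off-diagonal rows in a sparse GRM.

    Assumes each row is 'i j value' (an assumption about the file layout).
    Rows that do not have exactly three fields are skipped.
    """
    n_diag = 0
    n_offdiag = 0
    min_offdiag = None
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        i, j, value = int(fields[0]), int(fields[1]), float(fields[2])
        if i == j:
            n_diag += 1
        else:
            n_offdiag += 1
            if min_offdiag is None or value < min_offdiag:
                min_offdiag = value
    return {
        "diagonal": n_diag,
        "off_diagonal": n_offdiag,
        "min_off_diagonal": min_offdiag,
    }


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle instead.
    example = ["0 0 1.01", "1 0 0.26", "1 1 0.99", "2 2 1.00"]
    print(summarize_sparse_grm(example))
```

If the smallest off-diagonal value is far below the intended cutoff, the threshold was probably not applied when the sparse GRM was built. If it is above the cutoff and the off-diagonal count is still in the millions, the SNP set used for the GRM is the more likely cause.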

labxpub commented 6 months ago

To add to this, all of the steps above were computed on a single chromosome, and I then ran fastGWA-GLMM on that single chromosome, which kept failing to produce results. Maybe that introduces bias? I have now switched to using all 22 chromosomes, but it is unusually slow: the job is divided into 100 parts, and a single part takes 300 minutes to run with 32 threads. Is there anything I can do to speed it up other than adding more threads?

longmanz commented 6 months ago

Hi, getting 10 million rows in a sparse GRM for the UKB does not seem right. Did you use 0.05 as the relationship-coefficient threshold when building the sparse GRM?
Can you also check that you applied the same QC criteria to your SNPs when calculating the GRM? SNPs should have MAF >= 0.01, genotype missing rate <= 0.1, and HWE p-value >= 1e-6. The first two criteria are the most important, because SNPs that fail them will inflate the estimated relationship coefficients. You do not need to remove any relatives based on field 22021, since fastGWA is designed to handle relatedness.
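
The three QC criteria above can be written as a simple filter. This is only an illustrative sketch: it assumes per-SNP summary statistics (MAF, missing rate, HWE p-value) have already been computed by some other tool, and the function and field names are hypothetical.

```python
from typing import Dict, List


def passes_grm_qc(
    maf: float,
    missing_rate: float,
    hwe_p: float,
    min_maf: float = 0.01,
    max_missing: float = 0.1,
    min_hwe_p: float = 1e-6,
) -> bool:
    """Return True if a SNP meets the thresholds quoted in the comment above."""
    return maf >= min_maf and missing_rate <= max_missing and hwe_p >= min_hwe_p


def filter_snps(stats: List[Dict[str, float]]) -> List[str]:
    """Keep the IDs of SNPs that pass QC.

    Each record is assumed to hold the keys 'id', 'maf', 'missing' and 'hwe_p'
    (hypothetical names chosen for this example).
    """
    return [
        rec["id"]
        for rec in stats
        if passes_grm_qc(rec["maf"], rec["missing"], rec["hwe_p"])
    ]


if __name__ == "__main__":
    snps = [
        {"id": "rs1", "maf": 0.20, "missing": 0.01, "hwe_p": 0.5},
        {"id": "rs2", "maf": 0.005, "missing": 0.01, "hwe_p": 0.5},  # fails MAF
        {"id": "rs3", "maf": 0.20, "missing": 0.30, "hwe_p": 0.5},   # fails missingness
    ]
    print(filter_snps(snps))
```

In practice the same thresholds are usually applied with the genotype tool's own filtering options rather than a separate script.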

labxpub commented 6 months ago

Thanks for the answer. I am currently filtering for MAF ≥ 0.01, genotype missing rate ≤ 0.05 and HWE p-value ≥ 1e-6, and I am sure my threshold is set to 0.05. I would also like to ask whether I can compute the GRM from the SNPs on a single chromosome and use that matrix to run a GWAS on the same chromosome. That is what I am currently doing, and the resulting GRM is huge.

The reason for doing it this way is that computing the GRM from all SNPs (we have eight million SNPs), split into 100 parts with 72 threads per part, takes 200 minutes, which seems very different from what is reported in the original paper.

I hope you can give me some suggestions.

labxpub commented 6 months ago

I would also like to ask whether I can compute a GRM for each chromosome separately and then combine the per-chromosome GRMs into one. I have not found a similar procedure so far. If you have a tutorial on the standard procedure, I hope you can share it. Thank you very much!
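
For context, combining per-chromosome GRMs is arithmetically a SNP-count-weighted average of the per-chromosome relatedness values for each pair. The sketch below illustrates that arithmetic only. It is not a tutorial from the GCTA authors, the data layout and function name are hypothetical, and it applies to full (unthresholded) GRMs, since pairs dropped from a sparse per-chromosome GRM cannot be recovered. GCTA's documentation describes an `--mgrm` option for working with multiple GRMs, which is the place to look for a supported workflow.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]
Entry = Tuple[float, int]  # (relatedness value, number of SNPs used)


def merge_grm_entries(per_chrom: List[Dict[Pair, Entry]]) -> Dict[Pair, Entry]:
    """Merge per-chromosome GRM entries by weighting each value by its SNP count."""
    totals: Dict[Pair, Tuple[float, int]] = {}
    for grm in per_chrom:
        for pair, (value, n_snps) in grm.items():
            weighted_sum, n_total = totals.get(pair, (0.0, 0))
            totals[pair] = (weighted_sum + value * n_snps, n_total + n_snps)
    return {
        pair: (weighted_sum / n_total, n_total)
        for pair, (weighted_sum, n_total) in totals.items()
        if n_total > 0
    }


if __name__ == "__main__":
    chr_a = {(1, 0): (0.5, 100)}
    chr_b = {(1, 0): (0.0, 300)}
    print(merge_grm_entries([chr_a, chr_b]))
```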

longmanz commented 6 months ago

Hi, in the paper we actually restricted the SNPs to slightly pruned HapMap 3 SNPs (with MAF >= 0.01, genotype missing rate <= 0.1, and HWE p-value >= 1e-6). The HapMap3 CEU SNP list contains ~1m common SNPs. In our paper, we additionally pruned these SNPs with r2 = 0.9, which left ~600k SNPs for the final GRM calculation. This is far fewer than the 8m SNPs you are using. I have attached the SNP list that we used for the UKB GRM here, and you may use it for your GRM calculation.
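
Restricting the GRM calculation to a SNP list like the one mentioned above amounts to intersecting that list with the variant IDs in the PLINK BIM file. The following is a minimal sketch, assuming the standard six-column BIM layout (chromosome, variant ID, genetic distance, position, allele 1, allele 2) and a SNP list with one ID per line. The function name is hypothetical.

```python
from typing import Iterable, List


def snps_to_extract(bim_lines: Iterable[str], snp_list: Iterable[str]) -> List[str]:
    """Return variant IDs from the BIM file that also appear in the SNP list.

    The result keeps the order of the BIM file. Malformed BIM rows
    (fewer than two fields) are skipped.
    """
    wanted = {snp.strip() for snp in snp_list if snp.strip()}
    kept = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        variant_id = fields[1]  # second column of a BIM file is the variant ID
        if variant_id in wanted:
            kept.append(variant_id)
    return kept


if __name__ == "__main__":
    bim = [
        "2 rs100 0 10583 A G",
        "2 rs200 0 10611 C T",
        "2 rs300 0 13302 C T",
    ]
    hapmap3_like = ["rs300", "rs100"]
    print(snps_to_extract(bim, hapmap3_like))
```

The resulting IDs, written one per line, could then be passed to the genotype tool's SNP-extraction option before computing the GRM.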

We would not recommend chromosome-wise analysis because we have not assessed it formally, although a sparse GRM from a single chromosome might be very similar to the real one. For smaller chromosomes, the estimated GRMs might not be that accurate.

labxpub commented 6 months ago

Thank you very much for your help, this should answer my question. Thanks again!