jianyangqt / gcta

GCTA software
GNU General Public License v3.0

Forcing fastgwa to run MLM if heritability is not significant #74

Closed chenpunk closed 8 months ago

chenpunk commented 8 months ago

Hello, I'm trying to run fastGWA and I have a few questions:

  1. Does it estimate h2g after regressing out covariates in the fastGWA-REML step? Would the heritability be absorbed by the PCs, etc.? What should I do before estimating heritability (i.e. GREML analysis or fastGWA-REML)?
  2. Is there a way to force it to run the MLM when the estimated heritability is not significant?

I ask because I'm practising GWAS and PRS analysis by dividing a 150k phenotyped sample into two groups: one (100k) for GWAS and the other (50k) for PRS.

As for heritability, I have run fastGWA on both the full 150k sample and the 100k subset, and fastGWA-REML gives heritability estimates of 0.139051 (P = 1.11898e-06) and 0.0682159 (P = 0.108751), respectively. I'm puzzled by this and wonder which heritability should be reported.

As for the GWAS, fastGWA falls back to linear regression instead of the mixed model when the estimated heritability is not significant. LR seems appropriate if there are fewer related individuals after splitting the sample. Even so, although reducing the sample size leaves fewer related individuals, I fear that the results of simple LR will differ substantially from those of the MLM. However, I'm not sure about this because the MLM is still running.

This might not be a technical problem with GCTA, but I find it hard to get a reply on the GCTA forum. I would appreciate any advice you can give. I will compare the GWAS results of the MLM and LR later.

longmanz commented 8 months ago

Hi, Thank you for using the tool. If you believe that the h2 estimated by fastGWA is not correct (or if the h2 is not statistically significant but you think otherwise), you can use the flag "--ge ${Vg} ${Ve}" (Vg and Ve are the genetic variance component and residual variance component estimates you believe are correct) to force fastGWA to use these two values for the MLM analysis. In that way, fastGWA will skip the REML step and directly use your input.
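
For concreteness, a minimal sketch of such a command, using the `--ge` flag described above; the file prefixes and the two variance-component values are placeholders, not from this thread:

```sh
# Force fastGWA-mlm to skip REML and use user-supplied variance components.
# 0.1 (Vg) and 0.9 (Ve) below are placeholder values; substitute your own estimates.
gcta64 --bfile geno \
       --grm-sparse sp_grm \
       --fastGWA-mlm \
       --ge 0.1 0.9 \
       --pheno pheno.txt \
       --qcovar pcs.txt \
       --covar covar.txt \
       --threads 10 \
       --out assoc_forced
```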

That being said, I would suggest you do the following checks:

  1. It is possible that the reason the h2 in your 100k sample is not significant is that you do not have enough related pairs in this dataset. You can check the difference in line counts between your .grm.sp and .grm.id files; that value is the number of related pairs in your dataset (see the snippet after this list). Is that number reasonable? Usually we would expect a value smaller than the total sample size of your dataset.

  2. The h2 estimated by fastGWA is not an appropriate estimate of heritability, not only because it is estimated from a sparse GRM, but also because you should not report any "h2" from an MLM GWAS that includes related pairs. You should run a proper GREML analysis in a subset of unrelated samples, or estimate it after you obtain your GWAS results using summary-based tools like LD Score regression.

  3. You mentioned that your MLM is still running. This is odd, because we would expect a similar runtime for LR and MLM in fastGWA. I suspect your sparse GRM is not correctly generated. Please make sure that your samples all have the same genetic ancestry background and that you are using QCed HapMap3 common SNPs to generate the GRM. Using participants from different genetic ancestries, or using unQCed SNPs (or rare SNPs), will inflate the relatedness coefficients in the GRM, and the sparse GRM will then be falsely dense (related to point 1), increasing the runtime of fastGWA.
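
As an illustration of the checks in points 1 and 2 above, a hedged sketch (the prefixes sp_grm and full_grm and the file pheno.txt are placeholders for your own files):

```sh
# Point 1: number of related pairs = lines in .grm.sp minus lines in .grm.id
echo $(( $(wc -l < sp_grm.grm.sp) - $(wc -l < sp_grm.grm.id) ))

# Point 2: GREML in approximately unrelated samples, here by removing one of
# each pair with estimated relatedness > 0.05 from a full (dense) GRM
gcta64 --reml --grm full_grm --grm-cutoff 0.05 --pheno pheno.txt --out greml_h2
```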

chenpunk commented 8 months ago

Thank you for the explanation on heritability; I will look into the methods and steps for estimating it.

First, I need to clarify the running time of the MLM. I was practising GWAS using UKB imputed data on a university server, so the "MLM still running" I mentioned last time meant that the MLM job on the 150k sample was queued. The actual running times of LR and MLM were close (MLM: 4h42m, LR: 5h30m).

However, I now have more questions after comparing the GWAS results of the MLM (150k) and LR (100k):

  1. Results of LR and MLM (attached QQ plots: 100k_LR and 150k_MLM): As you can see, I still get an inflated QQ plot after employing the MLM to control for sample relatedness. Here is how I ran the MLM (a sketch of these steps follows after this list). First, I generated the GRM in 50 parts, using the SNP list you provided in issue #67, in 457,946 white individuals (British, Irish and any other white background in UKB data field 21000) after removals based on sex-chromosome aneuploidy and heterozygosity, and then combined the parts into one GRM. Next, I generated the sparse GRM with a relatedness threshold of 0.05. Third, I QCed the 150k phenotyped sample and the variants with thresholds of missingness 0.10, MAF 0.01, HWE 1e-50 and INFO 0.3. Last, I included PCs and other covariates and ran fastGWA-mlm on the QCed sample. Which of these steps might be causing the problem?

  2. Sample selection when generating a GRM: Do the individuals included in the GRM affect the relatedness coefficients of other, independent pairs? For example, if the relatedness coefficient of individuals A and B is 0.5, does including or excluding an individual C change that estimate when generating the GRM? In my scenario, I created a GRM based on 457,946 individuals and applied it to a subset of 150k individuals. Is that reasonable? I did this because generating the GRM is very time-consuming, and it would be impractical to generate a separate GRM for each phenotype available in a different subset.

  3. SNP selection when generating a GRM: I used the SNP list you provided in issue #67. I'm not sure whether I need to QC these SNPs in my GRM-generating sample rather than using the list directly. I think HapMap3 has already been QCed, but MAF, HWE or other metrics might differ in my own sample.

  4. Related to point 1 of your last comment: the sparse GRM is currently based on 457,946 individuals and has 10,722,449 lines, so the number of related pairs is larger than the sample size. I don't know what it will look like after reducing to the 150k sample.
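
For reference, a minimal sketch of the pipeline described in point 1 above, assuming standard GCTA flags; prefixes such as ukb_geno, hm3_snps.txt and the ID/phenotype/covariate file names are placeholders for my actual files:

```sh
# 1. Partitioned GRM on the HapMap3 SNP list (run for parts 1..50, then merge)
gcta64 --bfile ukb_geno --extract hm3_snps.txt --make-grm-part 50 1 \
       --thread-num 10 --out full_grm
# ... repeat with part index 2..50, then concatenate the parts
cat full_grm.part_50_*.grm.id    > full_grm.grm.id
cat full_grm.part_50_*.grm.bin   > full_grm.grm.bin
cat full_grm.part_50_*.grm.N.bin > full_grm.grm.N.bin

# 2. Sparse GRM with a relatedness cutoff of 0.05
gcta64 --grm full_grm --make-bK-sparse 0.05 --out sp_grm

# 3. fastGWA-MLM with PCs and other covariates on the QCed 150k subset
gcta64 --mbfile imputed_bfiles.txt --grm-sparse sp_grm --fastGWA-mlm \
       --pheno pheno.txt --qcovar pcs.txt --covar covar.txt \
       --keep qced_150k_ids.txt --threads 10 --out fastgwa_mlm_150k
```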

Thanks again for your patience!

longmanz commented 8 months ago

Hi, Thank you for the detailed information. This is very helpful.

Re 1 and 4) Given that you are using data field 21000 and the number of lines in the .grm.sp file is very large, the problem is that the genetic ancestry of your samples is not homogeneous. In our 2019 paper, we briefly discussed that self-reported ancestry in UKB (field 21000) does not always match true genetic ancestry. I would recommend restricting your samples to those listed in field 22006 (https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22006), which was generated by UKB and contains individuals of the same Caucasian genetic ancestry. This is a more suitable subset for your GWAS analysis. A GRM from cross-ancestry data cannot be correctly calculated in GCTA or other similar MLM tools.
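
As an illustration, one way to apply this restriction when regenerating the GRM, assuming field 22006 has been exported to a two-column FID/IID file (the name ukb_22006_ids.txt is a placeholder):

```sh
# Regenerate the full and sparse GRMs only in the 22006 (White British) subset
gcta64 --bfile ukb_geno --extract hm3_snps.txt --keep ukb_22006_ids.txt \
       --make-grm --thread-num 10 --out full_grm_wb
gcta64 --grm full_grm_wb --make-bK-sparse 0.05 --out sp_grm_wb
```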

Re 2) Yes, this is fine. But given 1) and 4), you will need to regenerate your GRM using the samples listed in 22006. The reason is that when generating the GRM, each SNP is scaled by its mean and variance; if individuals with different genetic backgrounds are included, the mean and variance cannot be correctly calculated.
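
For reference (my wording, not from this thread), the standard GRM entry used by GCTA (Yang et al. 2010) standardises each SNP by its allele frequency in the GRM sample, which is where a mixed-ancestry sample distorts the scaling:

$$A_{jk} = \frac{1}{m}\sum_{i=1}^{m}\frac{(x_{ij}-2p_i)(x_{ik}-2p_i)}{2p_i(1-p_i)}$$

where $x_{ij}$ is the allele count of SNP $i$ in individual $j$, $p_i$ is the allele frequency of SNP $i$ estimated in the GRM sample, and $m$ is the number of SNPs.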

Re 3) Using that list should be fine, since you are generating the GRM for participants with Caucasian genetic ancestry. There is no need to perform further QC.

chenpunk commented 8 months ago

Thank you very much for the help; this answers my questions. Thanks again!

chenpunk commented 7 months ago

To follow up on this issue: I have re-run the GRM based on the Caucasian genetic ancestry subset, and the sparse GRM now has about 600k lines, which seems very reasonable. Next, I re-ran fastGWA-mlm on the new subset of 132k individuals and still got an inflated QQ plot (lambda = 1.13). I then ran LD Score regression to check whether the inflation is driven by polygenicity: the intercept is 1.0131 and the ratio is 0.0876. So I think fastGWA-mlm has controlled the relatedness confounding pretty well. I hope this helps anyone who runs into the same problem.
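
In case it is useful to others, a hedged sketch of that LD Score regression check, assuming the ldsc package and the standard European LD-score reference panel; the summary-statistics file name, column names and the sample size 132000 are placeholders to adapt to your own fastGWA output:

```sh
# Convert fastGWA summary statistics to ldsc format, then estimate h2,
# intercept and ratio with LD Score regression
munge_sumstats.py --sumstats fastgwa_mlm_132k.fastGWA --N 132000 \
                  --snp SNP --a1 A1 --a2 A2 --p P --signed-sumstats BETA,0 \
                  --out pheno
ldsc.py --h2 pheno.sumstats.gz \
        --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ \
        --out pheno_h2
```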