pjgreer / ukb-rap-tools

Scripts and workflows for analyzing UK Biobank data from the DNANexus Research Analysis Platform

A question #2

Closed TrumanZYX closed 1 year ago

TrumanZYX commented 1 year ago

Dear Phil Greer, thanks for your help, but I have a question. Why is Hardy-Weinberg equilibrium analysis not performed during quality control? Like this: plink2 --file test --hardy --out ?

pjgreer commented 1 year ago

Feel free to add it to your script when you run it. The filter flag is --hwe <p-value>, where the p-value is the cutoff below which variants are excluded from your analysis.

In my experience, filtering on Hardy-Weinberg equilibrium removes risk variants and variants related to population structure that I would prefer to keep in the analysis. This is especially bad in a cohort as large as the UK Biobank, where even very minor deviations from HWE are considered highly significant. If you do filter on HWE, you must use a very small p-value due to the massive N in the UKB. In my testing I was using 1e-30 and still getting too many false positives; you should probably use something like 1e-60.
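As a minimal sketch of applying that flag (fileset names here are placeholders, not from these scripts):

```sh
# Hypothetical example: drop variants whose HWE exact-test p-value falls
# below a very stringent cutoff (1e-60, as discussed above).
plink2 --bfile my_genotypes \
       --hwe 1e-60 \
       --make-bed \
       --out my_genotypes_hwe_filtered
```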

You can read more about issues with HWE here: https://www.frontiersin.org/articles/10.3389/fgene.2017.00167/full and here: https://www.frontiersin.org/articles/10.3389/fgene.2020.00210/full

pjgreer commented 1 year ago

Remember, all the parameters I used in these analyses are starting points; feel free to edit them for your specific use case. As for your questions:

  1. I cannot seem to find it right now, but a paper came out a few years ago stating that MAFs as low as 0.001 were quite reliable in the UK Biobank v3 dataset. I chose the 0.006 cutoff to include edge cases where the MAF in cases may be over 1% but under 1% in controls.
  2. The v2 genotyping array includes 700K+ SNPs. The whole point of the first 3 scripts using the v2 data (GT_FILE_PREP folder) is to create the minimal data file that will be used to construct a genetic relationship matrix and the blocks for regenie step 1. The regenie paper recommended pruning, and I used slightly different parameters (see the pruning sketch after this list). The rule of thumb is that the input file for regenie step 1 needs to be 1) genome-wide and 2) have roughly the same number of markers as the number of subjects. I suspect between 400K and 500K is sufficient regardless of the number of subjects, but that has not been tested. By trimming ~300K markers in high LD with other markers, we maintain the relationship and population structure in the data, reduce the file size, and make downstream processes faster (liftover on 410K markers is much, much faster than liftover on 700K+ markers if you plan to analyze the exome data).
  3. I did not combine the two plink commands in 11* because earlier versions of plink2 were inefficient at processing bgen files and much better at analyzing pgen files. This was the recommendation of the author of plink2 as of 2020. Further, the appropriate number of threads depends on the number of CPUs in your VM; by default, plink will use as many threads as are available in the VM, so you do not need to state the number of threads explicitly.
  4. I admit I have only run about 6 GWAS in both plink2 and regenie. The p-values may be slightly different, but the locations of the peaks and the reported betas and odds ratios remain consistent between the methods. I have yet to run into an instance where the results do not agree between methods. If they do not agree, then you may have a bigger problem with your inputs or your assumptions.
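Regarding point 2, a rough sketch of that kind of LD pruning is below; the window/step/r² values and fileset names are illustrative only, not the exact parameters used in the GT_FILE_PREP scripts:

```sh
# Step 1: identify markers to keep after pruning for high LD
# (window size, step, and r^2 threshold are placeholder values)
plink2 --bfile ukb_v2_genotypes \
       --indep-pairwise 1000 100 0.9 \
       --out v2_prune

# Step 2: write a new fileset containing only the pruned-in markers
plink2 --bfile ukb_v2_genotypes \
       --extract v2_prune.prune.in \
       --make-bed \
       --out ukb_v2_pruned
```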

Hope this helps.

pjgreer commented 1 year ago

I would not recommend merging the data after the v3 QC (that would be script 11 if memory serves correctly).

A merged v3 file will be very large: ~11 million markers × the number of subjects you are using, and it would run as a single job. It is still advisable to run 23 separate regenie jobs for the analysis (1 per chromosome) and then combine the association files afterward.

Most PRS/PGS calculation programs operate on the post-GWAS summary statistics, not on the raw data itself.

The error you are getting is due to multi-allelic markers. These markers are hard to analyze and interpret. Luckily there are relatively few of them, and most researchers either exclude them from their analysis or put them in a file for a separate analysis. If you decide to exclude them, in plink1.9 you would add the flag --biallelic-only to the merge command, whereas in plink2 you would add the flag --max-alleles 2 to the merge command.
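As a minimal plink2 sketch of the --max-alleles 2 idea applied to a single fileset (the prefix is a placeholder; you would apply the same filter in your own merge command):

```sh
# Hypothetical example: write a copy of one fileset with multi-allelic
# markers removed; "chr1_imputed" is a placeholder fileset prefix.
plink2 --pfile chr1_imputed \
       --max-alleles 2 \
       --make-pgen \
       --out chr1_imputed_biallelic
```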

pjgreer commented 1 year ago

I am sorry, I am confused about why you need to make a combined-chromosome file for 400K+ subjects. I do not, and would not, recommend making a single file for the analysis. It will be multiple TBs when it is finally written, and it will take much longer to run regenie on the combined-chromosome file than to run it on each chromosome individually.

I thought you were finished with step 11, 11a-gwas-s2-imp37-qc-filter.sh. The next step, 12a-gwas-s2-imp37-regenie.sh, runs regenie on each chromosome, and finally 13a-gwas-imp37-result-merge-regenie.sh combines the result files into a single result.

If I understand correctly, you want to add a PRS step. This would be run between 12a and 13a; call it 12.5a if you wish. In this step, you would run some PRS calculation on each individual chromosome, the same as in the regenie workflow, and then, once completed, combine the results of the individual chromosomes into a single file, similar to 13a but specifically for the PRS outputs (see the sketch below). If a method for generating a PRS score must have all the data in a single file, then, IMHO, it is a very poor method, as it will not scale well to very large datasets.
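A rough sketch of that "12.5a" idea, assuming a placeholder PRS command and file names (how you combine depends on the tool: concatenating rows as 13a does for association files, or summing per-sample scores across chromosomes for a PRS):

```sh
# Run the per-chromosome step (the PRS command and its flags are placeholders)
for chr in {1..22} X; do
  your_prs_tool --input chr${chr}_input --out prs_chr${chr}.txt   # hypothetical tool
done

# Combine: keep the header from the first file, append the data rows from the rest
head -n 1 prs_chr1.txt > prs_all_chromosomes.txt
for chr in {1..22} X; do
  tail -n +2 prs_chr${chr}.txt >> prs_all_chromosomes.txt
done
```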

If I am wrong, please explain what step you are on in the regenie workflow and exactly what you want to accomplish.

Last, these errors are mainly plink errors, and you should look into the merge function on the plink documentation site: https://www.cog-genomics.org/plink/2.0/data#pmerge, specifically the section on --merge-max-allele-ct.
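For reference, a minimal sketch using that flag (the list-file and output names are placeholders; check the linked docs for the exact behavior in your plink2 version):

```sh
# chr_pfiles.txt is a placeholder text file listing one per-chromosome
# pfile (.pgen/.pvar/.psam) prefix per line; variants with more than 2
# alleles are excluded during the merge.
plink2 --pmerge-list chr_pfiles.txt \
       --merge-max-allele-ct 2 \
       --out merged_biallelic
```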