Closed TrumanZYX closed 1 year ago
Feel free to add it to your script when you run it, but the filter flag is: --hwe p-value where p-value is the cutoff for what you will exclude from your analysis.
In my experience, I find filtering by Hardy-Weinberg Equilibrium to remove risk variants and variants related to population structure that I would prefer to keep in the analysis. This is especially bad in such a large cohort as the UK biobank where even very minor deviations from HWE are considered highly significant. If you do filter on HWE you must use a very small p-value due to the massive N in the UKB. In my testing, I was using 1e-30 and still getting too many false positives. You probably should use something like 1e-60.
You can read more about issues with HWE here: https://www.frontiersin.org/articles/10.3389/fgene.2017.00167/full and here: https://www.frontiersin.org/articles/10.3389/fgene.2020.00210/full
Remember all the parameters I used in these analysis are starting points. feel free to edit them for your specific use case. As for your questions:
Hope this helps.
I would not recommend merging the data after v3 qc. (that would be script 11 if memory serves correctly)
A merged v3 file will be very large, ~11 million markers X the number of subjects you are using is very big and will run as a single job. It is still advisable to run 23 separate regenie jobs for the analysis (1 per chromosome) and then combine the association files afterward.
Most PRS/PGS calculation programs operate on the post GWAS summary statistics, not on the raw data itself.
This error you are getting is due to multi-allelic markers. These markers are hard to analyze and interpret. Luckily there are relatively few of these markers and most researchers either exclude them from their analysis or put them in a file for a separate analysis. If you just decide to exclude them, in plink1.9 you would add the flag --biallelic-only
to the merge command, whereas in plink2 you would use the flag --max-alleles 2
to the merge command.
I am sorry, I am confused why you need to make a combined chromosome file for 400K+ subjects. I do not, and would not recommend making a single file for analysis. This will be a multiple TBs when it is finally written. It will take much longer to run regenie on the combine chromosome file rather than run it on each chromosome individually.
I thought you were finished with step 11, 11a-gwas-s2-imp37-qc-filter.sh
. The next step 12a-gwas-s2-imp37-regenie.sh
is to run regenie on each chromosome and then finally run 13a-gwas-imp37-result-merge-regenie.sh
to combine the result file into a single result.
If I understand correctly, you want to add a PRS step. This would be run between 12a and 13a. call it 12.5a if you wish. In this step, you would run some PRS calculation on each individual chromosome, the same as in the regenie workflow, and then once completed, combine the results of the individual chromosomes into a single file similar to 13a, but specifically for the PRS outputs. If the method for generating a PRS score must have all the data in a single file, then, IMHO, that is a very poor method as it will not scale well for very large datasets.
If I am wrong, please explain what step you are on in the regenie workflow and exactly what you want to accomplish.
Last, these errors are mainly plink errors and you should look into the merge function on the plink documentation site:
https://www.cog-genomics.org/plink/2.0/data#pmerge specifically the section on --merge-max-allele-ct
Dear Phil Greer, Thanks for your help. But I have a questions. Why is Harwin Equilibrium Analysis Not Performed During Quality Control? Like this: Pilnk2 --file test --hardy --out ?