Closed cdeboever3 closed 5 years ago
@guhanrv pointed me to the notebook compare_cutoff_pvals. This notebook relies on results from the notebook check_firth. The disease prevalence values from check_firth are incorrect, though. For instance, if you compare the number of cases in the log files at private_output/PLINK_results/*log with the numbers in output/check_firth/phe_freq.csv, they don't match. The numbers in private_output/PLINK_results/*log do match the numbers in data/traits.tsv, though. It seems that Julia used 351459 subjects when calculating the disease prevalences rather than the 337208 samples used for the GWAS in plink.
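A minimal sketch of the kind of check described above, assuming the case counts have already been pulled out of the PLINK logs and phe_freq.csv into plain dictionaries (the parsing is not shown, and the function names are hypothetical). The two sample sizes are the ones quoted in this issue; the "reported" count of 503 is made up purely for illustration.

```python
# Sample sizes quoted in the issue.
N_PREVALENCE = 351459  # subjects apparently used for the prevalence calculation
N_GWAS = 337208        # samples used for the GWAS in plink


def prevalence(num_cases, num_samples):
    """Fraction of samples that are cases."""
    return num_cases / num_samples


def mismatched_counts(gwas_counts, reported_counts):
    """Return {trait: (gwas, reported)} for traits whose case counts differ."""
    return {
        trait: (n, reported_counts.get(trait))
        for trait, n in gwas_counts.items()
        if reported_counts.get(trait) != n
    }


# GWAS-side counts are taken from the first table in this issue.
gwas_counts = {"HC69": 482, "HC432": 487}
# Hypothetical stand-in for phe_freq.csv; the value 503 is invented.
reported_counts = {"HC69": 503, "HC432": 487}

diffs = mismatched_counts(gwas_counts, reported_counts)
for trait, (gwas_n, reported_n) in diffs.items():
    print(trait, gwas_n, reported_n)
```

Any trait that shows up in `diffs` would be a candidate for having been counted over the larger subject set.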
Using the correct trait counts, these traits were included but shouldn't have been (if the cutoff was 500):
       regtype   category  numcases  phenotype
HC69   logistic  HC        482.0     polycythaemia_vera
HC432  logistic  HC        487.0     mitral_valve_prolapse
HC421  logistic  HC        478.0     other_abdominal_problem
HC352  logistic  HC        474.0     systemic_lupus_erythematosis/sle
HC12   logistic  HC        492.0     testicular_problems_(not_cancer)
If we switch the cutoff to 470, we are missing these three traits:
       regtype   category  numcases  phenotype
HC278  logistic  HC        477.0     cerebral_aneurysm
HC375  logistic  HC        477.0     alcohol_dependency
HC256  logistic  HC        476.0     fracture_toe
Given that the traits we included are somewhat arbitrary anyway, I think it's fine to just say the cutoff is 470 and not include these three traits.
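The cutoff comparison above can be sketched as follows, using only the eight traits listed in this issue. This assumes the cutoff is inclusive (`>=`), which the issue does not state, and `included` here covers just the five traits from the first table rather than the full set of traits in the analysis.

```python
# Corrected case counts for the eight traits listed in this issue.
case_counts = {
    "HC69": 482, "HC432": 487, "HC421": 478, "HC352": 474, "HC12": 492,
    "HC278": 477, "HC375": 477, "HC256": 476,
}

# The five traits from the first table, i.e. ones that were included.
included = {"HC69", "HC432", "HC421", "HC352", "HC12"}


def passing(counts, cutoff):
    """Traits with at least `cutoff` cases (inclusive cutoff is an assumption)."""
    return {trait for trait, n in counts.items() if n >= cutoff}


# Included traits that fail a cutoff of 500 under the corrected counts.
wrongly_included_at_500 = included - passing(case_counts, 500)

# Traits that pass a cutoff of 470 but were not included.
missing_at_470 = passing(case_counts, 470) - included

print(sorted(wrongly_included_at_500))
print(sorted(missing_at_470))
```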
For the allele frequency, it seems that this was done correctly. The frequencies were calculated from the file private_output/print_rds/ukb_hla_v2_rounded_remove.txt, which has 337,208 rows.
The paper currently says something like
However, I'm not quite sure where these numbers come from.