rivas-lab / hla-assoc

MIT License

Understand phenotype and allele filtering #6

Closed cdeboever3 closed 5 years ago

cdeboever3 commented 5 years ago

The paper currently says something like:

We included 175 allelotypes for 11 loci that had a frequency of 0.1% or greater in this cohort. We defined a set of diseases by XXX and included 270 diseases with at least 500 patients in this cohort.

However, I'm not quite sure where these numbers come from.

cdeboever3 commented 5 years ago

@guhanrv pointed me to the notebook compare_cutoff_pvals, which relies on results from the notebook check_firth. However, the disease prevalence values from check_firth are incorrect. For instance, if you compare the number of cases in the log files at private_output/PLINK_results/*log with the numbers in output/check_firth/phe_freq.csv, they don't match. The numbers in private_output/PLINK_results/*log do match the numbers in data/traits.tsv, though. It seems that Julia used 351459 subjects when calculating the disease prevalences rather than the 337208 samples used for the GWAS in plink.
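For anyone who wants to repeat this check, here is a minimal sketch of the comparison. The two sample sizes are the ones quoted above; the helper `find_mismatches` and the trait names and counts in the two dictionaries are hypothetical placeholders, not values from the repository. In practice the counts would be read from the PLINK logs and from phe_freq.csv.

```python
N_PREVALENCE = 351459  # subjects apparently used for the prevalence table
N_GWAS = 337208        # samples used for the PLINK GWAS

# Subjects counted in the prevalence table but not used in the GWAS.
extra_subjects = N_PREVALENCE - N_GWAS


def find_mismatches(reference, other):
    """Map each trait whose counts differ to (reference_count, other_count)."""
    return {
        trait: (count, other.get(trait))
        for trait, count in reference.items()
        if other.get(trait) != count
    }


# Hypothetical counts for illustration only, not values from the repository.
plink_counts = {"trait_a": 480, "trait_b": 1500}
table_counts = {"trait_a": 501, "trait_b": 1500}

print(extra_subjects)
print(find_mismatches(plink_counts, table_counts))
```

Any trait reported by `find_mismatches` would need its case count taken from the PLINK logs (or data/traits.tsv) before applying a case-count cutoff.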

cdeboever3 commented 5 years ago

Using the correct trait counts, these traits were included but shouldn't have been (if the cutoff was 500):

        regtype category  numcases                         phenotype
HC69   logistic       HC     482.0                polycythaemia_vera
HC432  logistic       HC     487.0             mitral_valve_prolapse
HC421  logistic       HC     478.0           other_abdominal_problem
HC352  logistic       HC     474.0  systemic_lupus_erythematosis/sle
HC12   logistic       HC     492.0  testicular_problems_(not_cancer)

If we switch the cutoff to 470, we are missing these three traits (they meet the cutoff but were not included):

        regtype category  numcases           phenotype
HC278  logistic       HC     477.0   cerebral_aneurysm
HC375  logistic       HC     477.0  alcohol_dependency
HC256  logistic       HC     476.0        fracture_toe

Given that the traits we included are somewhat arbitrary anyway, I think it's fine to just say the cutoff is 470 and not include these three traits.
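As a quick sanity check on the two candidate cutoffs, the sketch below applies them to the case counts from the two tables above. It assumes "at least N patients" means `numcases >= N`; the function name `passes_cutoff` is only for illustration.

```python
# Case counts copied from the two tables above.
included = {
    "HC69": 482,
    "HC432": 487,
    "HC421": 478,
    "HC352": 474,
    "HC12": 492,
}
not_included = {
    "HC278": 477,
    "HC375": 477,
    "HC256": 476,
}


def passes_cutoff(numcases, cutoff):
    """Assumes 'at least `cutoff` patients' means numcases >= cutoff."""
    return numcases >= cutoff


# Included traits that fail a cutoff of 500.
below_500 = sorted(t for t, n in included.items() if not passes_cutoff(n, 500))

# Traits that pass a cutoff of 470 but were not included.
missing_at_470 = sorted(t for t, n in not_included.items() if passes_cutoff(n, 470))

print(below_500)
print(missing_at_470)
print(min(included.values()))
```

The smallest included trait has 474 cases, so a stated cutoff of 470 covers all five of the traits that were included.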

cdeboever3 commented 5 years ago

For the allele frequency, it seems that this was done correctly. The frequencies were calculated from the file private_output/print_rds/ukb_hla_v2_rounded_remove.txt which has 337,208 rows.
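To make the 0.1% threshold concrete, here is a rough sketch of what it implies for a file with 337,208 rows. It assumes each row holds a per-sample copy count (0, 1 or 2) for an allelotype and that frequency is computed per chromosome (2 per sample). Neither assumption is confirmed by the notebooks, and `allele_frequency` and the toy dosages are hypothetical.

```python
N_SAMPLES = 337208             # rows in the genotype file, per the comment above
N_CHROMOSOMES = 2 * N_SAMPLES  # assumption: diploid, frequency per chromosome


def allele_frequency(dosages):
    """Frequency of one allelotype from per-sample copy counts (0, 1 or 2)."""
    return sum(dosages) / (2 * len(dosages))


# Smallest copy count whose frequency is >= 0.1%, using integer ceiling division.
min_copies = -(-N_CHROMOSOMES // 1000)

# Hypothetical toy data: 4 copies across 8 samples (16 chromosomes).
toy_dosages = [0, 1, 0, 2, 0, 0, 1, 0]

print(min_copies)
print(allele_frequency(toy_dosages))
```

Under these assumptions, an allelotype would need to be seen on at least `min_copies` chromosomes in the cohort to be kept.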