pgxcentre / genetest

Python package for efficient genetic association analyses
Other
3 stars 2 forks source link

Any documentation of the results provided by the genetest using survival GWAS #14

Open yzharold opened 6 years ago

yzharold commented 6 years ago

Hi, Genetest team,

I recently conducted the analysis using Genetest, and I got a results file which contains 15 columns as below,

           snp             chr          pos major minor        maf               n        ll         coef

22:16231001_TA_T 22 16231001 TA T 0.01176545 1068 -922.6917 0.1897784 se hr hr_lower hr_upper z p 1 0.7915108 1.208982 0.2562622 5.703677 0.2397673 0.8105107

I wonder whether you would have a document to explain what are the columns especially, how you obtain the column with n, since I want to compare results from R code. And my calculation was based on the imputation dosage date from impute2 and impute2_info files.

Really appreciate your help.

Harold

lemieuxl commented 6 years ago

Hi Harold,

Here is a quick description of the columns in the output file. I'll try and update the documentation as soon as possible.

The apparent discrepancy between R and genetest n values would come from the probability threshold. Setting lower quality imputed genotypes to missing will decrease the number of samples for the analysis.

The aim of the probability threshold is to exclude poor quality imputed genotypes.

To obtain the same results as R, you can set the probability threshold to 0 (see the documentation about the impute2 parser for more information).

yzharold commented 6 years ago

Thanks a lot, lemieuxl, for your clarification of the columns, for my case, I already pre-selected SNVs with the imputation score greater than 0.5 for example, also for my understand the info score used by impute2 program is based on the variants, not the individual level, I am a little confused about your comment on " the genotype reader changes imputed genotypes with probability lower than 0.9 to missing"? Fox example, given a particular SNV, all the individuals will be imputed, if the info score (imputation quality) is greater than info=0.9, I would use all the dosages calculated from the impute2 for all the individuals for this given marker (in this case, there is no missing dosage, every individual should have values, unless the imputation was failed for particular individual, which is really rare.), no matter how small the dosage would be, given each individual, in R, I calculated the allele dosage as 0probAA + 1probAB + 2*probBB, where probAA, probAB and probBB are reported from impute2 program for each SNV. Could you be more specific about your process how you generate n given your probability threshold 0.9, how do you calculate your probabilities for each individual to compare with your probability threshold (0.9 as default)?

Thanks,

Harold

yzharold commented 6 years ago

The formula should be dosage of allele B = 0probAA + 1probAB + 2*probBB dosage of allele A = 2 - dosage of allele B Enforce the allele dosage together is 2 in theory.

lemieuxl commented 6 years ago

Yes, the information score provided by impute2 is computed at the variant level. Hence, it gives an idea of the variant's imputation quality. The tool does not filter variants based on imputation quality, you need to use the --extract option to manually perform the analysis on a selected subset of markers, if required. Note that we use an index, hence extracting markers is faster than parsing the whole file.

We compute the dosage for the minor allele, so that regression coefficient represents a change according to the "number" of the minor allele (e.g. like what Plink is doing). Here is a summary of what the algorithm is doing:

  1. It computes the dosage of the alternative allele (dosage_alt = 2*prob_alt,alt + prob_ref,alt).
  2. It removes dosage values computed from poor probabilities (see below for a description) by setting the value as missing (i.e. no call).
  3. It merges genotypes and phenotypes.
  4. It excludes missing values.
  5. It computes the MAF from the dosage vector and change encoding (2 - dosage_alt,alt) if the MAF is higher than 0.5 (so that values close to 2 always represent a homozygous genotypes of the minor allele).
  6. It performs the regression.

Here is what we do to determine if a dosage value for a sample was computed from poor probabilities. There are three probabilities for each sample (prob_ref,ref, prob_ref,alt and prob_alt,alt). If the maximum of those three values is lower than the user defined threshold (0.9 by default), we set the dosage to NA (missing, no call).

Example 1: The sample has the three probabilities 0, 0.01 and 0.99. The maximal probability is 0.99, hence the dosage value will be 1.99.

Example 2: The sample has the three probabilities 0.1, 0.5 and 0.4. The maximal probability is 0.4, which is lower than the default threshold of 0.9. The dosage is set to missing. If the threshold is set to 0, then the dosage would be 1.3.

This has for effect to reduce the number of samples used for the analysis (hence different values for n). If you want to keep all the dosage values, just set this probability threshold to 0, so that it matches the results obtained by R (no probability filtering).