rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Rationale behind AAF and not MAF? #256

Closed Ojami closed 2 years ago

Ojami commented 2 years ago

Hi Joelle,

I was wondering why REGENIE relies on the alternative allele frequency (AAF) instead of the minor allele frequency (MAF) in region-/gene-based tests. MAF seems more intuitive than AAF (and is what SAIGE-GENE uses). Relying on AAF can be problematic when the alternative allele is not the minor allele (or am I wrong?).

As an example:

#CHROM  ID               REF  ALT  ALT_CTS  OBS_CT
21      21:10413783:A:G  A    G    352977   353112
21      21:10413787:C:T  C    T    10       355362

21:10413783:A:G and 21:10413787:C:T cannot be used in the same mask (by setting --aaf-bins) if one is interested in testing only rare variants (MAF < 1%): both are rare, but the first has AAF ≈ 99.96% (MAF ≈ 0.04%), while the second has AAF ≈ 0.003%.
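To make the arithmetic behind this example explicit, here is a minimal Python sketch. It only uses the two rows of counts shown above; the function name `aaf_and_maf` and the `summary` dictionary are hypothetical helpers for illustration and are not part of REGENIE or PLINK.

```python
# Illustrative only: counts copied from the two example rows above.
variants = {
    "21:10413783:A:G": (352977, 353112),  # (ALT_CTS, OBS_CT)
    "21:10413787:C:T": (10, 355362),
}


def aaf_and_maf(alt_cts, obs_ct):
    """Return (AAF, MAF) for one biallelic variant."""
    aaf = alt_cts / obs_ct
    return aaf, min(aaf, 1.0 - aaf)


threshold = 0.01  # 1%, written as a fraction

summary = {}
for vid, (alt_cts, obs_ct) in variants.items():
    aaf, maf = aaf_and_maf(alt_cts, obs_ct)
    summary[vid] = {
        "aaf": aaf,
        "maf": maf,
        "rare_by_aaf": aaf < threshold,  # what an AAF-based bin sees
        "rare_by_maf": maf < threshold,  # what a MAF-based bin would see
    }
    print(f"{vid}  AAF={aaf:.6f}  MAF={maf:.6f}  "
          f"rare_by_aaf={aaf < threshold}  rare_by_maf={maf < threshold}")
```

Under a 1% cutoff, the first variant is rare by MAF but not by AAF, which is why the two variants end up in different bins when the cutoff is applied to AAF.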

Here, I can see that the authors mentioned:

For each of these groups, we considered five separate burden masks per gene, based on the frequency of the alternative allele of the variants that were screened in that group: MAF ≤ 1%, MAF ≤ 0.1%, MAF ≤ 0.01%, MAF ≤ 0.001%, and singletons only.

To me, it seems they mixed up AAF with MAF. The only (?) workaround is --aaf-file, where the user treats REF/ALT as flipped for variants with AAF > 0.5 and supplies the MAF (1 - AAF) for those variants. This essentially means that the user has to go against this part:

Each line contains the variant name followed by its AAF (it should correspond to ALT allele used in the genetic data input).
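The workaround described above can be sketched in Python as follows. This is an assumption-laden illustration, not REGENIE's own tooling: the helper `aaf_file_lines` is hypothetical, the space separator is assumed (the quoted description only says the variant name is followed by its AAF), and writing a folded value deliberately departs from the requirement that the AAF correspond to the ALT allele.

```python
# Illustrative only: build the lines of a user-supplied AAF file.
# The space delimiter is an assumption; check the REGENIE documentation.
counts = {
    "21:10413783:A:G": (352977, 353112),  # (ALT_CTS, OBS_CT)
    "21:10413787:C:T": (10, 355362),
}


def aaf_file_lines(counts, fold_to_maf=False):
    """Return one 'variant_name value' line per variant.

    With fold_to_maf=True the value is min(AAF, 1 - AAF), i.e. the
    workaround discussed in this thread, not the documented usage.
    """
    lines = []
    for vid, (alt_cts, obs_ct) in counts.items():
        aaf = alt_cts / obs_ct
        value = min(aaf, 1.0 - aaf) if fold_to_maf else aaf
        lines.append(f"{vid} {value:.6g}")
    return lines


documented = aaf_file_lines(counts)                 # AAF of the ALT allele
folded = aaf_file_lines(counts, fold_to_maf=True)   # workaround values

for line in folded:
    print(line)
```

As noted later in this thread, such a file only changes which variants enter a mask; it does not flip the alleles used when the mask is computed.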

On a different note, the description of --vc-maxAAF optional argument states:

AAF upper bound to use for SKAT/ACAT-type tests [default is 100%]

Does this mean the cutoff should be given in percent rather than as a fraction (e.g. to keep only variants with AAF below 0.01, should one set this option to 1)?

Thanks/Oveis

joellembatchou commented 2 years ago

Hi Oveis,

Yes, Regenie uses AAF to determine which variants go into the set-based tests. This should not be an issue when the REF allele follows the reference genome, as the reference allele is usually the major allele (all the more so when narrowing down to rarer variation, as done in the paper you referenced).

For the --vc-maxAAF option, the default upper bound is indeed 100% (= 1), meaning all variants go into the test regardless of AAF. As with the --aaf-bins option, you should specify the AAF as a fraction, not a percentage (e.g. --vc-maxAAF 0.01 to have only variants with AAF below 1% in the test).

Cheers, Joelle

Ojami commented 2 years ago

Hi Joelle,

Thanks for the clarification. In the example above, the two variants are indeed from PLINK BED files (PLINK --freq counts), and both are rare. As seen, ALT is not the minor allele for the first variant; however, I admit that this is not usually the case, and for the majority of variants ALT == minor (at least for the UKB WES data, VCF -> BED from DNAnexus).
Nonetheless, with the MA instead of ALT, one wouldn't need the --singleton-carrier flag, and it would be more intuitive when using MAC-specific options (e.g. --vc-MACthr).

I'll close this issue, since it doesn't greatly affect the output summary stats anyway.

Best/Oveis

burulca commented 1 year ago

Hello @joellembatchou ,

We are setting up a rare-variant burden test with Regenie for UKBB WES data and came across the same issue described here (for some variants A1 is not the minor allele even when using the reference genome), although, as stated above, it is quite rare. Still, I wanted to check: has anything changed in Regenie in this respect since last year? If I understand correctly, even if we manually provide an AAF file in which we replace the AAF with (1 - AAF) for those cases so that they go into the aaf-bins, the beta and A1FREQ would still refer to the major allele, so basically we should not do that, right? So the only solution, if we really want all rare variants (e.g. either AAF > 99.99% or AAF < 0.01%), would be to recode the alleles? Thank you!

Best, Burulca

joellembatchou commented 1 year ago

Hi Burulca,

No change has been made to the REGENIE software. As you stated, the AAF file only specifies which variants go into a mask; it does not flip the alleles when computing the mask. So yes, recoding the alleles would be an alternative solution.

Cheers, Joelle
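
The allele recoding mentioned above can be illustrated on the two example variants from the start of this thread. The Python sketch below is hypothetical (the function `recode_to_minor_alt` is not part of REGENIE or PLINK) and only rewrites the per-variant summary records; the genotype files themselves would have to be recoded consistently with a suitable tool, which is not shown here. Keeping the variant ID unchanged after the swap is also an assumption.

```python
# Illustrative only: swap REF/ALT wherever ALT is the major allele.
records = [
    # (CHROM, ID, REF, ALT, ALT_CTS, OBS_CT)
    ("21", "21:10413783:A:G", "A", "G", 352977, 353112),
    ("21", "21:10413787:C:T", "C", "T", 10, 355362),
]


def recode_to_minor_alt(records):
    """Return records in which ALT is always the minor allele."""
    out = []
    for chrom, vid, ref, alt, alt_cts, obs_ct in records:
        if alt_cts / obs_ct > 0.5:
            # ALT is the major allele: swap labels and complement the count.
            ref, alt = alt, ref
            alt_cts = obs_ct - alt_cts
        out.append((chrom, vid, ref, alt, alt_cts, obs_ct))
    return out


recoded = recode_to_minor_alt(records)
for rec in recoded:
    print(*rec, sep="\t")
```

After recoding, the AAF of every variant equals its MAF, so AAF-based bins behave like MAF-based bins, and any effect estimate reported for the ALT allele then refers to the minor allele.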