rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

Support for multiple masks and mask hierarchy in burden tests #448

Open jerome-f opened 1 year ago

jerome-f commented 1 year ago

Hi, I recently noticed that regenie allows a variant to be part of more than one gene set, but does not accept multiple annotations for the same variant-gene pair, i.e. something like:

rsid1 gene1 missense
rsid1 gene1 splice_variant
rsid1 gene1 conserved

I understand that duplicate annotations for the same variant-gene pair would lead to double counting. But this could be handled with a user-specified hierarchy, e.g. plof>missense>synonymous: when combining masks, regenie could select a single annotation for a variant with multiple annotations based on the provided hierarchy. I can see multiple use cases where this feature might come in handy.

joellembatchou commented 1 year ago

Hi,

I am not sure I understand what your question is. The issue title is on "multiple masks" but in the text you are talking about multiple annotations per variant/gene pair.

Can you give a detailed example of the use-case you have (e.g. annotations for the variant/gene pair and set of masks you want to evaluate)?

Cheers, Joelle

jerome-f commented 1 year ago

Hi Joelle,

The use case in the above scenario would be:

Annotations:

rsid1 gene1 missense
rsid1 gene1 splice_variant
rsid1 gene1 conserved
rsid2 gene1 plof
rsid1 gene2 splice_variant
rsid2 gene2 missense

setlist:

gene1 rsid1,rsid2
gene2 rsid1,rsid2

maskdef:

plof plof
plof_splice plof,splice
plof_miss plof,missense
splice splice
conserved conserved

The masks plof_splice and plof_miss will conflict, as the same variant can have two different consequences for different genes. (I am not certain how frequently this happens in dbNSFP.)
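
To check how often this happens in a given annotation set, a small script can list the variant-gene pairs that carry more than one annotation. This is only a sketch: the function name `find_multi_annotated` is hypothetical, and it assumes the annotations have already been read into (variant, gene, annotation) tuples like the example above.

```python
from collections import defaultdict


def find_multi_annotated(rows):
    """Return {(variant, gene): [annotations]} for pairs with more than one distinct annotation."""
    seen = defaultdict(list)
    for variant, gene, annotation in rows:
        # Record each distinct annotation once, in order of first appearance.
        if annotation not in seen[(variant, gene)]:
            seen[(variant, gene)].append(annotation)
    return {pair: annos for pair, annos in seen.items() if len(annos) > 1}


# Example annotations from this thread.
rows = [
    ("rsid1", "gene1", "missense"),
    ("rsid1", "gene1", "splice_variant"),
    ("rsid1", "gene1", "conserved"),
    ("rsid2", "gene1", "plof"),
    ("rsid1", "gene2", "splice_variant"),
    ("rsid2", "gene2", "missense"),
]

conflicts = find_multi_annotated(rows)
for (variant, gene), annos in conflicts.items():
    print(variant, gene, ",".join(annos))
```

With the example data, only the rsid1/gene1 pair is reported, since the other pairs each have a single annotation.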

joellembatchou commented 1 year ago

Hi,

The same variant can have different annotations for different genes, but not different ones for the same gene. I think a workaround would be to use the most deleterious annotation so that you have a single annotation per variant-gene pair. Alternatively, perhaps the 4-column annotation file format (designed initially for protein domains) could be useful here (it allows for different annotations for the same variant in a gene across domains)?
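
The "most deleterious annotation" workaround can be applied as a pre-processing step before the annotation file is passed to regenie. The following is a minimal sketch, not part of regenie: the function `collapse_annotations` and the hierarchy ordering are hypothetical and should be replaced with whatever ranking fits your annotations.

```python
def collapse_annotations(rows, hierarchy):
    """Keep one annotation per (variant, gene) pair: the one ranked highest in `hierarchy`.

    `hierarchy` lists annotation names from most to least deleterious.
    """
    rank = {name: i for i, name in enumerate(hierarchy)}
    best = {}
    for variant, gene, annotation in rows:
        if annotation not in rank:
            raise ValueError(f"annotation {annotation!r} is not in the hierarchy")
        key = (variant, gene)
        # A lower rank index means more deleterious, so it replaces the current choice.
        if key not in best or rank[annotation] < rank[best[key]]:
            best[key] = annotation
    return [(variant, gene, annotation) for (variant, gene), annotation in best.items()]


# Assumed ordering for illustration only (most to least deleterious).
hierarchy = ["plof", "splice_variant", "missense", "conserved"]

rows = [
    ("rsid1", "gene1", "missense"),
    ("rsid1", "gene1", "splice_variant"),
    ("rsid1", "gene1", "conserved"),
    ("rsid2", "gene1", "plof"),
    ("rsid1", "gene2", "splice_variant"),
    ("rsid2", "gene2", "missense"),
]

collapsed = collapse_annotations(rows, hierarchy)
for variant, gene, annotation in collapsed:
    print(variant, gene, annotation)
```

Under the assumed ordering, rsid1 in gene1 keeps only splice_variant, while its annotation in gene2 is untouched because the collapse is done per variant-gene pair.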