Integrate recent UKB exome study results

eric-czech commented 3 years ago

Are there any plans to integrate the UKB RVAS/GWAS results from the genebass and AstraZeneca 300k exome association studies?

Along those lines, I was curious if you all had any thoughts on how to weight evidence from these studies since the varying types of tests used here certainly hint at varying degrees of biological significance, e.g. missense vs LoF hits, variant vs gene level groupings, protective vs risk alleles, burden vs SKAT associations, etc.

Thanks!

MayaGhoussaini commented 3 years ago

Yes. We are planning to integrate the UKB RVAS, but we’re still in the very early stages of planning. Several groups have now published results from the exome study, so our first priority would be to evaluate the different methods and gene-level variant aggregation criteria used to generate the results for the 300k exome data. We will then integrate the sumstats from one of the studies. If you have any views as to which study has the most useful results, then now is a great time to provide that feedback.

Regarding weighting evidence from these data, one of the plans we have in mind is to use the burden test results as predictors in our machine learning model to improve causal gene assignment at GWAS-associated loci (Locus to gene score). We haven’t started working on this yet, so again we’d be happy to hear from you if you have any suggestions going forward.

eric-czech commented 3 years ago

> If you have any views as to which study has the most useful results, then now is a great time to provide that feedback.

Here are a few notes on differences between them:

Number of traits:
- total: AZ = 18,780, Genebass = 3,700
- binary traits: AZ = 17,361, Genebass = 2,583
- quantitative traits: AZ = 1,419, Genebass = 1,117
Number of gene-level collapsing models: AZ = 12, Genebass = 3
- AZ models are here; Genebass uses just LoF, missense, and synonymous
Thresholds: AZ = 2 × 10^-9 (for gene and variant tests); Genebass = 2.5 × 10^-8 for SKAT-O tests, 6.7 × 10^-7 for burden tests, and 8 × 10^-9 for single-variant tests
- Note that the AZ threshold is more stringent than the Genebass gene-level thresholds by one or two orders of magnitude
Significant variant-level associations:
- total: AZ = 46,947, Genebass = 27,421
- binary traits: AZ = 5,193, Genebass = NA (they don't say)
- quantitative traits: AZ = 41,754, Genebass = NA (they don't say)
Significant gene-level associations:
- total: AZ = 1,703, Genebass = 4,560
- binary traits: AZ = 936, Genebass = NA (they don't say)
- quantitative traits: AZ = 767, Genebass = NA (they don't say)
Visibility
- The AZ UI is very minimal by comparison to the Genebass UI
- The information the Genebass UI exposes is really impressive, so I imagine many OT users would find that context quite helpful if they could link through to it
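
To put the threshold comparison in numbers, here is a minimal sketch that computes how many orders of magnitude separate the AZ threshold from each Genebass threshold. It uses only the values quoted in the list; the function and variable names are made up for illustration and do not come from either study's code.

```python
import math

# Significance thresholds as quoted in the comparison notes.
AZ_THRESHOLD = 2e-9  # AZ: gene-level and variant-level tests
GENEBASS_THRESHOLDS = {
    "SKAT-O": 2.5e-8,
    "burden": 6.7e-7,
    "single variant": 8e-9,
}


def orders_of_magnitude(stricter: float, looser: float) -> float:
    """Return log10(looser / stricter), i.e. the gap between two p-value cutoffs."""
    return math.log10(looser / stricter)


# Gap between the AZ cutoff and each Genebass cutoff (hypothetical helper output).
gaps = {
    test: orders_of_magnitude(AZ_THRESHOLD, threshold)
    for test, threshold in GENEBASS_THRESHOLDS.items()
}

for test, gap in gaps.items():
    print(f"{test}: AZ threshold is {gap:.1f} orders of magnitude stricter")
```

On these numbers the gap is roughly one order of magnitude for SKAT-O and about two and a half for burden tests, while the single-variant thresholds differ by less than one order.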

Overall I won't have a strong opinion until (hopefully) Genebass publishes the same kind of statistics as in the AZ publication:

> Notably, associations for 13% (3 of 24) and 29% (96 of 326) of the significant PTVs and missense variants, respectively, have not been reported in FinnGen release 5, OMIM, ClinVar or the GWAS catalogue

That would be the best way to compare the two IMO.

> Regarding weighting evidence from these data, one of the plans we have in mind is to use the burden test results as predictors in our machine learning model to improve causal gene assignment at GWAS-associated loci

🤔 I'm trying to imagine how that would help. Is the idea that evidence of pathogenic coding variation clearly linked to a gene (like in these studies) implies that the gene itself is more likely to be causally linked to non-coding GWAS variants/loci (i.e. it's a gene-specific covariate rather than one specific to both the locus and the gene in question for an L2G prediction)?

ktsirigos commented 2 years ago

@eric-czech is this OK if we post this on our Community space so it gets more visibility?

eric-czech commented 2 years ago

Yep, go for it.

On Thu, Dec 9, 2021 at 1:47 PM Kostas Tsirigos @.***> wrote:

> @eric-czech https://github.com/eric-czech is this OK if we post this on our Community space so it gets more visibility?

MayaGhoussaini commented 2 years ago

Since that discussion, we decided within the team to initially integrate the Regeneron burden test summary statistics (https://www.nature.com/articles/s41586-021-04103-z), partly because GWAS Catalog has already integrated the sumstats from this study and done the EFO mapping (though we may still need to do some additional manual mapping), and partly because it has the highest number of UK Biobank samples so far (~455k). For each gene they ran a strict burden test (rare pLoF variants) and a more permissive burden test (rare pLoF + likely deleterious missense variants), and they applied these across different MAF categories (≤ 1%, 0.1%, 0.01%, 0.001% and singletons), so 10 burden tests per gene.

How can we use them to improve gene prediction? Our initial thoughts are to potentially include the highly significant associations from the burden tests in the Gold Standards (where we're confident that a given gene is linked to a given trait/disease). We could also use the burden test results (including sub-threshold associations) as predictors to inform the L2G score: e.g. having non-coding common variants and rare coding variants at the same locus/gene associated with the same phenotype will give you more confidence in that gene being causal, even if the burden test isn't really significant (e.g. < 10^-11). However, we're always keen to hear other thoughts.
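
As a rough illustration of the test grid described above, the sketch below enumerates the 2 variant masks × 5 MAF categories and collapses the per-test p-values for one gene-trait pair into a single score. The score definition (`-log10` of the smallest p-value) and all names here are assumptions made for the example; the thread does not specify how burden results would actually be encoded as L2G predictors.

```python
from itertools import product
import math

# Masks and MAF categories as described in the comment above.
MASKS = ["pLoF", "pLoF + likely deleterious missense"]
MAF_BINS = ["<=1%", "<=0.1%", "<=0.01%", "<=0.001%", "singletons"]

# Every (mask, MAF bin) combination: 2 x 5 = 10 burden tests per gene.
BURDEN_TESTS = list(product(MASKS, MAF_BINS))


def burden_feature(p_values: dict) -> float:
    """Hypothetical predictor: -log10 of the smallest burden p-value for one
    gene-trait pair. Keeps sub-threshold signals instead of thresholding them
    away. Returns 0.0 when no test result is available."""
    observed = [p for p in p_values.values() if p is not None and p > 0]
    if not observed:
        return 0.0
    return -math.log10(min(observed))


# Made-up p-values for a single gene-trait pair, purely for illustration.
example = {
    ("pLoF", "<=1%"): 1e-3,
    ("pLoF + likely deleterious missense", "<=0.1%"): 1e-5,
}
print(len(BURDEN_TESTS), burden_feature(example))
```

Taking the minimum across ten correlated tests inflates the score somewhat, so a real predictor would probably need a correction or a pre-specified choice of test.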

eric-czech commented 2 years ago

> In terms of how we can use them to improve gene prediction? Our initial thoughts are to potentially include the highly significant associations from the burden tests in the Gold Standards (we're confident that a given gene is linked to given trait/disease). We could also use the Burden test results (including sub-threshold associations) as predictors to inform the L2G score ...

Using that data to pad out the labeled set of cases for L2G training is a good fit IMO. This is anecdotal (and why I was looking for this data at scale in the first place), but I don't see the binary case/control associations from that study uncovering much new biology for autoimmune diseases. While that might not be the case for other therapeutic/disease areas, my speculation would be that the quantitative phenotype associations are more useful than adding the binary disease associations on top of existing GWAS results. Basically if both aren't possible, more Gold Standards sound like a good direction. I would imagine there are reasonable ways to accomplish both though.

Also on the topic of quantitative phenotype associations from that study, have you all ever discussed something like a phenotype-to-disease score? That study uses genetic correlation to link the two, but there are plenty of other ways to link intermediate phenotypes/exposures to diseases they might be involved in that wouldn't be captured by the way you propagate evidence up through EFO now (or at least my understanding of it). MR and Mendelian clinical presentation are two that come to mind.

To make that more concrete, what I'm asking is if a study like this uncovers genetic regulators of eosinophil proportions, how can Open Targets count this as genetic evidence for asthma?

eric-czech commented 2 years ago

Noting that the AZ PheWAS and ExWAS study results are now available at https://az.app.box.com/v/azphewas-com.

d0choa commented 2 years ago

Thanks, @eric-czech. We got access to it a few weeks ago and we've been working on the phenotype mappings. You can follow some of the action for the collapsing/burden data in #1941 and we are hoping to make progress in including rare variants for the variant-centric info in the genetics portal.

eric-czech commented 2 years ago

Nice, thanks for that link! It's exciting to see that go in, and that's definitely helpful context for interpreting the new dimensions that data adds. Is there already an issue open on incorporating the variant-level associations in the genetics portal?

ireneisdoomed commented 2 years ago

Hi @eric-czech! Our plans are to ingest variant-level associations through the GWAS Catalog. We are in discussions with them and on their portal, you can already find summary statistics for the REGENERON analysis here and they are currently in the process of bringing the AZ data in.

However, to see these associations in the Genetics Portal, we will have first to redefine the set of variants that are indexed in the Portal to include those of low frequency. We are tracking this effort on this other issue: Important changes in the variant index [#2074]

d0choa commented 2 years ago

Closing here. We can always follow up on this conversation at https://community.opentargets.org, involving a broader audience. cc @HelenaCornu

eric-czech commented 2 years ago

> we will have first to redefine the set of variants that are indexed in the Portal to include those of low frequency

Got it, thanks @ireneisdoomed! That is tricky and makes me wonder how meaningful the L2G pipeline is for rare variants if LD and credible sets generally require higher AF 🤔 . Either way, thanks again for the info.

Integrate recent UKB exome study results #1817