opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Develop data model to capture gene-based association from exome sequencing analysis #1941

Closed DSuveges closed 2 years ago

DSuveges commented 2 years ago

Background

We want to explore extracting gene/disease associations from the exome sequencing analyses carried out by Regeneron Genetics Center.

Their work is collected in their recent publication.

Data (from supplementary data)

Significant trait associations with rare variants are reported in Supplementary Data 2 and 3.

Summary statistics for the rare variants tested in this study are also available in the GWAS Catalog (accession IDs are in Supplementary Data 4 and are listed separately for single variants and burden tests).

There are 2 key differences between what we get from GWAS Catalog and what we can extract from the publication:

To extract the gene level associations I will therefore have to use the raw data from their Supplementary Data 2 (available here), which is really comprehensive. Some metrics:

Proposed schema v0

{
"datasourceId": "regeneron_exwas",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"variantId": "1_1303203_G_A",
"variantFunctionalConsequenceId": "SO_0001583", // From 'Marker type'
"studyId": "GCST90083260",
"pValue": 9.250000e-12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106
}
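For reference, the `variantId` in the record above appears to follow the `chromosome_position_ref_alt` layout. A minimal sketch of splitting such an identifier into its parts, assuming that layout holds for every record; the names `Variant` and `parse_variant_id` are hypothetical and not part of any Open Targets codebase:

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Parts of a variant ID (hypothetical helper type for illustration)."""
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an ID such as '1_1303203_G_A' into its four components.

    Assumes exactly four underscore-separated fields; anything else
    raises ValueError from the tuple unpacking below.
    """
    chromosome, position, ref, alt = variant_id.split("_")
    return Variant(chromosome, int(position), ref, alt)


variant = parse_variant_id("1_1303203_G_A")
print(variant.chromosome, variant.position, variant.ref, variant.alt)
```

The chromosome is kept as a string so that values such as `X` would not need special handling.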

Open Questions:

The exploration notebook is available here: https://github.com/ireneisdoomed/random_notebooks/blob/main/exome_data/regeneron/exploration.ipynb

ireneisdoomed commented 2 years ago

Some notes that I took on the data from the GWAS Catalog:

Data (from GWAS Catalog)

587 single point associations extracted from processing the associations table from GWAS Catalog (All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)

👉 Remember that top loci have a cut-off of P < 1e-5

7972 studies extracted from processing the study table from GWAS Catalog (All studies v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)

ireneisdoomed commented 2 years ago

Proposed schema v1

I've also updated some of the open questions posted above based on a discussion with @DSuveges today.

d0choa commented 2 years ago

This is looking so promising already. Some thoughts:

ireneisdoomed commented 2 years ago

Proposed schema V2

Comparison to V1:

I've asked Annalisa for her opinion on using HANCESTRO, and she is positive provided that we are consistent about this with other sources.

{
    "datasourceId": "gene_burden",
    "datatypeId": "genetic_association",
    "targetFromSourceId": "ACAP3",
    "diseaseFromSource": "6mm weak meridian left (5097)",
    "diseaseFromSourceMappedId": "EFO_0004731",
    "pValueMantissa": 9.25,
    "pValueExponent": -12,
    "beta": -0.148,
    "betaConfidenceIntervalLower": -0.191,
    "betaConfidenceIntervalUpper": -0.106,
    "oddsRatio": null,
    "oddsRatioConfidenceIntervalLower": null,
    "oddsRatioConfidenceIntervalUpper": null,
    "resourceScore": 9.250000e-12,
    "ancestry": "HANCESTRO_0009",
    "literature": ["34662886"],
    "publicationYear": 2021,
    "projectId": "REGENERON",
    "cohortId": "UK Biobank",
    "studyId": "ADD-WGR-FIRTH_M3.0001",
    "studyOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
    "studySampleSize": 89735,
    "studyCases": 89735,
    "urls": [
        {
            "url": "https://genetics.opentargets.org/study/GCST90083260",
            "niceName": "GCST90083260"
        }
    ]
}

These are a result of David's comments above, the discussion held in the 23/03 Data SU, and the effort to harmonise the schema across AZ and Regeneron data.

Open questions:

This model would currently work with the data provided in the AZ PheWAS Portal. I'll provide an example with an analysis of their data in a separate issue.
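The V2 schema stores the p-value as a `pValueMantissa` / `pValueExponent` pair next to the raw value. A minimal sketch of how such a pair could be derived from a float, assuming base-10 scientific notation with six decimal places; the function name `split_p_value` is hypothetical and this is not necessarily how the actual pipeline does it:

```python
def split_p_value(p_value: float) -> tuple[float, int]:
    """Return (mantissa, exponent) such that p_value ~= mantissa * 10**exponent.

    Relies on Python's 'e' format, which already normalises the mantissa
    into the [1, 10) range, so no manual rounding logic is needed.
    """
    if p_value <= 0:
        raise ValueError("p-value must be positive to have a base-10 exponent")
    mantissa_text, exponent_text = f"{p_value:e}".split("e")
    return float(mantissa_text), int(exponent_text)


mantissa, exponent = split_p_value(9.25e-12)
record_fragment = {"pValueMantissa": mantissa, "pValueExponent": exponent}
print(record_fragment)
```

Going through the formatted string rather than `math.log10` sidesteps edge cases where rounding would push the mantissa to 10.0.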

ireneisdoomed commented 2 years ago

Proposed schema v3

In general this version drops the previous idea of referring to the statistical method applied under studyId, leaving this field for the GCST ID for the sake of consistency with other data sources.

Comparison to V2:

{
    "datasourceId": "gene_burden",
    "datatypeId": "genetic_association",
    "targetFromSourceId": "ACAP3",
    "diseaseFromSource": "6mm weak meridian left (5097)",
    "diseaseFromSourceMappedId": "EFO_0004731",
    "pValueMantissa": 9.25,
    "pValueExponent": -12,
    "beta": -0.148,
    "betaConfidenceIntervalLower": -0.191,
    "betaConfidenceIntervalUpper": -0.106,
    "oddsRatio": null,
    "oddsRatioConfidenceIntervalLower": null,
    "oddsRatioConfidenceIntervalUpper": null,
    "resourceScore": 9.250000e-12,
    "ancestry": "EUR",
    "ancestryId": "HANCESTRO_0009",
    "literature": ["34662886"],
    "projectId": "REGENERON",
    "cohortId": "UK Biobank",
    "studyId": "GCST90083260",
    "studySampleSize": 89735,
    "studyCases": 89735,
    "statisticalMethod": "ADD-WGR-FIRTH_M3.0001",
    "statisticalMethodOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
    "urls": null
}

Other than that, I have checked the records where the number of controls is 0. @DSuveges was absolutely right: these are studies of quantitative traits, for which there are no controls.
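The check described above can be sketched as follows, assuming the number of controls is simply `studySampleSize` minus `studyCases` (the schema has no explicit controls field, so this derivation is an assumption); the helper names are hypothetical:

```python
def count_controls(record: dict) -> int:
    """Derive the number of controls from a V3-style record.

    Assumption: controls = studySampleSize - studyCases.
    """
    return record["studySampleSize"] - record["studyCases"]


def has_no_controls(record: dict) -> bool:
    """True when every sample counts as a case, as seen for quantitative traits."""
    return count_controls(record) == 0


example = {
    "studyId": "GCST90083260",
    "studySampleSize": 89735,
    "studyCases": 89735,
}
print(count_controls(example), has_no_controls(example))
```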

I've submitted a PR to the JSON Schema to reflect this data model (#144)