Develop data model to capture gene-based association from exome sequencing analysis

DSuveges commented 2 years ago

Background

We want to explore extracting gene/disease associations from the exome sequencing analyses carried out by Regeneron Genetics Center.

Their work is collected in their recent publication.

Data (from supplementary data)

Significant trait associations with rare variants are reported in Supplementary Data 2 and 3.

Table 2: associations from European ancestries
Table 3: associations from non-European ancestries

All have a p-value of at most 2.18e-11.

Summary statistics for the rare variants tested in this study are also available in the GWAS Catalog (accession IDs are in Supplementary Data 4 and are listed separately for single variants and burden tests).

There are 2 key differences between what we get from GWAS Catalog and what we can extract from the publication:

Burden analyses are out of scope for GWAS Catalog.
When comparing both associations tables per accession, I see no overlap between them. This has already been reported to the GWAS Catalog (link here)

To extract the gene level associations I will therefore have to use the raw data from their Supplementary Data 2 (available here), which is really comprehensive. Some metrics:

8865 variant/gene/trait associations
- 168 of them without a Study accession
2283 gene/trait pairs
18285 records. Breakdown per ancestry:
- EUR 17544
- SAS 433
- AFR 182
- EAS 126
564 genes
492 traits
611 variants

973 GWAS Catalog accessions.

168 associations without a study.

0 overlap between the associations reported by GWAS Catalog and the raw data. Sample record to show the available data:

Ancestry                                                                                                                                                                    EUR
Gene                                                                                                                                                                        ACAP3
Trait                                                                                                                                               6mm weak meridian left (5097)
Trait description                                                                                                               This is the weak meridian of keratometry resul...
Trait type                                                                                                                                                                     QT
Marker                                                                                                                                                              1:1303203:G:A
Chr                                                                                                                                                                             1
Position                                                                                                                                                                  1303203
Reference allele                                                                                                                                                                G
Effect allele                                                                                                                                                                   A
Marker type                                                                                                                                                           DelMissense
Variant effect on protein sequence                                                                                                                          p.Arg62Cys:p.Arg20Cys
Effect (95% CI)                                                                                                                                           -0.148 (-0.191, -0.106)
P-value                                                                                                                                                                       0.0
Effect direction                                                                                                                                                                -
N cases with 0|1|2 copies of effect allele                                                                                                                          88038|1684|13
N controls with 0|1|2 copies of effect allele                                                                                                                            NA|NA|NA
Effect allele frequency                                                                                                                                                    0.0095
Minor allele count                                                                                                                                                           1710
Most significant trait-variant pair for this gene?                                                                                                                            Yes
Most significant variant for this gene-trait pair?                                                                                                                            Yes
Effect after controlling for GWAS signals (95% CI)                                                                                                         -0.046 (-0.09, -0.002)
P-value after controlling for GWAS signals                                                                                                                                 0.0384
P-value<2.18e-11 after controlling for GWAS signals?                                                                                                                           No
N variants included in the burden test                                                                                                                                        NaN
N variants (i) included in the burden test; and (ii) tested individually for association with the trait                                                                       NaN
N variants (i) included in burden test; (ii) tested individually for association with the trait; and (iii) with P<0.05                                                        NaN
N variants (i) included in burden test; (ii) tested individually for association with the trait; and (iii) with P<0.001                                                       NaN
N variants (i) included in burden test; (ii) tested individually for association with the trait; and (iii) with P<10-7                                                        NaN
N variants (i) included in burden test; (ii) tested individually for association with the trait; and (iii) with P<2.18x10-11                                                  NaN
GTEx tissues with enhanced expressiona                                                                                                Brain-CerebellarHemisphere,Brain-Cerebellum
UKB detailed trait name                                                                                                               5097_6mm_weak_meridian_left_inst_mean__RINT
Variant flagged as potential low-quality using machine learning                                                                                                                No
DiscovEHR replication trait                                                                                                                                                   NaN
DiscovEHR replication N cases                                                                                                                                                 NaN
DiscovEHR replication N controls                                                                                                                                              NaN
DiscovEHR replication effect                                                                                                                                                  NaN
DiscovEHR replication SE of effect                                                                                                                                            NaN
DiscovEHR replication P-value                                                                                                                                                 NaN
DiscovEHR replication N carriers                                                                                                                                                0
DiscovEHR replication effect directionally-consistent with UKB effect?                                                                                                        NaN
Effect (95% CI), UKB AFR ancestry                                                                                                                           0.193 (-0.286, 0.673)
P-value UKB, AFR ancestry                                                                                                                                                   0.429
Effect direction, UKB AFR ancestry                                                                                                                                              +
Effect direction consistent with EUR, UKB AFR ancestry                                                                                                                         No
N cases with 0|1|2 copies of effect allele, UKB AFR ancestry                                                                                                            3467|17|0
N controls with 0|1|2 copies of effect allele, UKB AFR ancestry                                                                                                          NA|NA|NA
Effect (95% CI), UKB EAS ancestry                                                                                                                          -0.264 (-0.845, 0.316)
P-value UKB, EAS ancestry                                                                                                                                                   0.372
Effect direction, UKB EAS ancestry                                                                                                                                              -
Effect direction consistent with EUR, UKB EAS ancestry                                                                                                                        Yes
N cases with 0|1|2 copies of effect allele, UKB EAS ancestry                                                                                                              642|8|0
N controls with 0|1|2 copies of effect allele, UKB EAS ancestry                                                                                                          NA|NA|NA
Effect (95% CI), UKB SAS ancestry                                                                                                                          -0.093 (-0.388, 0.203)
P-value UKB, SAS ancestry                                                                                                                                                   0.538
Effect direction, UKB SAS ancestry                                                                                                                                              -
Effect direction consistent with EUR, UKB SAS ancestry                                                                                                                        Yes
N cases with 0|1|2 copies of effect allele, UKB SAS ancestry                                                                                                            3900|36|1
N controls with 0|1|2 copies of effect allele, UKB SAS ancestry                                                                                                          NA|NA|NA
Study Accession                                                                                                                                                      GCST90079274

Proposed schema v0

{
"datasourceId": "regeneron_exwas",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"variantId": "1_1303203_G_A",
"variantFunctionalConsequenceId": "SO_0001583", // From 'Marker type'
"studyId": "GCST90083260",
"pValue": 9.250000e-12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106
}
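
To illustrate how the raw columns could feed schema v0, here is a minimal sketch (not the code in the exploration notebook) that splits the "Effect (95% CI)" string into beta and its confidence interval, and turns the "Marker" value into a variantId. The function names are hypothetical, and the sketch assumes every effect string follows the "value (lower, upper)" layout seen in the sample record.

```python
import re

# A signed decimal number, optionally in scientific notation.
NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
# Assumed layout, taken from the sample record: "-0.148 (-0.191, -0.106)".
EFFECT_CI = re.compile(rf"^\s*({NUM})\s*\(\s*({NUM})\s*,\s*({NUM})\s*\)\s*$")


def parse_effect_ci(value):
    """Return (effect, lower, upper) as floats, or None if the value does not match."""
    if not isinstance(value, str):
        return None  # e.g. NaN for missing values
    match = EFFECT_CI.match(value)
    if match is None:
        return None
    effect, lower, upper = (float(group) for group in match.groups())
    return effect, lower, upper


def marker_to_variant_id(marker):
    """Convert a marker such as '1:1303203:G:A' into '1_1303203_G_A'."""
    return marker.replace(":", "_")


beta, ci_lower, ci_upper = parse_effect_ci("-0.148 (-0.191, -0.106)")
print(beta, ci_lower, ci_upper, marker_to_variant_id("1:1303203:G:A"))
```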

Open Questions:

[X] Is the data type genetic_association or do we want to separate these analyses from the rest?
- No. Just a new source for the moment.
[X] I have yet to understand what is the notation in this field for the case of burden analyses. E.g. M1.01. I presume this field would be null for these cases.
- This is related to the MAF of variants included in burden test described in the Burden test nomenclature sheet.
- However, we are filtering out the single point analyses by only keeping those where Marker type == Burden. This removes variant information from the schema (see schema proposal v1)
[ ] Should we drop those variants tagged as low quality?
- This was flagged by an ML algorithm on the basis of: (i) concordance in genotype calls between array and exome sequencing data; (ii) Mendelian inconsistencies in the exome sequencing data; (iii) differences in allele frequencies between exome sequencing batches; (iv) variant loadings on 20 principal components derived from the analysis of variants with a MAF of less than 1%; (v) transmitted singletons.
- TODO: See if this filter is applied in GWASCat, where we do have the variant-level associations.
[X] We have information for several ancestries. How do we track this variability? E.g. we have different effect sizes for different populations. Do we explode the evidence? Note that we might not have consistent effect directions between ancestries.
- No. We collect the set and pick the most significant one.
- But what about the cases where direction of effect is different? Tricky, but still get the effect size for the most significant pValue.
[X] What should we include in studyCases? Carriers of 1 or 2 alleles?
- That doesn't matter. Cases only refer to people with the trait, independently from whether they are a carrier or not. Just sum them.
[ ] Should we use P-value after controlling for GWAS signals?
- Check what is the value used in the publication and use that one.
[ ] Should we use Effect after controlling for GWAS signals (95% CI)?
- Check what is the value used in the publication and use that one.
[X] Other fields of potential interest:
- Ancestry ✅
- Variant effect on protein sequence ❌
- Most significant trait-variant pair for this gene? ❌
- Most significant variant for this gene-trait pair? ❌
- GTEx tissues with enhanced expression? ❌
...
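
Two of the answers above (keep only rows where Marker type == Burden, and sum the 0|1|2 carrier counts to get studyCases) can be sketched as follows. This is only an illustration under assumptions: the helper names are made up, records are plain dicts keyed by the column names of the sample record, and "NA" parts are treated as missing.

```python
def keep_burden(records):
    """Keep only the burden-test rows, dropping single-variant analyses."""
    return [record for record in records if record.get("Marker type") == "Burden"]


def total_from_copies(value):
    """Sum a '0|1|2 copies' count string; return None when every part is NA."""
    parts = [part.strip() for part in value.split("|")]
    counts = [int(part) for part in parts if part.upper() != "NA"]
    return sum(counts) if counts else None


# Counts taken from the sample record shown earlier in this issue.
print(total_from_copies("88038|1684|13"))  # cases, regardless of carrier status
print(total_from_copies("NA|NA|NA"))       # no controls reported
```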

The exploration notebook is available here: https://github.com/ireneisdoomed/random_notebooks/blob/main/exome_data/regeneron/exploration.ipynb

ireneisdoomed commented 2 years ago

Some notes that I took on the data from the GWAS Catalog:

Data (from GWAS Catalog)

587 single point associations extracted from processing the associations table from GWAS Catalog (All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)

👉 Remember top loci have a cut-off of P < 1e-5

DISEASE/TRAIT. Reported UKBB trait.
INITIAL SAMPLE SIZE. All European, but diff sample sizes. Why?
REPLICATION SAMPLE SIZE. All European, but diff sample sizes. Why?
STRONGEST SNP-RISK ALLELE. SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.
- 215 different values. None are unknown.
SNPS. Strongest SNP; if a haplotype it may include more than one rs number
- 215 different values.
MERGED. Denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)
- All 0s.
INTERGENIC. Denotes whether SNP is in intergenic region (0 = no; 1 = yes)
- All 1s.
RISK ALLELE FREQUENCY. Reported risk/effect allele frequency associated with strongest SNP in controls
P-VALUE. Reported p-value for strongest SNP risk allele
PVALUE_MLOG. -log(p-value)
OR or BETA. Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1
- 148 report OR values, 439 Beta
MAPPED_TRAIT. Label of the EFO trait.
MAPPED_TRAIT_URI. 114 different traits.
STUDY ACCESSION. 200 different study accessions.
GENOTYPING TECHNOLOGY. Genome-wide genotyping array, Exome-wide sequencing [UK Biobank/UK BiLEVE Axiom Array]

7972 studies extracted from processing the study table from GWAS Catalog (All studies v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)

Very similar schema to the associations one.
1028 different traits
200 studies where ASSOCIATION COUNT != 0

ireneisdoomed commented 2 years ago

Proposed schema v1

Dropped variant information fields (variantId, variantFunctionalConsequenceId)
Added a field to capture the collapsing model applied. It is always `ADD-WGR-FIRTH`.
Ancestries will collect all the populations on which the association has been observed. We are not convinced about this one, as it is not clear this makes sense in a gene-centric context.
- To account for the different statistical values, we will pick the most significant one.

Added literature and datatypeId fields

{
"datasourceId": "regeneron_exwas",
"datatypeId": "genetic_association",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"studyId": "GCST90083260",
"pValueMantissa": 9.25,
"pValueExponent": -12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106,
"collapsingModel": "ADD-WGR-FIRTH",
"ancestries": ["EUR", "SAS"],
"literature": ["34662886"]
}
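
The move from a single pValue to pValueMantissa and pValueExponent can be sketched as below. This is a minimal illustration, not the pipeline code: the function name is hypothetical, and keeping two decimals in the mantissa is an assumption. A p-value of exactly 0 (as printed in the sample record above) cannot be split and is rejected.

```python
def split_p_value(p_value):
    """Split a p-value into (mantissa, exponent) so that p = mantissa * 10**exponent."""
    if not 0 < p_value <= 1:
        raise ValueError(f"p-value must be in (0, 1], got {p_value!r}")
    # Scientific notation does the normalisation, e.g. 9.25e-12 -> "9.25e-12".
    mantissa, exponent = f"{p_value:.2e}".split("e")
    return float(mantissa), int(exponent)


mantissa, exponent = split_p_value(9.25e-12)
print(mantissa, exponent)
```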

I've also updated some of the open questions posted above based on a discussion with @DSuveges today.

d0choa commented 2 years ago

This is looking so promising already. Some thoughts:

I would change datasourceId to gene_burden or something along these lines. I think the right level of generalisation for the datasource is to cover any type of burden test: not specific to the analysis and not specific to the dataset. The datasource will fall within the genetic_association datatype.
Based on the above, the proposed schema needs to work for the Regeneron and AZ analysis in the short term.
I remember we have a field in ot_genetics_portal that specifies whether the samples come from finngen, ukbb, etc. It would be probably good to reuse it here to say they are all ukbb samples.
The collapsingModel being a string implies that in the AZ data, we will have one evidence per model. I think it might be the right approach, particularly for scoring purposes, but just flagging so it's a conscious decision.
We will need the oddsRatio and the respective confidence intervals for the binary traits.
The number of cases is also relevant. Capturing the total sample size and the number of cases with qualifying variants is probably the minimum, but there might be other relevant numbers.
Regarding ancestries, seems reasonable to capture it if available. The AZ paper in the old data only reported one non-EUR-specific association. HBB is linked to thalassemia in tropical latitudes due to its protection against malaria.
For the particular case of AZ evidence, we would like to have link-outs back to AZphewas. They suggested they might provide an API to resolve the URLs, which at the moment contain some hashes. So I don't think there will be much to do data-wise.

ireneisdoomed commented 2 years ago

Proposed schema V2

Comparison to V1:

ancestry is no longer an array. The population is now a unique evidence identifier, therefore evidence will be exploded. These are mapped to HANCESTRO IDs:
- EUR: HANCESTRO_0005
- EAS: HANCESTRO_0009
- AFR: HANCESTRO_0010
- SAS: HANCESTRO_0006
Added resourceScore, which is extracted from pValues.
Added OR and their confidence intervals.
Deleted collapsingModel in favour of studyId and studyOverview
Deleted studyId and added the GCST ID as a cross reference in urls
Added information on the study sample size.

I've asked Annalisa about her opinion on using Hancestro and she is positive provided that we are consistent on this with other sources.
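
The ancestry mapping and the "explode by population" idea listed above can be sketched as follows. The mapping is copied from the list in this comment; the function name and the shape of the inputs (one dict of statistics per ancestry code) are assumptions made for illustration.

```python
# Population code -> HANCESTRO ID, as listed in this comment.
ANCESTRY_TO_HANCESTRO = {
    "EUR": "HANCESTRO_0005",
    "EAS": "HANCESTRO_0009",
    "AFR": "HANCESTRO_0010",
    "SAS": "HANCESTRO_0006",
}


def explode_by_ancestry(base_evidence, stats_by_ancestry):
    """Return one evidence dict per ancestry, each with its own statistics."""
    evidence = []
    for code, stats in stats_by_ancestry.items():
        evidence.append(
            {**base_evidence, **stats, "ancestry": ANCESTRY_TO_HANCESTRO[code]}
        )
    return evidence


# Effect sizes taken from the sample record (EUR and SAS columns).
for item in explode_by_ancestry(
    {"targetFromSourceId": "ACAP3"},
    {"EUR": {"beta": -0.148}, "SAS": {"beta": -0.093}},
):
    print(item)
```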

{
    "datasourceId": "gene_burden",
    "datatypeId": "genetic_association",
    "targetFromSourceId": "ACAP3",
    "diseaseFromSource": "6mm weak meridian left (5097)",
    "diseaseFromSourceMappedId": "EFO_0004731",
    "pValueMantissa": 9.25,
    "pValueExponent": -12,
    "beta": -0.148,
    "betaConfidenceIntervalLower": -0.191,
    "betaConfidenceIntervalUpper": -0.106,
    "oddsRatio": null,
    "oddsRatioConfidenceIntervalLower": null,
    "oddsRatioConfidenceIntervalUpper": null,
    "resourceScore": 9.250000e-12,
    "ancestry": "HANCESTRO_0005",
    "literature": ["34662886"],
    "publicationYear": 2021,
    "projectId": "REGENERON",
    "cohortId": "UK Biobank",
    "studyId": "ADD-WGR-FIRTH_M3.0001",
    "studyOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
    "studySampleSize": 89735,
    "studyCases": 89735,
    "urls": [
        {
            "url": "https://genetics.opentargets.org/study/GCST90083260",
            "niceName": "GCST90083260"
        }
    ]
}

These changes are a result of David's comments above, the discussion held in the 23/03 Data SU, and the effort to harmonise the schema across the AZ and Regeneron data.

Open questions:

After thinking about it, I think it is not necessary to create a new field to inform about the collapsing model used. We can use the existing studyId for this. What I propose above is to create an ID from the name of the model ADD-WGR-FIRTH + variants inclusion criteria (further described in studyOverview)
- My questions are: 1) if you agree with this approach; 2) if you think the ID should be more complex and include trait information. This is usually the case for other sources (like FinnGen), and studySampleSize refers to the studied trait.
Is it normal that number of controls is 0 for a large part of studies? This is coming from N controls with 0|1|2 copies of effect allele
Is it relevant to have information about the publication? publicationFirstAuthor and publicationYear
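
As a rough sketch of the ID proposed in the first question, the model name and the variant inclusion criteria can be joined with an underscore and split back again. Both function names are hypothetical, and the sketch assumes the inclusion-criteria code (e.g. M3.0001) never contains an underscore itself.

```python
def build_study_id(model, inclusion_criteria):
    """Join the model name and the variant inclusion criteria into one ID."""
    return f"{model}_{inclusion_criteria}"


def split_study_id(study_id):
    """Recover (model, inclusion_criteria) by splitting on the last underscore."""
    model, _, inclusion_criteria = study_id.rpartition("_")
    return model, inclusion_criteria


study_id = build_study_id("ADD-WGR-FIRTH", "M3.0001")
print(study_id, split_study_id(study_id))
```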

This model would currently work with the data provided in the AZ Phewas Portal. I'll provide an example with an analysis of their data in a different issue.

ireneisdoomed commented 2 years ago

Proposed schema v3

In general this version drops the previous idea of referring to the statistical method applied under studyId, leaving this field for the GCST ID for the sake of consistency with other data sources.

Comparison to V2:

Changed name from ancestry to ancestryId
Added ancestry to capture the population code until we can resolve the label from ancestryId
Added statisticalMethod and statisticalMethodOverview to encapsulate the information previously described in studyId
GCST ID is now under studyId
Dropped the publication specific information as suggested above.

{
    "datasourceId": "gene_burden",
    "datatypeId": "genetic_association",
    "targetFromSourceId": "ACAP3",
    "diseaseFromSource": "6mm weak meridian left (5097)",
    "diseaseFromSourceMappedId": "EFO_0004731",
    "pValueMantissa": 9.25,
    "pValueExponent": -12,
    "beta": -0.148,
    "betaConfidenceIntervalLower": -0.191,
    "betaConfidenceIntervalUpper": -0.106,
    "oddsRatio": null,
    "oddsRatioConfidenceIntervalLower": null,
    "oddsRatioConfidenceIntervalUpper": null,
    "resourceScore": 9.250000e-12,
    "ancestry": "EUR",
    "ancestryId": "HANCESTRO_0005",
    "literature": ["34662886"],
    "projectId": "REGENERON",
    "cohortId": "UK Biobank",
    "studyId": "GCST90083260",
    "studySampleSize": 89735,
    "studyCases": 89735,
    "statisticalMethod": "ADD-WGR-FIRTH_M3.0001",
    "statisticalMethodOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
    "urls": null
}

Other than that, I have checked the records where nº of controls = 0. @DSuveges was absolutely right, these are studies where the trait is quantitative, so there are traits for which there are no controls.

I've submitted a PR to the JSON Schema to reflect this data model (#144)
