Some notes that I took on the data from the GWAS Catalog:
587 single point associations extracted from processing the associations table from GWAS Catalog (All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)
👉 Remember top loci have a cut-off of P < 1E-5
7972 studies extracted from processing the study table from GWAS Catalog (All studies v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology)
ASSOCIATION COUNT != 0
(`variantId`, `variantFunctionalConsequenceId`)
`literature` and `datatypeId` fields
{
"datasourceId": "regeneron_exwas",
"datatypeId": "genetic_association",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"studyId": "GCST90083260",
"pValueMantissa": 9.25,
"pValueExponent": -12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106,
"collapsingModel": "ADD-WGR-FIRTH",
"ancestries": ["EUR", "SAS"],
"literature": ["34662886"]
}
I've also updated some of the open questions posted above based on a discussion with @DSuveges today.
This is looking so promising already. Some thoughts:
- `datasourceId` to `gene_burden` or something along these lines. I think the right level of generalisation for the datasource is to do it for any type of burden test: not specific to the analysis and not specific to the dataset. The datasource will fall within the `genetic_association` datatype.
- `ot_genetics_portal` that specifies whether the samples come from finngen, ukbb, etc. It would probably be good to reuse it here to say they are all ukbb samples.
- `collapsingModel` being a string implies that in the AZ data, we will have one evidence per model. I think it might be the right approach, particularly for scoring purposes, but just flagging so it's a conscious decision.
- `oddsRatio` and the respective confidence intervals for the binary traits.
Comparison to V1:
- `ancestry` is no longer an array. The population is now a unique evidence identifier, therefore evidence will be exploded. These are mapped to HANCESTRO IDs.
- `resourceScore`, which is extracted from pValues.
- `collapsingModel` in favour of `studyId` and `studyOverview`
- `studyId` and added the GCST ID as a cross reference in `urls`
I've asked Annalisa about her opinion on using Hancestro and she is positive provided that we are consistent on this with other sources.
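The "explode the evidence by population" idea can be sketched as below. This is a minimal illustration and not the actual pipeline code: the function name and the mapping dictionary are hypothetical, and the only code-to-ID pair included is the one that appears in the examples in this thread (EUR to HANCESTRO_0009). Any other ancestry code is left unmapped rather than guessed.

```python
from copy import deepcopy

# Assumption: only the EUR -> HANCESTRO_0009 pair is taken from the examples
# in this thread; other codes are deliberately left unmapped.
ANCESTRY_TO_HANCESTRO = {"EUR": "HANCESTRO_0009"}


def explode_by_ancestry(evidence: dict) -> list:
    """Return one evidence record per entry in the V1 `ancestries` array."""
    exploded = []
    for code in evidence.get("ancestries", []):
        # Copy every field except the array we are exploding.
        record = {k: deepcopy(v) for k, v in evidence.items() if k != "ancestries"}
        # `ancestry` becomes a single value; None when the code has no mapping.
        record["ancestry"] = ANCESTRY_TO_HANCESTRO.get(code)
        exploded.append(record)
    return exploded


v1_evidence = {
    "targetFromSourceId": "ACAP3",
    "ancestries": ["EUR", "SAS"],
    "literature": ["34662886"],
}
records = explode_by_ancestry(v1_evidence)
```

In the real data the effect sizes differ per population, so each exploded record would also carry its own statistics; the sketch only shows the change in shape from an array to a single value.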
{
"datasourceId": "gene_burden",
"datatypeId": "genetic_association",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"pValueMantissa": 9.25,
"pValueExponent": -12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106,
"oddsRatio": null,
"oddsRatioConfidenceIntervalLower": null,
"oddsRatioConfidenceIntervalUpper": null,
"resourceScore": 9.250000e-12,
"ancestry": "HANCESTRO_0009",
"literature": ["34662886"],
"publicationYear": 2021,
"projectId": "REGENERON",
"cohortId": "UK Biobank",
"studyId": "ADD-WGR-FIRTH_M3.0001",
"studyOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
"studySampleSize": 89735,
"studyCases": 89735,
"urls": [
{
"url": "https://genetics.opentargets.org/study/GCST90083260",
"niceName": "GCST90083260"
}
]
}
These are a result of David's above comments, the discussion held in the 23/03 Data SU, and the effort in harmonising schema across AZ and Regeneron data
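For reference, the relationship between `pValueMantissa`, `pValueExponent` and `resourceScore` in the example above can be sketched as follows. The helper names are hypothetical, and rounding the mantissa to three decimals is an assumption made for illustration only.

```python
import math


def resource_score(mantissa: float, exponent: int) -> float:
    """Rebuild the p-value from its mantissa and base-10 exponent."""
    # Parsing the scientific-notation string avoids extra rounding error.
    return float(f"{mantissa}e{exponent}")


def split_p_value(p_value: float) -> tuple:
    """Split a p-value into (mantissa, exponent) with 1 <= mantissa < 10."""
    exponent = math.floor(math.log10(p_value))
    mantissa = round(p_value / 10 ** exponent, 3)  # 3 decimals: an assumption
    if mantissa >= 10:  # rounding can push e.g. 9.9996 up to 10.0
        mantissa /= 10
        exponent += 1
    return mantissa, exponent


score = resource_score(9.25, -12)
parts = split_p_value(score)
```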
Open questions:
- `studyId` for this. What I propose above is to create an ID from the name of the model `ADD-WGR-FIRTH` + variants inclusion criteria (further described in `studyOverview`)
- `studySampleSize` refers to the studied trait.
- N controls with 0|1|2 copies of effect allele
- `publicationFirstAuthor` and `publicationYear`
This model would currently work with the data provided in the AZ PheWAS Portal; I'll provide an example with an analysis of their data in a different issue.
In general, this version drops the previous idea of referring to the statistical method applied under `studyId`, leaving this field for the GCST ID for the sake of consistency with other data sources.
Comparison to V2:
- `ancestry` to `ancestryId`
- `ancestry` to capture the code related to the population while we do not yet resolve the label from `ancestryId`
- `statisticalMethod` and `statisticalMethodOverview` to encapsulate the information previously described in `studyId`
{
"datasourceId": "gene_burden",
"datatypeId": "genetic_association",
"targetFromSourceId": "ACAP3",
"diseaseFromSource": "6mm weak meridian left (5097)",
"diseaseFromSourceMappedId": "EFO_0004731",
"pValueMantissa": 9.25,
"pValueExponent": -12,
"beta": -0.148,
"betaConfidenceIntervalLower": -0.191,
"betaConfidenceIntervalUpper": -0.106,
"oddsRatio": null,
"oddsRatioConfidenceIntervalLower": null,
"oddsRatioConfidenceIntervalUpper": null,
"resourceScore": 9.250000e-12,
"ancestry": "EUR",
"ancestryId": "HANCESTRO_0009",
"literature": ["34662886"],
"projectId": "REGENERON",
"cohortId": "UK Biobank",
"studyId": "GCST90083260",
"studySampleSize": 89735,
"studyCases": 89735,
"statisticalMethod": "ADD-WGR-FIRTH_M3.0001",
"statisticalMethodOverview": "Burden test carried out with pLOFs and deleterious missense with a MAF smaller than 0.001%",
"urls": null
}
Other than that, I have checked the records where the number of controls is 0. @DSuveges was absolutely right: these are studies where the trait is quantitative, so there are no controls.
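One way to read this together with the `beta` / `oddsRatio` split in the schema is sketched below. It assumes that a record with zero controls is a quantitative trait (as observed in this dataset) and therefore fills the `beta*` fields, while anything else fills the `oddsRatio*` fields. The function name and signature are hypothetical.

```python
def effect_fields(n_controls: int, effect: float, ci_lower: float, ci_upper: float) -> dict:
    """Fill beta* for quantitative traits and oddsRatio* for binary traits."""
    # Assumption: zero controls marks a quantitative trait in this dataset.
    used_prefix = "beta" if n_controls == 0 else "oddsRatio"
    fields = {}
    for prefix in ("beta", "oddsRatio"):
        used = prefix == used_prefix
        # The unused family of fields is kept in the record but set to None.
        fields[prefix] = effect if used else None
        fields[f"{prefix}ConfidenceIntervalLower"] = ci_lower if used else None
        fields[f"{prefix}ConfidenceIntervalUpper"] = ci_upper if used else None
    return fields


# Values taken from the example evidence in this thread.
quantitative = effect_fields(0, -0.148, -0.191, -0.106)
```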
I've submitted a PR to the JSON Schema to reflect this data model (#144)
Background
We want to explore extracting gene/disease associations from the exome sequencing analyses carried out by Regeneron Genetics Center.
Their work is collected in their recent publication.
Data (from supplementary data)
Significant trait associations with rare variants are reported in Supplementary Data 2 and 3.
Summary statistics for the rare variants tested in this study are also available in the GWAS Catalog (accession IDs are in Supplementary Data 4 and are listed separately for single variants and burden tests).
There are 2 key differences between what we get from GWAS Catalog and what we can extract from the publication:
To extract the gene level associations I will therefore have to use the raw data from their Supplementary Data 2 (available here), which is really comprehensive. Some metrics:
Proposed schema v0
Open Questions:
- [X] Is the data type `genetic_association` or do we want to separate these analyses from the rest?
- [X] I have yet to understand what the notation in this field is for the case of burden analyses, e.g. `M1.01`. I presume this field would be null for these cases.
  - `Burden test nomenclature` sheet.
  - `Marker type == Burden`. This removes variant information from the schema (see schema proposal v1)
- [ ] Should we drop those variants tagged as low quality?
- [X] We have information for several ancestries. How do we track this variability? E.g. we have different effect sizes for different populations. Do we explode the evidence? Note that we might not have consistent effect directions between ancestries.
- [X] What should we include in `studyCases`? Carriers of 1 or 2 alleles?
- [ ] Should we use `P-value after controlling for GWAS signals`?
- [ ] Should we use `Effect after controlling for GWAS signals (95% CI)`?
- [X] Other fields of potential interest:
...
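The `Marker type == Burden` filter mentioned in the list above could look roughly like this. It is a sketch under assumptions: the inline CSV is a made-up stand-in for Supplementary Data 2, and only the `Marker type` column name and the `Burden` value come from this thread; the other column names and values are hypothetical.

```python
import csv
import io

# Hypothetical excerpt standing in for Supplementary Data 2.
RAW = """Gene,Marker,Marker type
GENE_A,M1.01,Burden
GENE_B,1:12345:A:G,Single variant
"""


def burden_rows(handle) -> list:
    """Keep only the gene-level (burden) associations."""
    return [row for row in csv.DictReader(handle) if row["Marker type"] == "Burden"]


rows = burden_rows(io.StringIO(RAW))
```

Filtering at this stage means no variant-level fields need to be carried into the evidence.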
The exploration notebook is available here: https://github.com/ireneisdoomed/random_notebooks/blob/main/exome_data/regeneron/exploration.ipynb