d0choa opened 4 months ago
If we want to ensure uniqueness of the dataset based on `studyLocusId`, it should be generated from the following fields:

- `variantId`: identifier of the lead variant
- `studyId`: identifier of the study the locus belongs to
- `credibleSetIndex`: as one locus might have multiple credible sets
- `finemappingMethod`: as one locus might be fine-mapped by multiple methods as well

However, the process is not clear: e.g. window-based clumping would generate a study locus with some id, but that id should then be overwritten once a downstream fine-mapping process picks the locus up and fine-maps it.
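The intended derivation could be sketched as a deterministic hash over those four fields. This is a minimal illustration only; the hash function and field separator here are assumptions, not the actual gentropy implementation:

```python
import hashlib

def study_locus_id(variant_id, study_id, credible_set_index, finemapping_method):
    # Hypothetical sketch: concatenate the four identifying fields and hash
    # them, so the id changes whenever any one of them changes (e.g. when a
    # downstream fine-mapping method replaces a window-clumped locus).
    key = "|".join(str(f) for f in (variant_id, study_id,
                                    credible_set_index, finemapping_method))
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

With such a scheme, re-running the same locus under a different `finemappingMethod` yields a different id, which matches the concern above about window-based clumping ids being overwritten.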
@DSuveges could you help me with a copy of the data in JSON format, please?
Some planned changes to data/schema are expected on this front. @DSuveges will list them here when the plan is finalised.
Some updates on the credible set schema:
There's an updated credible set dataset in JSON for the backend team to ingest: gs://ot-team/dsuveges/credible_sets.json

Number of credible sets: 3,019,002

The distribution of credible sets across study types:
+---------+-------+
|studyType| count|
+---------+-------+
| tuqtl| 411170|
| gwas| 876691|
| sqtl| 234066|
| pqtl| 1772|
| eqtl|1495303|
+---------+-------+
Changes:

- `l2g` renamed to `strongestLocus2gene`
- `qtlGeneId` renamed to make it more obvious
- Dropped the `ldSet` column, as we don't really have any plans with them
- `studyLocusId` is cast to strings

Schema:

root
|-- studyId: string (nullable = true)
|-- studyLocusId: string (nullable = true)
|-- variantId: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- position: integer (nullable = true)
|-- region: string (nullable = true)
|-- beta: double (nullable = true)
|-- zScore: double (nullable = true)
|-- pValueMantissa: float (nullable = true)
|-- pValueExponent: integer (nullable = true)
|-- effectAlleleFrequencyFromSource: float (nullable = true)
|-- standardError: double (nullable = true)
|-- subStudyDescription: string (nullable = true)
|-- qualityControls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- finemappingMethod: string (nullable = true)
|-- credibleSetIndex: integer (nullable = true)
|-- credibleSetlog10BF: double (nullable = true)
|-- purityMeanR2: double (nullable = true)
|-- purityMinR2: double (nullable = true)
|-- locusStart: integer (nullable = true)
|-- locusEnd: integer (nullable = true)
|-- sampleSize: integer (nullable = true)
|-- locus: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- is95CredibleSet: boolean (nullable = true)
| | |-- is99CredibleSet: boolean (nullable = true)
| | |-- logBF: double (nullable = true)
| | |-- posteriorProbability: double (nullable = true)
| | |-- variantId: string (nullable = true)
| | |-- pValueMantissa: float (nullable = true)
| | |-- pValueExponent: integer (nullable = true)
| | |-- beta: double (nullable = true)
| | |-- standardError: double (nullable = true)
| | |-- r2Overall: double (nullable = true)
|-- strongestLocus2gene: struct (nullable = true)
| |-- geneId: string (nullable = true)
| |-- score: double (nullable = true)
|-- studyType: string (nullable = true)
|-- traitFromSourceMappedIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qtlGeneId: string (nullable = true)
For clarity, the list of tag variants is truncated in the examples below.
{
"studyId": "GCST90105038",
"studyLocusId": "-9221662644183536443",
"variantId": "3_50557710_C_T",
"chromosome": "3",
"position": 50557710,
"region": null,
"beta": 0.0134335,
"zScore": null,
"pValueMantissa": 6.0,
"pValueExponent": -15,
"effectAlleleFrequencyFromSource": null,
"standardError": null,
"subStudyDescription": null,
"qualityControls": [],
"finemappingMethod": "pics",
"credibleSetIndex": null,
"credibleSetlog10BF": null,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": null,
"posteriorProbability": 0.8281915098204469,
"variantId": "3_50557710_C_T",
"pValueMantissa": null,
"pValueExponent": null,
"beta": null,
"standardError": 0.9999996011812708,
"r2Overall": 1.0000000000000027
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": null,
"posteriorProbability": 0.018635570174451596,
"variantId": "3_50972004_C_T",
"pValueMantissa": null,
"pValueExponent": null,
"beta": null,
"standardError": 0.0387769970112231,
"r2Overall": 0.7735468485352232
}
],
"strongestLocus2gene": {
"geneId": "ENSG00000088538",
"score": 0.7871325016021729
},
"studyType": "gwas",
"traitFromSourceMappedIds": [
"EFO_0011015"
],
"qtlGeneId": null
}
{
"studyId": "FINNGEN_R10_C3_RECTUM_ADENO_MUCINO_EXALLC",
"studyLocusId": "-9131389010760691102",
"variantId": "18_48925503_G_A",
"chromosome": "18",
"position": 48925503,
"region": "chr18:47425503-50425503",
"beta": -0.19892,
"zScore": null,
"pValueMantissa": 2.0810000896453857,
"pValueExponent": -12,
"effectAlleleFrequencyFromSource": 0.49358999729156494,
"standardError": 0.0283,
"subStudyDescription": null,
"qualityControls": null,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"credibleSetlog10BF": 5.91695653245,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 22.2531357731279,
"posteriorProbability": 0.254337321094081,
"variantId": "18_48925503_G_A",
"pValueMantissa": 2.0810000896453857,
"pValueExponent": -12,
"beta": -0.19892,
"standardError": 0.0283,
"r2Overall": null
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 22.1759052599176,
"posteriorProbability": 0.23543406807608,
"variantId": "18_48925435_C_CG",
"pValueMantissa": 2.255000114440918,
"pValueExponent": -12,
"beta": -0.198598,
"standardError": 0.0282994,
"r2Overall": null
}
],
"strongestLocus2gene": {
"geneId": "ENSG00000101665",
"score": 0.9262983798980713
},
"studyType": "gwas",
"traitFromSourceMappedIds": null,
"qtlGeneId": null
}
{
"studyId": "GTEx_esophagus_muscularis_ENSG00000105483.grp_2.upstream.ENST00000391898",
"studyLocusId": "-9223054931290123158",
"variantId": "19_48204408_G_A",
"chromosome": "19",
"position": 48204408,
"region": "chr19:47255946-49255946",
"beta": -0.274482,
"zScore": null,
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"effectAlleleFrequencyFromSource": null,
"standardError": 0.0507531,
"subStudyDescription": null,
"qualityControls": null,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"credibleSetlog10BF": 14.906432151794434,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 12.9898148734311,
"posteriorProbability": 0.143881970498929,
"variantId": "19_48204408_G_A",
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"beta": -0.274482,
"standardError": 0.0507531,
"r2Overall": null
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 12.9898148734311,
"posteriorProbability": 0.143881970498929,
"variantId": "19_48204483_G_A",
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"beta": -0.274482,
"standardError": 0.0507531,
"r2Overall": null
}
],
"strongestLocus2gene": null,
"studyType": "tuqtl",
"traitFromSourceMappedIds": null,
"qtlGeneId": "ENSG00000105483"
}
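Given a record shaped like the examples above, the lead tag variant can be picked out of `locus` by posterior probability. A minimal sketch over plain dicts (the field names follow the schema shown here; the helper name is illustrative):

```python
def lead_tag_variant(credible_set):
    # Return the variantId of the locus element with the highest
    # posteriorProbability, or None if the locus array is empty or null.
    locus = credible_set.get("locus") or []
    if not locus:
        return None
    best = max(locus, key=lambda tag: tag["posteriorProbability"] or 0.0)
    return best["variantId"]

record = {
    "studyLocusId": "-9221662644183536443",
    "locus": [
        {"variantId": "3_50557710_C_T", "posteriorProbability": 0.8281915098204469},
        {"variantId": "3_50972004_C_T", "posteriorProbability": 0.018635570174451596},
    ],
}
```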
@remo87 @jdhayhurst: In this schema there are a handful of identifiers to resolve:

- `variantId` -> resolve in the variant index
- `locus.variantId` -> resolve in the variant index
- `strongestLocus2gene.geneId` -> resolve in the target index
- `qtlGeneId` -> resolve in the target index
- `studyId` -> resolve in the study index

Some notes with @jdhayhurst on the credible set endpoints:
Study page:

- `studyId`

Variant page:

- `locus.variantId` list
- `studyType`: it should be an enum. We need to provide a list with all the qualifying `studyType` values (e.g. ['eqtl', 'pqtl', 'sqtl'])

Disease page:

- `diseaseIds` (pending work from @DSuveges)
- `studyType`: it should be an enum. We need to provide a list with all the qualifying `studyType` values (e.g. ['eqtl', 'pqtl', 'sqtl'])

Root-level endpoint:

I don't think we are missing anything obvious, but it might be good to have a look @addramir & @DSuveges
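The per-page lookups above amount to simple predicates over the dataset. A hypothetical sketch over plain dicts (the helper names are illustrative, not an actual API; the real service would query the backing store):

```python
def for_study_page(credible_sets, study_id):
    # Study page: all credible sets belonging to one study.
    return [cs for cs in credible_sets if cs["studyId"] == study_id]

def for_variant_page(credible_sets, variant_id, study_types=None):
    # Variant page: credible sets whose locus.variantId list contains the
    # variant, optionally restricted to a set of qualifying studyType values
    # (e.g. {'eqtl', 'pqtl', 'sqtl'}).
    out = []
    for cs in credible_sets:
        tags = {t["variantId"] for t in (cs.get("locus") or [])}
        if variant_id in tags and (study_types is None or cs["studyType"] in study_types):
            out.append(cs)
    return out
```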
Update 14/08: Hoping for a first iteration by Friday
@remo87 The JSON formatted data is here: gs://ot-team/dsuveges/credible_sets.json

Schema change: instead of the `traitFromSourceMappedIds` field there's a `diseaseIds` field containing a list of validated EFO IDs, which should be resolvable against the disease index.

Important: although the dataset was generated in a validation step, the flagged credible sets are still in the dataset, except those which don't have a `variantId`.
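The stated rule (flagged credible sets stay; only records lacking a `variantId` are dropped) can be sketched as a one-line filter. Illustrative only, assuming dict-shaped records:

```python
def keep_record(credible_set):
    # Flagged credible sets (non-empty qualityControls) stay in the dataset;
    # only records with a missing/null variantId are excluded.
    return credible_set.get("variantId") is not None
```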
Schema:
root
|-- studyId: string (nullable = true)
|-- studyLocusId: string (nullable = true)
|-- variantId: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- position: integer (nullable = true)
|-- region: string (nullable = true)
|-- beta: double (nullable = true)
|-- zScore: double (nullable = true)
|-- pValueMantissa: float (nullable = true)
|-- pValueExponent: integer (nullable = true)
|-- effectAlleleFrequencyFromSource: float (nullable = true)
|-- standardError: double (nullable = true)
|-- subStudyDescription: string (nullable = true)
|-- qualityControls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- finemappingMethod: string (nullable = true)
|-- credibleSetIndex: integer (nullable = true)
|-- credibleSetlog10BF: double (nullable = true)
|-- purityMeanR2: double (nullable = true)
|-- purityMinR2: double (nullable = true)
|-- locusStart: integer (nullable = true)
|-- locusEnd: integer (nullable = true)
|-- sampleSize: integer (nullable = true)
|-- locus: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- is95CredibleSet: boolean (nullable = true)
| | |-- is99CredibleSet: boolean (nullable = true)
| | |-- logBF: double (nullable = true)
| | |-- posteriorProbability: double (nullable = true)
| | |-- variantId: string (nullable = true)
| | |-- pValueMantissa: float (nullable = true)
| | |-- pValueExponent: integer (nullable = true)
| | |-- beta: double (nullable = true)
| | |-- standardError: double (nullable = true)
| | |-- r2Overall: double (nullable = true)
|-- strongestLocus2gene: struct (nullable = true)
| |-- geneId: string (nullable = true)
| |-- score: double (nullable = true)
|-- studyType: string (nullable = true)
|-- diseaseIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qtlGeneId: string (nullable = true)
`studyTypes`:

For the enum, these are the current values of study types: `tuqtl`, `gwas`, `sqtl`, `pqtl`, `eqtl`.

The distribution of the labels:
+---------+-------+
|studyType| count|
+---------+-------+
| tuqtl| 411170|
| gwas| 793759|
| sqtl| 234066|
| pqtl| 1772|
| eqtl|1495303|
+---------+-------+
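For the backend, the enum could be modelled directly from these five observed values. A sketch only; the class and member naming are assumptions:

```python
from enum import Enum

class StudyType(str, Enum):
    # The five studyType values currently present in the dataset.
    GWAS = "gwas"
    EQTL = "eqtl"
    PQTL = "pqtl"
    SQTL = "sqtl"
    TUQTL = "tuqtl"
```

The `str` mixin lets enum members compare equal to the raw strings coming out of the JSON records.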
Discussed on Slack: we should change the field name `credibleSetId` (all occurrences) to `studyLocusId` in the credibleSet API to avoid confusion (especially for FE development).
Based on discussion with @d0choa and @jdhayhurst: as part of planning evidence and l2g integration, some updates are proposed for the credible set data ingestion and modelling:

- grouping on `studyLocusId` and pulling the highest l2g score
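Pulling the highest l2g score per `studyLocusId` could look like the following. This is a sketch over plain dicts (the real pipeline runs in Spark; the row shape here is an assumption):

```python
def highest_l2g(l2g_rows):
    # Group l2g predictions by studyLocusId and keep, for each group,
    # the (geneId, score) pair with the maximum score.
    best = {}
    for row in l2g_rows:
        key = row["studyLocusId"]
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = {"geneId": row["geneId"], "score": row["score"]}
    return best
```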
Similar to the work on the variant index (#3350) and study index (#3357), we would like to serve a `credible_set` dataset through OS + API. This dataset is created by different upstream ETL processes (gentropy), but they all write to the exact same location and share a schema. So effectively, they can be considered one single dataset that we would like to load.
Note that these datasets are currently pending validation, which will be implemented in the gentropy layer. Records are expected to have:

- `studyLocusId`
- `variantId` and `locus.variantId` available in the variant index
- `studyId` available in the study index

(This has been discussed with @DSuveges but there is no ticket yet.)
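Those expectations could be checked with a small per-record validation pass. Illustrative sketch only; plain sets stand in for the real variant and study indices:

```python
def validate_credible_set(cs, variant_index, study_index):
    # Return a list of failed expectations for one record: the referenced
    # lead variant, tag variants, and study must all resolve in the indices.
    problems = []
    if cs.get("variantId") not in variant_index:
        problems.append("variantId not in variant index")
    for tag in cs.get("locus") or []:
        if tag["variantId"] not in variant_index:
            problems.append(f"locus.variantId {tag['variantId']} not in variant index")
    if cs.get("studyId") not in study_index:
        problems.append("studyId not in study index")
    return problems
```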
The credible sets use the `study_locus` as the reference schema.

Some stats (they might float in future iterations, but not significantly):

- Parquet size: 4.1 GiB
- Rows (credible sets): 2,982,370
I'm making some comments in the next schema:
We want a top-level `credible_set` endpoint, but this dataset will need to be queried in additional ways, which we can define later.