d0choa opened 4 months ago
If we want to ensure uniqueness of the dataset based on `studyLocusId`, it should be generated from the following fields:

- `variantId`: identifier of the lead variant
- `studyId`: identifier of the study the locus belongs to
- `credibleSetIndex`: as one locus might have multiple credible sets
- `finemappingMethod`: as one locus might be fine-mapped by multiple methods as well

However, the process is not clear: e.g. window-based clumping would generate a study locus with some id, but that id should then be overwritten once a downstream fine-mapping process picks the locus up and fine-maps it.
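The intended derivation could be sketched as a deterministic hash over those four fields. This is a minimal illustration only; the hash function and field separator here are assumptions, not the actual gentropy implementation:

```python
import hashlib

def study_locus_id(variant_id, study_id, credible_set_index, finemapping_method):
    # Hypothetical sketch: concatenate the four identifying fields and hash
    # them, so the id changes whenever any one of them changes (e.g. when a
    # downstream fine-mapping method replaces a window-clumped locus).
    key = "|".join(str(f) for f in (variant_id, study_id,
                                    credible_set_index, finemapping_method))
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

With such a scheme, re-running the same locus under a different `finemappingMethod` yields a different id, which matches the concern above about window-based clumping ids being overwritten.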
@DSuveges could you help me with a copy of the data in JSON format, please?
Some planned changes to data/schema are expected on this front. @DSuveges will list them here when the plan is finalised.
Some updates on the credible set schema:
There's an updated credible set dataset in JSON for the backend team to ingest: gs://ot-team/dsuveges/credible_sets.json

Number of credible sets: 3,019,002

The distribution of credible sets across study types:
+---------+-------+
|studyType| count|
+---------+-------+
| tuqtl| 411170|
| gwas| 876691|
| sqtl| 234066|
| pqtl| 1772|
| eqtl|1495303|
+---------+-------+
Changes:

- `l2g` renamed to `strongestLocus2gene`
- `qtlGeneId` renamed to make it more obvious
- Dropped the `ldSet` column, as we don't really have any plans with them
- `studyLocusId` is cast to strings

Schema:

root
|-- studyId: string (nullable = true)
|-- studyLocusId: string (nullable = true)
|-- variantId: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- position: integer (nullable = true)
|-- region: string (nullable = true)
|-- beta: double (nullable = true)
|-- zScore: double (nullable = true)
|-- pValueMantissa: float (nullable = true)
|-- pValueExponent: integer (nullable = true)
|-- effectAlleleFrequencyFromSource: float (nullable = true)
|-- standardError: double (nullable = true)
|-- subStudyDescription: string (nullable = true)
|-- qualityControls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- finemappingMethod: string (nullable = true)
|-- credibleSetIndex: integer (nullable = true)
|-- credibleSetlog10BF: double (nullable = true)
|-- purityMeanR2: double (nullable = true)
|-- purityMinR2: double (nullable = true)
|-- locusStart: integer (nullable = true)
|-- locusEnd: integer (nullable = true)
|-- sampleSize: integer (nullable = true)
|-- locus: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- is95CredibleSet: boolean (nullable = true)
| | |-- is99CredibleSet: boolean (nullable = true)
| | |-- logBF: double (nullable = true)
| | |-- posteriorProbability: double (nullable = true)
| | |-- variantId: string (nullable = true)
| | |-- pValueMantissa: float (nullable = true)
| | |-- pValueExponent: integer (nullable = true)
| | |-- beta: double (nullable = true)
| | |-- standardError: double (nullable = true)
| | |-- r2Overall: double (nullable = true)
|-- strongestLocus2gene: struct (nullable = true)
| |-- geneId: string (nullable = true)
| |-- score: double (nullable = true)
|-- studyType: string (nullable = true)
|-- traitFromSourceMappedIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qtlGeneId: string (nullable = true)
For clarity, the list of tag variants is truncated in the examples below.
{
"studyId": "GCST90105038",
"studyLocusId": "-9221662644183536443",
"variantId": "3_50557710_C_T",
"chromosome": "3",
"position": 50557710,
"region": null,
"beta": 0.0134335,
"zScore": null,
"pValueMantissa": 6.0,
"pValueExponent": -15,
"effectAlleleFrequencyFromSource": null,
"standardError": null,
"subStudyDescription": null,
"qualityControls": [],
"finemappingMethod": "pics",
"credibleSetIndex": null,
"credibleSetlog10BF": null,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": null,
"posteriorProbability": 0.8281915098204469,
"variantId": "3_50557710_C_T",
"pValueMantissa": null,
"pValueExponent": null,
"beta": null,
"standardError": 0.9999996011812708,
"r2Overall": 1.0000000000000027
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": null,
"posteriorProbability": 0.018635570174451596,
"variantId": "3_50972004_C_T",
"pValueMantissa": null,
"pValueExponent": null,
"beta": null,
"standardError": 0.0387769970112231,
"r2Overall": 0.7735468485352232
}
],
"strongestLocus2gene": {
"geneId": "ENSG00000088538",
"score": 0.7871325016021729
},
"studyType": "gwas",
"traitFromSourceMappedIds": [
"EFO_0011015"
],
"qtlGeneId": null
}
{
"studyId": "FINNGEN_R10_C3_RECTUM_ADENO_MUCINO_EXALLC",
"studyLocusId": "-9131389010760691102",
"variantId": "18_48925503_G_A",
"chromosome": "18",
"position": 48925503,
"region": "chr18:47425503-50425503",
"beta": -0.19892,
"zScore": null,
"pValueMantissa": 2.0810000896453857,
"pValueExponent": -12,
"effectAlleleFrequencyFromSource": 0.49358999729156494,
"standardError": 0.0283,
"subStudyDescription": null,
"qualityControls": null,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"credibleSetlog10BF": 5.91695653245,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 22.2531357731279,
"posteriorProbability": 0.254337321094081,
"variantId": "18_48925503_G_A",
"pValueMantissa": 2.0810000896453857,
"pValueExponent": -12,
"beta": -0.19892,
"standardError": 0.0283,
"r2Overall": null
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 22.1759052599176,
"posteriorProbability": 0.23543406807608,
"variantId": "18_48925435_C_CG",
"pValueMantissa": 2.255000114440918,
"pValueExponent": -12,
"beta": -0.198598,
"standardError": 0.0282994,
"r2Overall": null
}
],
"strongestLocus2gene": {
"geneId": "ENSG00000101665",
"score": 0.9262983798980713
},
"studyType": "gwas",
"traitFromSourceMappedIds": null,
"qtlGeneId": null
}
{
"studyId": "GTEx_esophagus_muscularis_ENSG00000105483.grp_2.upstream.ENST00000391898",
"studyLocusId": "-9223054931290123158",
"variantId": "19_48204408_G_A",
"chromosome": "19",
"position": 48204408,
"region": "chr19:47255946-49255946",
"beta": -0.274482,
"zScore": null,
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"effectAlleleFrequencyFromSource": null,
"standardError": 0.0507531,
"subStudyDescription": null,
"qualityControls": null,
"finemappingMethod": "SuSie",
"credibleSetIndex": 1,
"credibleSetlog10BF": 14.906432151794434,
"purityMeanR2": null,
"purityMinR2": null,
"locusStart": null,
"locusEnd": null,
"sampleSize": null,
"locus": [
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 12.9898148734311,
"posteriorProbability": 0.143881970498929,
"variantId": "19_48204408_G_A",
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"beta": -0.274482,
"standardError": 0.0507531,
"r2Overall": null
},
{
"is95CredibleSet": true,
"is99CredibleSet": true,
"logBF": 12.9898148734311,
"posteriorProbability": 0.143881970498929,
"variantId": "19_48204483_G_A",
"pValueMantissa": 1.0399999618530273,
"pValueExponent": -7,
"beta": -0.274482,
"standardError": 0.0507531,
"r2Overall": null
}
],
"strongestLocus2gene": null,
"studyType": "tuqtl",
"traitFromSourceMappedIds": null,
"qtlGeneId": "ENSG00000105483"
}
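Given a record shaped like the examples above, the lead tag variant can be picked out of `locus` by posterior probability. A minimal sketch over plain dicts (the field names follow the schema shown here; the helper name is illustrative):

```python
def lead_tag_variant(credible_set):
    # Return the variantId of the locus element with the highest
    # posteriorProbability, or None if the locus array is empty or null.
    locus = credible_set.get("locus") or []
    if not locus:
        return None
    best = max(locus, key=lambda tag: tag["posteriorProbability"] or 0.0)
    return best["variantId"]

record = {
    "studyLocusId": "-9221662644183536443",
    "locus": [
        {"variantId": "3_50557710_C_T", "posteriorProbability": 0.8281915098204469},
        {"variantId": "3_50972004_C_T", "posteriorProbability": 0.018635570174451596},
    ],
}
```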
@remo87 @jdhayhurst: In this schema there are a handful of identifiers to resolve:

- `variantId` -> resolve in the variant index
- `locus.variantId` -> resolve in the variant index
- `strongestLocus2gene.geneId` -> resolve in the target index
- `qtlGeneId` -> resolve in the target index
- `studyId` -> resolve in the study index

Some notes with @jdhayhurst on the credible set endpoints:
Study page:

- `studyId`

Variant page:

- `locus.variantId` list
- `studyType`: it should be an enum. We need to provide a list with all the qualifying `studyType` values (e.g. ['eqtl', 'pqtl', 'sqtl'])

Disease page:

- `diseaseIds` (pending work from @DSuveges)
- `studyType`: it should be an enum. We need to provide a list with all the qualifying `studyType` values (e.g. ['eqtl', 'pqtl', 'sqtl'])

Root-level endpoint:

I don't think we are missing anything obvious, but it might be good to have a look @addramir & @DSuveges
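The per-page lookups above amount to simple predicates over the dataset. A hypothetical sketch over plain dicts (the helper names are illustrative, not an actual API; the real service would query the backing store):

```python
def for_study_page(credible_sets, study_id):
    # Study page: all credible sets belonging to one study.
    return [cs for cs in credible_sets if cs["studyId"] == study_id]

def for_variant_page(credible_sets, variant_id, study_types=None):
    # Variant page: credible sets whose locus.variantId list contains the
    # variant, optionally restricted to a set of qualifying studyType values
    # (e.g. {'eqtl', 'pqtl', 'sqtl'}).
    out = []
    for cs in credible_sets:
        tags = {t["variantId"] for t in (cs.get("locus") or [])}
        if variant_id in tags and (study_types is None or cs["studyType"] in study_types):
            out.append(cs)
    return out
```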
Update 14/08: Hoping for a first iteration by Friday
@remo87 The JSON formatted data is here: gs://ot-team/dsuveges/credible_sets.json

Schema change: instead of the `traitFromSourceMappedIds` field there's a `diseaseIds` field containing a list of validated EFO IDs, which should be resolvable against the disease index.

Important: although the dataset was generated in a validation step, the flagged credible sets are still in the dataset, except those which don't have a `variantId`.
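The stated rule (flagged credible sets stay; only records lacking a `variantId` are dropped) can be sketched as a one-line filter. Illustrative only, assuming dict-shaped records:

```python
def keep_record(credible_set):
    # Flagged credible sets (non-empty qualityControls) stay in the dataset;
    # only records with a missing/null variantId are excluded.
    return credible_set.get("variantId") is not None
```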
Schema:
root
|-- studyId: string (nullable = true)
|-- studyLocusId: string (nullable = true)
|-- variantId: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- position: integer (nullable = true)
|-- region: string (nullable = true)
|-- beta: double (nullable = true)
|-- zScore: double (nullable = true)
|-- pValueMantissa: float (nullable = true)
|-- pValueExponent: integer (nullable = true)
|-- effectAlleleFrequencyFromSource: float (nullable = true)
|-- standardError: double (nullable = true)
|-- subStudyDescription: string (nullable = true)
|-- qualityControls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- finemappingMethod: string (nullable = true)
|-- credibleSetIndex: integer (nullable = true)
|-- credibleSetlog10BF: double (nullable = true)
|-- purityMeanR2: double (nullable = true)
|-- purityMinR2: double (nullable = true)
|-- locusStart: integer (nullable = true)
|-- locusEnd: integer (nullable = true)
|-- sampleSize: integer (nullable = true)
|-- locus: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- is95CredibleSet: boolean (nullable = true)
| | |-- is99CredibleSet: boolean (nullable = true)
| | |-- logBF: double (nullable = true)
| | |-- posteriorProbability: double (nullable = true)
| | |-- variantId: string (nullable = true)
| | |-- pValueMantissa: float (nullable = true)
| | |-- pValueExponent: integer (nullable = true)
| | |-- beta: double (nullable = true)
| | |-- standardError: double (nullable = true)
| | |-- r2Overall: double (nullable = true)
|-- strongestLocus2gene: struct (nullable = true)
| |-- geneId: string (nullable = true)
| |-- score: double (nullable = true)
|-- studyType: string (nullable = true)
|-- diseaseIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qtlGeneId: string (nullable = true)
`studyTypes`:

For the enum, these are the current values of study types: `tuqtl`, `gwas`, `sqtl`, `pqtl`, `eqtl`.

The distribution of the labels:
+---------+-------+
|studyType| count|
+---------+-------+
| tuqtl| 411170|
| gwas| 793759|
| sqtl| 234066|
| pqtl| 1772|
| eqtl|1495303|
+---------+-------+
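For the backend, the enum could be modelled directly from these five observed values. A sketch only; the class and member naming are assumptions:

```python
from enum import Enum

class StudyType(str, Enum):
    # The five studyType values currently present in the dataset.
    GWAS = "gwas"
    EQTL = "eqtl"
    PQTL = "pqtl"
    SQTL = "sqtl"
    TUQTL = "tuqtl"
```

The `str` mixin lets enum members compare equal to the raw strings coming out of the JSON records.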
Discussed on Slack: we should change the field name `credibleSetId` (all occurrences) to `studyLocusId` in the credibleSet API to avoid confusion (especially for FE development).
Based on discussion with @d0choa and @jdhayhurst: as part of planning evidence and l2g integration, some updates are proposed for the credible set data ingestion and modelling:

- grouping on `studyLocusId` and pulling the highest l2g score
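Pulling the highest l2g score per `studyLocusId` could look like the following. This is a sketch over plain dicts (the real pipeline runs in Spark; the row shape here is an assumption):

```python
def highest_l2g(l2g_rows):
    # Group l2g predictions by studyLocusId and keep, for each group,
    # the (geneId, score) pair with the maximum score.
    best = {}
    for row in l2g_rows:
        key = row["studyLocusId"]
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = {"geneId": row["geneId"], "score": row["score"]}
    return best
```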
Similar to the work on the variant index (#3350) and study index (#3357), we would like to serve a `credible_set` dataset through OS + API. This dataset is created by different upstream ETL processes (gentropy), but they all write to the exact same location and share a schema. So effectively, they can be considered one single dataset that we would like to load.
Note that these datasets are currently pending validation, which will be implemented in the gentropy layer. Records are expected to have:

- `studyLocusId`
- `variantId` and `locus.variantId` available in the variant index
- `studyId` available in the study index

(This has been discussed with @DSuveges but there is no ticket yet.)
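Those expectations could be checked with a small per-record validation pass. Illustrative sketch only; plain sets stand in for the real variant and study indices:

```python
def validate_credible_set(cs, variant_index, study_index):
    # Return a list of failed expectations for one record: the referenced
    # lead variant, tag variants, and study must all resolve in the indices.
    problems = []
    if cs.get("variantId") not in variant_index:
        problems.append("variantId not in variant index")
    for tag in cs.get("locus") or []:
        if tag["variantId"] not in variant_index:
            problems.append(f"locus.variantId {tag['variantId']} not in variant index")
    if cs.get("studyId") not in study_index:
        problems.append("studyId not in study index")
    return problems
```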
The credible sets use the `study_locus` as the reference schema.

Some stats (they might float in future iterations, but not significantly):

- Parquet size: 4.1 GiB
- Rows (credible sets): 2,982,370
I'm making some comments in the next schema:
We want a top-level `credible_set` endpoint, but this dataset will need to be queried in additional ways, which we can define later.