opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

`credible set` data for backend integration #3379

Open d0choa opened 4 months ago

d0choa commented 4 months ago

Similar to the work on variant index (#3350) and study index (#3357), we would like to serve a credible_set dataset through OS + API.

This dataset is created by different upstream ETL processes (gentropy), but they all write to the same location and share a schema. So effectively, they can be considered a single dataset that we would like to load:

❯ gsutil ls gs://genetics_etl_python_playground/releases/24.06/credible_set/
gs://genetics_etl_python_playground/releases/24.06/credible_set/eqtl_catalogue_susie/
gs://genetics_etl_python_playground/releases/24.06/credible_set/finngen_susie/
gs://genetics_etl_python_playground/releases/24.06/credible_set/gwas_catalog_PICSed_curated_associations/
gs://genetics_etl_python_playground/releases/24.06/credible_set/gwas_catalog_PICSed_summary_statistics/
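
Since the per-source outputs share a schema, a minimal PySpark sketch (Spark session and GCS connector configuration assumed) can read the whole prefix back as one dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The glob picks up every per-source subdirectory listed above; because the
# outputs share a schema, Spark reads them back as a single DataFrame.
credible_sets = spark.read.parquet(
    "gs://genetics_etl_python_playground/releases/24.06/credible_set/*"
)
credible_sets.printSchema()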

Note that these datasets are currently pending validation, which will be implemented in the gentropy layer. Records are expected to have:

(This has been discussed with @DSuveges, but there is no ticket yet.)

The credible sets use study_locus as the reference schema.

Some stats (they might fluctuate in future iterations, but not significantly):

Parquet size: 4.1 GiB
Rows (credible sets): 2,982,370

I'm adding some comments inline in the schema below:

root
 |-- variantId: string (nullable = true) => Resolve using new variant index (#3350)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- studyId: string (nullable = true) => Resolve using new study index (#3357)
 |-- beta: double (nullable = true)
 |-- pValueMantissa: float (nullable = true)
 |-- pValueExponent: integer (nullable = true)
 |-- standardError: double (nullable = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetIndex: integer (nullable = true)
 |-- locus: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantId: string (nullable = true) => Resolve using new variant index (#3350)
 |    |    |-- posteriorProbability: double (nullable = true)
 |    |    |-- pValueMantissa: float (nullable = true)
 |    |    |-- pValueExponent: integer (nullable = true)
 |    |    |-- logBF: double (nullable = true)
 |    |    |-- beta: double (nullable = true)
 |    |    |-- standardError: double (nullable = true)
 |    |    |-- is95CredibleSet: boolean (nullable = true)
 |    |    |-- is99CredibleSet: boolean (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)
 |-- studyLocusId: long (nullable = true) => This will uniquely identify records
 |-- credibleSetlog10BF: double (nullable = true)
 |-- effectAlleleFrequencyFromSource: float (nullable = true)
 |-- zScore: double (nullable = true)
 |-- subStudyDescription: string (nullable = true)
 |-- qualityControls: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- purityMeanR2: double (nullable = true)
 |-- purityMinR2: double (nullable = true)
 |-- sampleSize: integer (nullable = true)
 |-- ldSet: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tagVariantId: string (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)
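
As a rough PySpark sketch of the ID resolution annotated above, assuming the variant index (#3350) and study index (#3357) expose variantId and studyId columns (the paths below are placeholders, not the backend implementation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; the real locations are defined by the release layout.
release = "gs://genetics_etl_python_playground/releases/24.06"
credible_sets = spark.read.parquet(f"{release}/credible_set/*")
variant_index = spark.read.parquet(f"{release}/variant_index")  # see #3350
study_index = spark.read.parquet(f"{release}/study_index")      # see #3357

# Lead variant and study are resolved with left joins on the annotated IDs;
# tag variants inside `locus` would need an explode + join in the same way.
resolved = (
    credible_sets
    .join(variant_index, on="variantId", how="left")
    .join(study_index, on="studyId", how="left")
)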

We want a top-level credible_set endpoint, but this dataset will need to be queried in additional ways, which we can define later.

DSuveges commented 4 months ago

If we want to ensure uniqueness of the dataset based on studyLocusId, it should be generated from the following fields:

However, the process is not clear. For example, window-based clumping would generate a study locus with some ID, but that ID should then be overwritten once a downstream fine-mapping process picks up the locus and performs the fine-mapping.
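
The exact field list isn't captured in this thread, but as a hypothetical sketch of one way a deterministic ID could be derived (hashing studyId and variantId here purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("GCST90105038", "3_50557710_C_T")], ["studyId", "variantId"]
)

# Hypothetical: hashing a fixed concatenation of identifying fields makes the
# ID reproducible, while any step that changes those fields (e.g. downstream
# fine-mapping replacing the lead variant) yields a new studyLocusId.
df = df.withColumn(
    "studyLocusId",
    f.xxhash64(f.concat_ws("_", "studyId", "variantId")),
)
df.show(truncate=False)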

remo87 commented 3 months ago

@DSuveges could you help me with a copy of the data in JSON format, please?

buniello commented 3 months ago

Some planned changes to the data/schema are expected on this front. @DSuveges will list them here when the plan is finalised.

DSuveges commented 3 months ago

Some updates on the credible set schema:

DSuveges commented 3 months ago

There's an updated credible set dataset in JSON for the backend team to ingest: gs://ot-team/dsuveges/credible_sets.json

Number of credible sets: 3,019,002

The distribution of credible sets across study types:

+---------+-------+
|studyType|  count|
+---------+-------+
|    tuqtl| 411170|
|     gwas| 876691|
|     sqtl| 234066|
|     pqtl|   1772|
|     eqtl|1495303|
+---------+-------+
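
A sketch (assuming the export is newline-delimited JSON readable by Spark) of how these counts could be reproduced:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes newline-delimited JSON at the path shared above.
credible_sets = spark.read.json("gs://ot-team/dsuveges/credible_sets.json")

print(credible_sets.count())                       # 3,019,002 in this drop
credible_sets.groupBy("studyType").count().show()  # per-studyType breakdown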

Schema

Changes

root
 |-- studyId: string (nullable = true)
 |-- studyLocusId: string (nullable = true)
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- beta: double (nullable = true)
 |-- zScore: double (nullable = true)
 |-- pValueMantissa: float (nullable = true)
 |-- pValueExponent: integer (nullable = true)
 |-- effectAlleleFrequencyFromSource: float (nullable = true)
 |-- standardError: double (nullable = true)
 |-- subStudyDescription: string (nullable = true)
 |-- qualityControls: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetIndex: integer (nullable = true)
 |-- credibleSetlog10BF: double (nullable = true)
 |-- purityMeanR2: double (nullable = true)
 |-- purityMinR2: double (nullable = true)
 |-- locusStart: integer (nullable = true)
 |-- locusEnd: integer (nullable = true)
 |-- sampleSize: integer (nullable = true)
 |-- locus: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- is95CredibleSet: boolean (nullable = true)
 |    |    |-- is99CredibleSet: boolean (nullable = true)
 |    |    |-- logBF: double (nullable = true)
 |    |    |-- posteriorProbability: double (nullable = true)
 |    |    |-- variantId: string (nullable = true)
 |    |    |-- pValueMantissa: float (nullable = true)
 |    |    |-- pValueExponent: integer (nullable = true)
 |    |    |-- beta: double (nullable = true)
 |    |    |-- standardError: double (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)
 |-- strongestLocus2gene: struct (nullable = true)
 |    |-- geneId: string (nullable = true)
 |    |-- score: double (nullable = true)
 |-- studyType: string (nullable = true)
 |-- traitFromSourceMappedIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- qtlGeneId: string (nullable = true)

What the data looks like: PICS credible sets

For clarity, the list of tag variants is truncated.

{
  "studyId": "GCST90105038",
  "studyLocusId": "-9221662644183536443",
  "variantId": "3_50557710_C_T",
  "chromosome": "3",
  "position": 50557710,
  "region": null,
  "beta": 0.0134335,
  "zScore": null,
  "pValueMantissa": 6.0,
  "pValueExponent": -15,
  "effectAlleleFrequencyFromSource": null,
  "standardError": null,
  "subStudyDescription": null,
  "qualityControls": [],
  "finemappingMethod": "pics",
  "credibleSetIndex": null,
  "credibleSetlog10BF": null,
  "purityMeanR2": null,
  "purityMinR2": null,
  "locusStart": null,
  "locusEnd": null,
  "sampleSize": null,
  "locus": [
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": null,
      "posteriorProbability": 0.8281915098204469,
      "variantId": "3_50557710_C_T",
      "pValueMantissa": null,
      "pValueExponent": null,
      "beta": null,
      "standardError": 0.9999996011812708,
      "r2Overall": 1.0000000000000027
    },
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": null,
      "posteriorProbability": 0.018635570174451596,
      "variantId": "3_50972004_C_T",
      "pValueMantissa": null,
      "pValueExponent": null,
      "beta": null,
      "standardError": 0.0387769970112231,
      "r2Overall": 0.7735468485352232
    }
  ],
  "strongestLocus2gene": {
    "geneId": "ENSG00000088538",
    "score": 0.7871325016021729
  },
  "studyType": "gwas",
  "traitFromSourceMappedIds": [
    "EFO_0011015"
  ],
  "qtlGeneId": null
}

What the data looks like: SuSiE credible sets

{
  "studyId": "FINNGEN_R10_C3_RECTUM_ADENO_MUCINO_EXALLC",
  "studyLocusId": "-9131389010760691102",
  "variantId": "18_48925503_G_A",
  "chromosome": "18",
  "position": 48925503,
  "region": "chr18:47425503-50425503",
  "beta": -0.19892,
  "zScore": null,
  "pValueMantissa": 2.0810000896453857,
  "pValueExponent": -12,
  "effectAlleleFrequencyFromSource": 0.49358999729156494,
  "standardError": 0.0283,
  "subStudyDescription": null,
  "qualityControls": null,
  "finemappingMethod": "SuSie",
  "credibleSetIndex": 1,
  "credibleSetlog10BF": 5.91695653245,
  "purityMeanR2": null,
  "purityMinR2": null,
  "locusStart": null,
  "locusEnd": null,
  "sampleSize": null,
  "locus": [
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": 22.2531357731279,
      "posteriorProbability": 0.254337321094081,
      "variantId": "18_48925503_G_A",
      "pValueMantissa": 2.0810000896453857,
      "pValueExponent": -12,
      "beta": -0.19892,
      "standardError": 0.0283,
      "r2Overall": null
    },
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": 22.1759052599176,
      "posteriorProbability": 0.23543406807608,
      "variantId": "18_48925435_C_CG",
      "pValueMantissa": 2.255000114440918,
      "pValueExponent": -12,
      "beta": -0.198598,
      "standardError": 0.0282994,
      "r2Overall": null
    }
  ],
  "strongestLocus2gene": {
    "geneId": "ENSG00000101665",
    "score": 0.9262983798980713
  },
  "studyType": "gwas",
  "traitFromSourceMappedIds": null,
  "qtlGeneId": null
}

What the data looks like: QTL studies


{
  "studyId": "GTEx_esophagus_muscularis_ENSG00000105483.grp_2.upstream.ENST00000391898",
  "studyLocusId": "-9223054931290123158",
  "variantId": "19_48204408_G_A",
  "chromosome": "19",
  "position": 48204408,
  "region": "chr19:47255946-49255946",
  "beta": -0.274482,
  "zScore": null,
  "pValueMantissa": 1.0399999618530273,
  "pValueExponent": -7,
  "effectAlleleFrequencyFromSource": null,
  "standardError": 0.0507531,
  "subStudyDescription": null,
  "qualityControls": null,
  "finemappingMethod": "SuSie",
  "credibleSetIndex": 1,
  "credibleSetlog10BF": 14.906432151794434,
  "purityMeanR2": null,
  "purityMinR2": null,
  "locusStart": null,
  "locusEnd": null,
  "sampleSize": null,
  "locus": [
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": 12.9898148734311,
      "posteriorProbability": 0.143881970498929,
      "variantId": "19_48204408_G_A",
      "pValueMantissa": 1.0399999618530273,
      "pValueExponent": -7,
      "beta": -0.274482,
      "standardError": 0.0507531,
      "r2Overall": null
    },
    {
      "is95CredibleSet": true,
      "is99CredibleSet": true,
      "logBF": 12.9898148734311,
      "posteriorProbability": 0.143881970498929,
      "variantId": "19_48204483_G_A",
      "pValueMantissa": 1.0399999618530273,
      "pValueExponent": -7,
      "beta": -0.274482,
      "standardError": 0.0507531,
      "r2Overall": null
    }
  ],
  "strongestLocus2gene": null,
  "studyType": "tuqtl",
  "traitFromSourceMappedIds": null,
  "qtlGeneId": "ENSG00000105483"
}

DSuveges commented 3 months ago

@remo87 @jdhayhurst: In this schema there are a handful of identifiers to resolve:

d0choa commented 3 months ago

Some notes with @jdhayhurst on the credible set endpoints:

Study page:

Variant page:

Disease page:

Root-level endpoint:

I don't think we are missing anything obvious, but it might be good for @addramir & @DSuveges to have a look.

d0choa commented 2 months ago

Update 14/08: Hoping for a first iteration by Friday

DSuveges commented 2 months ago

New iteration of the credible set dataset

@remo87 The JSON-formatted data is here: gs://ot-team/dsuveges/credible_sets.json. Schema change: instead of the traitFromSourceMappedIds field, there's a diseaseIds field containing a list of validated EFO IDs, which should be resolvable against the disease index.

Important: although the dataset went through a validation step, the flagged credible sets are still in the dataset, except those without a variantId.
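
A minimal sketch of resolving the new diseaseIds field against the disease index (the disease index path and its "id" column are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

credible_sets = spark.read.json("gs://ot-team/dsuveges/credible_sets.json")

# Credible sets without a lead variant were already dropped upstream;
# this filter is only a defensive illustration.
credible_sets = credible_sets.filter(f.col("variantId").isNotNull())

# Hypothetical disease index location and key column ("id" holding the EFO ID).
disease_index = spark.read.parquet("gs://example-bucket/disease_index")

# One row per (credible set, EFO ID); the left join keeps sets whose IDs
# fail to resolve so they can be inspected.
resolved_diseases = (
    credible_sets
    .select("studyLocusId", f.explode_outer("diseaseIds").alias("diseaseId"))
    .join(disease_index, f.col("diseaseId") == disease_index["id"], "left")
)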

Schema:

root
 |-- studyId: string (nullable = true)
 |-- studyLocusId: string (nullable = true)
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- beta: double (nullable = true)
 |-- zScore: double (nullable = true)
 |-- pValueMantissa: float (nullable = true)
 |-- pValueExponent: integer (nullable = true)
 |-- effectAlleleFrequencyFromSource: float (nullable = true)
 |-- standardError: double (nullable = true)
 |-- subStudyDescription: string (nullable = true)
 |-- qualityControls: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetIndex: integer (nullable = true)
 |-- credibleSetlog10BF: double (nullable = true)
 |-- purityMeanR2: double (nullable = true)
 |-- purityMinR2: double (nullable = true)
 |-- locusStart: integer (nullable = true)
 |-- locusEnd: integer (nullable = true)
 |-- sampleSize: integer (nullable = true)
 |-- locus: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- is95CredibleSet: boolean (nullable = true)
 |    |    |-- is99CredibleSet: boolean (nullable = true)
 |    |    |-- logBF: double (nullable = true)
 |    |    |-- posteriorProbability: double (nullable = true)
 |    |    |-- variantId: string (nullable = true)
 |    |    |-- pValueMantissa: float (nullable = true)
 |    |    |-- pValueExponent: integer (nullable = true)
 |    |    |-- beta: double (nullable = true)
 |    |    |-- standardError: double (nullable = true)
 |    |    |-- r2Overall: double (nullable = true)
 |-- strongestLocus2gene: struct (nullable = true)
 |    |-- geneId: string (nullable = true)
 |    |-- score: double (nullable = true)
 |-- studyType: string (nullable = true)
 |-- diseaseIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- qtlGeneId: string (nullable = true)

Question about studyTypes:

For the enum, these are the current study type values: tuqtl, gwas, sqtl, pqtl, eqtl. The distribution of the labels:

+---------+-------+
|studyType|  count|
+---------+-------+
|    tuqtl| 411170|
|     gwas| 793759|
|     sqtl| 234066|
|     pqtl|   1772|
|     eqtl|1495303|
+---------+-------+
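
A sketch of how the studyType enum could be modelled (Python used here only for illustration; the backend enum type and naming are still open):

from enum import Enum


class StudyType(str, Enum):
    """Study type values observed in the current credible set dataset."""

    GWAS = "gwas"
    EQTL = "eqtl"
    PQTL = "pqtl"
    SQTL = "sqtl"
    TUQTL = "tuqtl"


# Parsing a raw label from the dataset back into the enum.
assert StudyType("tuqtl") is StudyType.TUQTL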

buniello commented 2 months ago

Discussed on Slack: we should change the field name credibleSetId (all occurrences) to studyLocusId in the credibleSet API to avoid confusion (especially for FE development).

DSuveges commented 2 months ago

Based on a discussion with @d0choa and @jdhayhurst: as part of planning the evidence and L2G integration, some updates are proposed for the credible set data ingestion and modelling: