d0choa commented 2 years ago

Background

For context, GWAS studies have a unique study_id for each analysis. For sources such as UKBB or the GWAS Catalog, where one study contains multiple GWASes, the ID is suffixed with an integer that uniquely identifies the trait under study (e.g. NEALE_1). This keeps each GWAS unique, and we can trace all the study details with a single study_id.

Molecular trait studies (eQTL, pQTL, sQTL) need a few extra considerations to capture the full granularity of the study. A single study (e.g. study_id: GTEx-eQTL) might contain multiple phenotypes (e.g. transcripts, proteins, splice sites) and multiple biofeatures (e.g. tissues, cell types). Unlike GWAS studies, if they come from the same publication/data release, they are all captured in a single study_id (e.g. GTEx-eQTL). Because of this lack of granularity, the study_id alone is not enough to characterise the trait, and we end up carrying a lot of metadata through the pipelines, API and FE.

For example, take the following credible_set entry:

{
  "bio_feature": "MONOCYTE_IFN24",
  "is95_credset": true,
  "is99_credset": true,
  "lead_alt": "A",
  "lead_chrom": "9",
  "lead_pos": 136350334,
  "lead_ref": "G",
  "lead_variant_id": "9:136350334:G:A",
  "logABF": 5.882305944201195,
  "multisignal_method": "conditional",
  "phenotype_id": "ILMN_1807044",
  "postprob": 0.022285632272371,
  "postprob_cumsum": 0.228452658965028,
  "study_id": "Fairfax_2014",
  "tag_alt": "G",
  "tag_beta": 0.13323,
  "tag_beta_cond": 0.13323,
  "tag_chrom": "9",
  "tag_pos": 136568428,
  "tag_pval": 2.89983e-07,
  "tag_pval_cond": 2.89983e-07,
  "tag_ref": "C",
  "tag_se": 0.0254746,
  "tag_se_cond": 0.0254746,
  "tag_variant_id": "9:136568428:C:G",
  "type": "eqtl"
}

The following fields are the information we carry over to describe the study. We are also planning to expand this to include gene_id when appropriate. More info in #2688

  "study_id": "Fairfax_2014",
  "bio_feature": "MONOCYTE_IFN24",
  "phenotype_id": "ILMN_1807044",
  "type": "eqtl"

In coloc, because we compare two studies, this is even more pronounced, and it further complicates the handling of unique studies. For example:

{
  "coloc_n_vars": 2609,
  "coloc_h0": 0.010606728230554743,
  "coloc_h1": 0.6979127411053581,
  "coloc_h2": 0.003057284630829974,
  "coloc_h3": 0.20107910895401118,
  "coloc_h4": 0.08734413707924574,
  "left_study": "GCST90002334",
  "left_type": "gwas",
  "left_chrom": "9",
  "left_pos": 132990109,
  "left_ref": "T",
  "left_alt": "C",
  "right_study": "CEDAR",
  "right_type": "eqtl",
  "right_phenotype": "ILMN_1723418",
  "right_bio_feature": "RECTUM",
  "right_chrom": "9",
  "right_pos": 133007257,
  "right_ref": "T",
  "right_alt": "G",
  "coloc_h4_h3": 0.4343769849269733,
  "coloc_log2_h4_h3": -1.2029804296164983,
  "is_flipped": false,
  "right_gene_id": "ENSG00000170835",
  "left_var_right_study_beta": -0.0282431,
  "left_var_right_study_se": 0.0105022,
  "left_var_right_study_pval": 0.00762985,
  "left_var_right_isCC": false
}

Proposal

The proposal here is to create an appropriate study index that captures molecular trait metadata. This index will be populated purely from a new study_id that captures all the molecular trait granularity. This will allow us to build a GraphQL index that resolves the study entity consistently, making the data lighter and reducing the logic required in multiple places of the codebase. It will also help standardise the way study information is queried across the codebase.
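
As a rough illustration of the idea, the sketch below derives a single ID from the four descriptive fields and splits a record into a study index entry plus a lighter data row. The function names, the `|` separator and the `source_study_id` field are assumptions made for this example only, not an agreed format.

```python
from typing import Dict, Tuple

# Fields that currently travel with every molecular trait record.
KEY_FIELDS = ("study_id", "type", "bio_feature", "phenotype_id")


def make_molecular_study_id(record: Dict) -> str:
    """Join the identifying fields into one ID (the separator is hypothetical)."""
    return "|".join(str(record[field]) for field in KEY_FIELDS)


def split_record(record: Dict) -> Tuple[Dict, Dict]:
    """Return (study index entry, slimmed record keyed by the new study_id)."""
    new_id = make_molecular_study_id(record)
    # The index entry keeps the metadata, so it can be resolved from the ID alone.
    index_entry = {
        "study_id": new_id,
        "source_study_id": record["study_id"],  # hypothetical field name
        "type": record["type"],
        "bio_feature": record["bio_feature"],
        "phenotype_id": record["phenotype_id"],
    }
    # The data row drops the metadata and keeps only the new ID.
    slim = {k: v for k, v in record.items() if k not in KEY_FIELDS}
    slim["study_id"] = new_id
    return index_entry, slim


record = {
    "study_id": "Fairfax_2014",
    "bio_feature": "MONOCYTE_IFN24",
    "phenotype_id": "ILMN_1807044",
    "type": "eqtl",
    "lead_variant_id": "9:136350334:G:A",
}
index_entry, slim = split_record(record)
```

With something along these lines, downstream tables such as coloc would only need one ID per side, and the descriptive fields would be looked up in the study index.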

Considerations

Biofeature mappings: There seem to be some unresolved issues around biofeature mappings that could be resolved as part of this work. The presence of the infamous hack seems to be just a patch covering the absence of appropriate data modelling. https://docs.google.com/document/d/1uf3NH0u87DYbk3Uf_rjxqMa5R7KdbOO4TvkBwkmc7Ss/edit
phenotype_id -> gene_id: We have done the phenotype_id -> gene_id mapping in multiple places, sometimes based on incomplete lookup tables (LUTs). There are currently several locations containing LUTs, some with incomplete mappings. This is an opportunity to resolve the issue. Some background is available in #2670
Search: Changes in the study index can affect how search behaves. We will need to review this when the time comes.
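
For the phenotype_id -> gene_id point, one possible first step is to merge the scattered LUTs into a single table and surface disagreements and gaps. The sketch below assumes each LUT can be loaded as a plain dict; the helper names and the `PHENO_X`/`GENE_A`/`GENE_B` values are made up for illustration, while the ILMN_1723418 pairing is taken from the coloc example above.

```python
from typing import Dict, Iterable, List, Set, Tuple


def merge_luts(
    luts: Iterable[Dict[str, str]]
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Merge phenotype_id -> gene_id tables; the first mapping seen wins.

    Returns the merged table and, per phenotype_id, the set of gene IDs
    that disagree across the input tables.
    """
    merged: Dict[str, str] = {}
    conflicts: Dict[str, Set[str]] = {}
    for lut in luts:
        for phenotype_id, gene_id in lut.items():
            known = merged.setdefault(phenotype_id, gene_id)
            if known != gene_id:
                conflicts.setdefault(phenotype_id, {known}).add(gene_id)
    return merged, conflicts


def unmapped(phenotype_ids: Iterable[str], lut: Dict[str, str]) -> List[str]:
    """List the phenotype IDs that have no gene mapping in the table."""
    return sorted(set(phenotype_ids) - set(lut))


lut_a = {"ILMN_1723418": "ENSG00000170835"}
lut_b = {"ILMN_1723418": "ENSG00000170835", "PHENO_X": "GENE_A"}  # made-up entry
lut_c = {"PHENO_X": "GENE_B"}  # made-up conflicting entry
merged, conflicts = merge_luts([lut_a, lut_b, lut_c])
missing = unmapped(["ILMN_1807044", "PHENO_X"], merged)
```

Reporting conflicts and missing IDs separately, rather than silently overwriting, would make it easier to see which of the existing LUTs are incomplete before choosing a single source.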

@DSuveges can you help me review this ticket, scope it and assign it appropriately?

d0choa commented 1 year ago

@DSuveges could you copy this info somewhere and close the ticket as well? (we have a few of these)

DSuveges commented 1 month ago

With the current QTL ingestion and the planned biofeature index, this is not an issue.

Improve representation molecular trait studies in study index and downstream #2690