opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Missing top loci crash application on gene colocalisation table #2710

Closed buniello closed 1 year ago

buniello commented 2 years ago

The table "Associated studies: Colocalisation analysis" lists the studies that have evidence of colocalisation with molecular QTLs for a given gene (e.g. APOE).

We have noticed that, in both dev and production environments, the gene prioritisation tab linked from that table leads to a page with no data for some FINNGEN studies. This happens when:

Working example: the 'gene prioritisation' links from the FINNGEN_R5_RX_STATIN studies on the APOE page below crash: https://genetics.opentargets.org/gene/ENSG00000130203 Page 981-986 19_44935906_C_G

The coloc results listing FINNGEN_R5_RX_STATIN as the left study and APOE as the right have been queried. Five results have the same lead variant on each side. This case crashes in the FE when clicking on the gene prioritisation link.


SELECT * FROM `bigquery-public-data.open_targets_genetics.variant_disease_coloc` WHERE `left_study` = "FINNGEN_R5_RX_STATIN" AND `right_phenotype` = "ENSG00000130203" LIMIT 10

Possible reason: That variant (19_44935906_C_G) is not available in V2D.


SELECT * FROM `bigquery-public-data.open_targets_genetics.variant_disease` WHERE `study_id` = "FINNGEN_R5_RX_STATIN" AND `lead_pos` = 44935906 LIMIT 10

Though it is available in FINNGEN R5 (screenshot: Screenshot 2022-08-26 at 11.43.19.png)
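The check performed by the two queries above can be sketched as an anti-join: take every coloc lead variant for a study and report the ones with no matching lead in V2D. This is a minimal Python sketch on in-memory rows, not the pipeline's code. Apart from `left_study`, `right_phenotype`, `study_id` and `lead_pos`, which appear in the queries, the column names are assumptions, and the rows at position 1000 are hypothetical.

```python
def variant_id(chrom, pos, ref, alt):
    """Build a variant identifier in the chrom_pos_ref_alt form used in this thread."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def coloc_leads_missing_from_v2d(coloc_rows, v2d_rows):
    """Return sorted (study, variant) pairs that are coloc leads but not V2D leads."""
    v2d_leads = {
        (r["study_id"],
         variant_id(r["lead_chrom"], r["lead_pos"], r["lead_ref"], r["lead_alt"]))
        for r in v2d_rows
    }
    missing = set()
    for r in coloc_rows:
        key = (r["left_study"],
               variant_id(r["left_chrom"], r["left_pos"], r["left_ref"], r["left_alt"]))
        if key not in v2d_leads:
            missing.add(key)
    return sorted(missing)


# Toy rows. The first coloc row is the variant from this issue;
# the rows at position 1000 are hypothetical and only there for contrast.
coloc_rows = [
    {"left_study": "FINNGEN_R5_RX_STATIN", "left_chrom": "19", "left_pos": 44935906,
     "left_ref": "C", "left_alt": "G", "right_phenotype": "ENSG00000130203"},
    {"left_study": "FINNGEN_R5_RX_STATIN", "left_chrom": "19", "left_pos": 1000,
     "left_ref": "A", "left_alt": "T", "right_phenotype": "ENSG00000130203"},
]
v2d_rows = [
    {"study_id": "FINNGEN_R5_RX_STATIN", "lead_chrom": "19", "lead_pos": 1000,
     "lead_ref": "A", "lead_alt": "T"},
]

missing = coloc_leads_missing_from_v2d(coloc_rows, v2d_rows)
print(missing)
```

Any pair returned here is a candidate for the crash described above, because the gene prioritisation link points at a lead variant that V2D does not know about.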

Action items (as also discussed with David and on Slack channels)

Current query for the Associated studies: Colocalisation analysis on gene page:


query GenePageQuery {
  colocalisationsForGene(geneId: "ENSG00000130203") {
    leftVariant {
      id
      rsId
      __typename
    }
    study {
      studyId
      traitReported
      pubAuthor
      pubDate
      pmid
      hasSumstats
      __typename
    }
    qtlStudyId
    phenotypeId
    tissue {
      id
      name
      __typename
    }
    h3
    h4
    log2h4h3
    __typename
  }
}
d0choa commented 2 years ago

TL;DR The issue stems from a design choice on how we define the top loci vs the credible sets. This problem is more acute in Finngen.

In the Genetics Portal, we create a region in the genome where a particular peak was found. Fine-mapping also creates the concept of a credible set, which is a more granular region that potentially splits independent signals based on LD.

Now comes the interesting part. The process we have defines as top locus (lead) the variant with the lowest posterior probability in the region. However, the credible set lead variant is the variant with the lowest posterior probability in a potentially more granular region, the credible set. The two definitions of lead agree very often, but they don't have to. The problem described emerged because we define top loci based on the region, but we calculate colocalisation on lead variants from the credible sets.

This is an overall inconsistency, but it's not dramatic: 2k/600k leads are in the credible set but not in the top loci table. Unless you focus on Finngen. Because Finngen uses SuSiE for fine-mapping, it does not assume a single causal variant per region and splits regions into different credible sets more often. This means that we have a much higher proportion of disagreeing lead variants: around 1k out of 6k Finngen lead variants according to the credible sets are ignored in V2D, because they are leads of a credible set but not top loci variants.
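To make the mismatch concrete, here is a minimal Python sketch that picks one lead per region and one lead per credible set, then lists the credible-set leads that never become top loci. The rows reuse four lines of the chr8 table posted further down this thread. Ranking by p-value is an assumption made for illustration (that table reports p-values); the real pipeline's ranking statistic and code may differ.

```python
# (region, credible_set_id, lead_id, minimum p-value in the credible set)
rows = [
    ("chr8:123638420-128340365", 1, "8_127091872_A_G", 6.69e-88),
    ("chr8:123638420-128340365", 2, "8_127506309_C_A", 9.02e-77),
    ("chr8:123638420-128340365", 3, "8_127064901_G_A", 3.33e-41),
    ("chr8:128340365-128679427", 1, "8_128630605_C_A", 3.87e-11),
]


def top_locus_per_region(rows):
    """Region-level definition: keep a single best lead per region."""
    best = {}
    for region, _cs_id, lead, pval in rows:
        if region not in best or pval < best[region][1]:
            best[region] = (lead, pval)
    return {region: lead for region, (lead, _pval) in best.items()}


def leads_not_in_top_loci(rows):
    """Credible-set definition: every row is a lead; report those lost at region level."""
    top = set(top_locus_per_region(rows).values())
    return sorted(lead for _region, _cs_id, lead, _pval in rows if lead not in top)


top_loci = top_locus_per_region(rows)
orphans = leads_not_in_top_loci(rows)
print(top_loci)
print(orphans)
```

In this toy input the first region has three credible sets but only one top locus, so two credible-set leads have no counterpart in the top loci table. That is the situation that becomes frequent when fine-mapping splits regions into several credible sets.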

DSuveges commented 2 years ago

For sure, the two approaches are based on two different assumptions. But it leaves us with two questions to answer before we can move forward with the streamlining of the release process:

  1. Can we assume that the toploci list is a completely overlapping set with lead variants of the coloc dataset? I would assume so, and I think the generation of the two datasets should depend on each other. However, I'm not 100% sure if this is a scientifically correct assumption.
  2. If the answer is yes, which currently implemented logic should be kept? I'm more keen on keeping all the lead variants from every credible set, even if they are located in the same region.

@MayaGhoussaini what is your opinion on the above questions?

DSuveges commented 2 years ago

After the discussion with @MayaGhoussaini there were a few things we agreed upon:

The question was to double-check how prevalent this discrepancy is in our data.

How many credible sets are not represented in the top-loci table:

+------------+---------------+-----------+
|study_source|missing_credset|study_count|
+------------+---------------+-----------+
|GWAS_CATALOG|             10|          9|
|     FINNGEN|           1585|        320|
+------------+---------------+-----------+

It seems that nearly 1,600 credible sets (1,595) from 329 studies do not have a top-loci entry. As expected, this mostly occurs in Finngen.
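A tally of this shape could be produced by counting, per study source, the credible sets flagged as absent from top loci and the distinct studies they come from. The sketch below is a plain-Python illustration, not the notebook's code. The input rows and study identifiers are hypothetical, and the `is_in_toploci` flag is assumed to have been computed beforehand. The last lines only add up the figures reported in the table above.

```python
from collections import defaultdict


def summarise_missing(credsets):
    """Per study source: (credible sets missing from top loci, distinct studies affected)."""
    missing = defaultdict(int)
    studies = defaultdict(set)
    for cs in credsets:
        if not cs["is_in_toploci"]:
            missing[cs["study_source"]] += 1
            studies[cs["study_source"]].add(cs["study_id"])
    return {source: (missing[source], len(studies[source])) for source in missing}


# Hypothetical input rows, one per credible set.
toy_credsets = [
    {"study_source": "FINNGEN", "study_id": "FINNGEN_A", "is_in_toploci": False},
    {"study_source": "FINNGEN", "study_id": "FINNGEN_A", "is_in_toploci": False},
    {"study_source": "FINNGEN", "study_id": "FINNGEN_B", "is_in_toploci": True},
    {"study_source": "GWAS_CATALOG", "study_id": "GCST_X", "is_in_toploci": False},
]
summary = summarise_missing(toy_credsets)
print(summary)

# Totals of the figures reported in the table above.
reported = {"GWAS_CATALOG": (10, 9), "FINNGEN": (1585, 320)}
total_missing = sum(m for m, _s in reported.values())
total_studies = sum(s for _m, s in reported.values())
print(total_missing, total_studies)
```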

What does the source data look like?

Selected study:

This chromosome has a number of credible sets, several of which are missing from the toploci table. Data aggregation steps:

  1. Use lead_variant_id to join top-loci table with finemapping table.
  2. Use the source data to bring in the annotated region, credible set identifier and the lowest p-value in the credible set.
  3. Use lead_variant_id to join with the source data.
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|region                  |credible_set_id|lead_id        |tag_variant_count|is_in_toploci|cred_set_minimum_p-value|
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|chr8:123638420-128340365|4              |8_127401060_G_T|3                |false        |1.14E-25                |
|chr8:123638420-128340365|1              |8_127091872_A_G|1                |true         |6.69E-88                |
|chr8:123638420-128340365|3              |8_127064901_G_A|1                |false        |3.33E-41                |
|chr8:123638420-128340365|2              |8_127506309_C_A|10               |false        |9.02E-77                |
|chr8:123638420-128340365|7              |8_127528531_C_G|1                |false        |2.08E-27                |
|chr8:123638420-128340365|8              |8_127311574_C_T|9                |false        |4.57E-25                |
|chr8:123638420-128340365|6              |null           |null             |null         |7.68E-8                 |
|chr8:123638420-128340365|-1             |null           |null             |null         |2.53E-88                |
|chr8:123638420-128340365|5              |8_127015709_G_A|7                |false        |2.9E-21                 |
|chr8:123638420-128340365|9              |null           |null             |null         |2.54E-4                 |
|chr8:128340365-128679427|1              |8_128630605_C_A|3                |true         |3.87E-11                |
|chr8:128340365-128679427|-1             |null           |null             |null         |1.29E-6                 |
|chr8:22149459-25149459  |1              |8_23649459_A_G |28               |true         |7.33E-22                |
|chr8:22149459-25149459  |-1             |null           |null             |null         |5.61E-21                |
|chr8:22149459-25149459  |2              |null           |null             |null         |1.36E-13                |
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
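The aggregation steps listed before the table can be sketched as two lookups keyed on the lead variant. This is a simplified Python illustration, not the notebook's code. It assumes each source-data row already carries its credible set's lead identifier (or `None` when it has none), and it uses plain sets in place of the fine-mapping and top-loci tables. The rows are copied from the table above.

```python
def annotate_credsets(source_credsets, finemapping_leads, toploci_leads):
    """Left-join source credible sets with fine-mapping leads and flag top-loci membership."""
    # Step 1: flag each fine-mapped lead by whether it is also a top locus.
    flagged = {lead: lead in toploci_leads for lead in finemapping_leads}
    # Steps 2-3: keep every source credible set; sets without a fine-mapped lead stay null.
    annotated = []
    for cs in source_credsets:
        lead = cs["lead_id"] if cs["lead_id"] in flagged else None
        annotated.append({**cs, "lead_id": lead, "is_in_toploci": flagged.get(lead)})
    return annotated


source_credsets = [
    {"region": "chr8:123638420-128340365", "credible_set_id": 1,
     "lead_id": "8_127091872_A_G", "min_pvalue": 6.69e-88},
    {"region": "chr8:123638420-128340365", "credible_set_id": 4,
     "lead_id": "8_127401060_G_T", "min_pvalue": 1.14e-25},
    {"region": "chr8:123638420-128340365", "credible_set_id": 6,
     "lead_id": None, "min_pvalue": 7.68e-8},
]
finemapping_leads = {"8_127091872_A_G", "8_127401060_G_T"}
toploci_leads = {"8_127091872_A_G"}

annotated = annotate_credsets(source_credsets, finemapping_leads, toploci_leads)
for row in annotated:
    print(row["credible_set_id"], row["lead_id"], row["is_in_toploci"])
```

Keeping the join left-sided is what preserves the rows with a null lead, so credible sets that were never fine-mapped remain visible next to the ones that were fine-mapped but dropped from top loci.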

Let's see the region chr8:123638420-128340365:

Let's see region chr8:128340365-128679427:

Let's see region chr8:22149459-25149459:

Conclusion:

(Notebook here)

DSuveges commented 1 year ago

I think I have extensively explored the root causes of this issue and identified the problem. However, this won't be an issue with the next iteration of the pipelines, so I'm closing the ticket.