Missing top loci crash application on gene colocalisation table

The table "Associated studies: Colocalisation analysis" lists which studies have evidence of colocalisation with molecular QTLs for a given gene (e.g. APOE).

We have noticed that (in both dev and production environments) the gene prioritisation tab linked from that table leads to a page with no data for some FINNGEN studies. This happens when:

the left variant (GWAS lead variant for the GWAS study with evidence of co-localisation) and the right variant (lead variant from QTL study with evidence of co-localisation with GWAS study) are ultimately the same variant.

Working Example -- The 'gene prioritisation' links from FINNGEN_R5_RX_STATIN studies on the APOE page below crash: https://genetics.opentargets.org/gene/ENSG00000130203 Page 981-986 19_44935906_C_G

Coloc results listing FINNGEN_R5_RX_STATIN as the left study and APOE as the right phenotype have been queried. Five results have the same lead variant on each side. This case crashes the FE when clicking on the gene prioritisation link.


SELECT * FROM `bigquery-public-data.open_targets_genetics.variant_disease_coloc` WHERE `left_study` = "FINNGEN_R5_RX_STATIN" AND `right_phenotype` = "ENSG00000130203" LIMIT 10

Possible reason: That variant (19_44935906_C_G) is not available in V2D.


SELECT * FROM `bigquery-public-data.open_targets_genetics.variant_disease` WHERE `study_id` = "FINNGEN_R5_RX_STATIN" AND `lead_pos` = 44935906 LIMIT 10

Though it is available in FINNGEN R5 (screenshot: Screenshot 2022-08-26 at 11.43.19.png).
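The check behind the two queries above can be sketched in a few lines of Python. This is only an illustration on toy rows: the simplified column names `left_variant` and `lead_variant`, the helper name and the variant `19_100_A_T` are assumptions made for the example and are not taken from the real tables.

```python
# Toy rows with hypothetical, simplified column names ("left_variant", "lead_variant");
# the real BigQuery tables have many more columns.
coloc_rows = [
    {"left_study": "FINNGEN_R5_RX_STATIN", "left_variant": "19_44935906_C_G",
     "right_phenotype": "ENSG00000130203"},
    {"left_study": "FINNGEN_R5_RX_STATIN", "left_variant": "19_100_A_T",  # made-up variant
     "right_phenotype": "ENSG00000130203"},
]
v2d_rows = [
    {"study_id": "FINNGEN_R5_RX_STATIN", "lead_variant": "19_100_A_T"},  # made-up variant
]


def coloc_leads_missing_from_v2d(coloc_rows, v2d_rows):
    """Return (study, variant) pairs used as coloc leads but absent from V2D."""
    known = {(row["study_id"], row["lead_variant"]) for row in v2d_rows}
    used = {(row["left_study"], row["left_variant"]) for row in coloc_rows}
    return sorted(used - known)


missing = coloc_leads_missing_from_v2d(coloc_rows, v2d_rows)
print(missing)
```

Any pair returned here is a candidate for the crash described above, since the gene prioritisation link would point at a study/variant combination with no V2D record.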

Action items (as also discussed with David and on slack channels)

[ ] Data team to dig into the FINNGEN data at different processing stages to identify why such top loci are not available in V2D.
[ ] BE to expose the right variant in the API query for the Associated studies: Colocalisation analysis on gene page (data is already available), so that
[ ] FE can add a new QTL lead variant column/header to the table (a sub-ticket will also be created to cover this task).

Current query for the Associated studies: Colocalisation analysis on gene page:


query GenePageQuery {
  colocalisationsForGene(geneId: "ENSG00000130203") {
    leftVariant {
      id
      rsId
      __typename
    }
    study {
      studyId
      traitReported
      pubAuthor
      pubDate
      pmid
      hasSumstats
      __typename
    }
    qtlStudyId
    phenotypeId
    tissue {
      id
      name
      __typename
    }
    h3
    h4
    log2h4h3
    __typename
  }
}

TL;DR The issue stems from a design choice on how we define the top loci vs the credible sets. This problem is more acute in FinnGen.

In the Genetics Portal, we create a region in the genome where a particular peak was found. Fine-mapping also introduces the concept of a credible set, which is a more granular region, potentially splitting independent signals based on LD.

Now comes the interesting part. Our process defines as top locus (lead) the variant with the lowest posterior probability in the region. However, the credible set lead variant is the variant with the lowest posterior probability in a potentially more granular region, the credible set. The two definitions of lead agree very often, but they don't have to. The problem described emerged because we define top loci based on the region, but we calculate colocalisation on lead variants from the credible set.

This is an overall inconsistency, but it's not dramatic: about 2k out of 600k leads are in the credible set but not in the top loci table. Unless you focus on FinnGen. Because FinnGen uses SuSiE for fine-mapping, it does not assume a single causal variant per region and more often splits regions into different credible sets. This means that we have a much higher proportion of disagreeing lead variants: around 1k out of 6k FinnGen credible set lead variants are ignored in V2D, because they are leads of a credible set but not top loci variants.
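To make the disagreement between the two lead definitions concrete, here is a minimal Python sketch on made-up variants. It assumes, purely for illustration, that leads are picked by lowest p-value within a group; the same pattern holds for whatever statistic the pipeline actually ranks on. The variant ids, regions and helper name are hypothetical.

```python
# Toy variants: (variant_id, region, credible_set_id, p_value). All values are made up.
variants = [
    ("v1", "regionA", 1, 1e-30),
    ("v2", "regionA", 1, 1e-12),
    ("v3", "regionA", 2, 1e-20),
    ("v4", "regionB", 1, 1e-9),
]


def leads_by(variants, key):
    """Return the ids of the best-ranked variant within each group defined by `key`."""
    best = {}
    for variant_id, region, credible_set_id, p_value in variants:
        group = key(region, credible_set_id)
        if group not in best or p_value < best[group][1]:
            best[group] = (variant_id, p_value)
    return {variant_id for variant_id, _ in best.values()}


# One lead per region (top-loci style) vs one lead per credible set.
region_leads = leads_by(variants, lambda region, cs: region)
credset_leads = leads_by(variants, lambda region, cs: (region, cs))

# Credible set leads that never make it into the region-based list.
dropped = credset_leads - region_leads
print(sorted(dropped))
```

In this toy case `v3` leads the second credible set of `regionA` but is not the region lead, so it is exactly the kind of variant that colocalisation would reference and the top loci table would lack.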

For sure, the two approaches are based on two different assumptions. But it leaves us with two questions to answer before we can move forward with the streamlining of the release process:

Can we assume that the toploci list is a completely overlapping set with lead variants of the coloc dataset? I would assume so, and I think the generation of the two datasets should depend on each other. However I'm not 100% sure if this is a scientifically correct assumption.
If the answer is yes, which currently implemented logic should be kept? I lean towards keeping all the lead variants from every credible set, even if they are located in the same region.

This line shows how the top-loci table is generated based on region, losing the granularity of credible sets.
This line shows that the credible set generation keeps the full granularity (CS - credible set identifier).

@MayaGhoussaini what is your opinion on the above questions?

After the discussion with @MayaGhoussaini there were a few things we agreed upon:

Credible sets have one lead variant.
The lead variant represents the credible set in the top loci table.
Therefore all lead variants need to be present in the top loci table.

The question was to double check how prevalent this discrepancy is in our data.

How many credible sets are not represented in the top-loci table:

+------------+---------------+-----------+
|study_source|missing_credset|study_count|
+------------+---------------+-----------+
|GWAS_CATALOG|             10|          9|
|     FINNGEN|           1585|        320|
+------------+---------------+-----------+

It seems that almost 1,600 credible sets from roughly 330 studies do not have a top-loci entry. As expected, this mostly occurs in FinnGen.
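A count like the one in the table above can be sketched as an anti-join followed by a group-by. The Python below runs on toy tuples; the study ids, variant ids and function name are hypothetical, and the real computation was done on the full datasets.

```python
# Toy inputs. credset_leads: (study_source, study_id, lead_variant_id);
# top_loci: (study_id, lead_variant_id). All ids are made up.
credset_leads = [
    ("FINNGEN", "study_a", "v1"),
    ("FINNGEN", "study_a", "v2"),
    ("FINNGEN", "study_b", "v3"),
    ("GWAS_CATALOG", "study_c", "v4"),
    ("GWAS_CATALOG", "study_c", "v5"),
]
top_loci = [("study_a", "v1"), ("study_c", "v4"), ("study_c", "v5")]


def summarise_missing(credset_leads, top_loci):
    """Per source: (credible sets without a top-loci row, number of affected studies)."""
    in_top_loci = set(top_loci)
    missing_count = {}
    affected_studies = {}
    for source, study_id, lead in credset_leads:
        if (study_id, lead) not in in_top_loci:
            missing_count[source] = missing_count.get(source, 0) + 1
            affected_studies.setdefault(source, set()).add(study_id)
    return {s: (missing_count[s], len(affected_studies[s])) for s in missing_count}


summary = summarise_missing(credset_leads, top_loci)
print(summary)
```

Sources with no missing credible sets simply do not appear in the result, which matches the shape of the table above where only affected sources are listed.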

What does the source data look like?

Selected study:

finngen_R6_C3_MALE_GENITAL
dataset location: gs://finngen-public-data-r6/finemapping/full/finngen_R6_C3_MALE_GENITAL.SUSIE.snp.bgz
chromosome: 8.

This chromosome has a number of credible sets and a number of them are missing from the toploci table. Data aggregation steps:

Use lead_variant_id to join top-loci table with finemapping table.
Use the source data to bring in the annotated region, credible set identifier and the lowest p-value in the credible set.
Using lead_variant_id to join with source data.
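The aggregation steps above can be approximated with the following Python sketch. It is a guess at the shape of the join, not the notebook code: the input layouts, the function name and the `toy_variant_*` ids are assumptions, while the region, lead variant, tag count and p-values for the two credible sets are taken from the chr8:22149459-25149459 rows of the table below.

```python
# source_rows: (variant_id, region, credible_set_id, p_value) from the source fine-mapping file.
# The "toy_variant_*" ids and the 1e-10 p-value are made up for illustration.
source_rows = [
    ("8_23649459_A_G", "chr8:22149459-25149459", 1, 7.33e-22),
    ("toy_variant_1", "chr8:22149459-25149459", 1, 1e-10),
    ("toy_variant_2", "chr8:22149459-25149459", 2, 1.36e-13),
]
# finemapping: lead_variant_id -> tag variant count; top_loci: set of lead variant ids.
finemapping = {"8_23649459_A_G": 28}
top_loci = {"8_23649459_A_G"}


def summarise_credible_sets(source_rows, finemapping, top_loci):
    """One row per (region, credible set); fields stay None when no lead is found."""
    summary = {}
    for variant_id, region, cs_id, p_value in source_rows:
        row = summary.setdefault((region, cs_id), {
            "lead_id": None, "tag_variant_count": None,
            "is_in_toploci": None, "min_p": p_value,
        })
        row["min_p"] = min(row["min_p"], p_value)
        if variant_id in finemapping:  # join on lead_variant_id
            row["lead_id"] = variant_id
            row["tag_variant_count"] = finemapping[variant_id]
            row["is_in_toploci"] = variant_id in top_loci
    return summary


result = summarise_credible_sets(source_rows, finemapping, top_loci)
for key, row in sorted(result.items()):
    print(key, row)
```

Credible sets whose variants never match a fine-mapping lead keep `None` in the lead columns, which corresponds to the `null` rows in the table.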

+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|region                  |credible_set_id|lead_id        |tag_variant_count|is_in_toploci|cred_set_minimum_p-value|
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|chr8:123638420-128340365|4              |8_127401060_G_T|3                |false        |1.14E-25                |
|chr8:123638420-128340365|1              |8_127091872_A_G|1                |true         |6.69E-88                |
|chr8:123638420-128340365|3              |8_127064901_G_A|1                |false        |3.33E-41                |
|chr8:123638420-128340365|2              |8_127506309_C_A|10               |false        |9.02E-77                |
|chr8:123638420-128340365|7              |8_127528531_C_G|1                |false        |2.08E-27                |
|chr8:123638420-128340365|8              |8_127311574_C_T|9                |false        |4.57E-25                |
|chr8:123638420-128340365|6              |null           |null             |null         |7.68E-8                 |
|chr8:123638420-128340365|-1             |null           |null             |null         |2.53E-88                |
|chr8:123638420-128340365|5              |8_127015709_G_A|7                |false        |2.9E-21                 |
|chr8:123638420-128340365|9              |null           |null             |null         |2.54E-4                 |
|chr8:128340365-128679427|1              |8_128630605_C_A|3                |true         |3.87E-11                |
|chr8:128340365-128679427|-1             |null           |null             |null         |1.29E-6                 |
|chr8:22149459-25149459  |1              |8_23649459_A_G |28               |true         |7.33E-22                |
|chr8:22149459-25149459  |-1             |null           |null             |null         |5.61E-21                |
|chr8:22149459-25149459  |2              |null           |null             |null         |1.36E-13                |
+------------------------+---------------+---------------+-----------------+-------------+------------------------+

Let's see the region chr8:123638420-128340365:

In this region the source data defines 9 credible sets (and an invalid one with the id -1).
Only one of these is in the top-loci table.
Two of them, with id:9 and id:6, are not in our fine-mapping table.
Both of those credible sets have sub-significant p-values.

Let's see region chr8:128340365-128679427:

One valid credible set from the region.
The lead variant is in the toploci table.

Let's see region chr8:22149459-25149459:

Two valid credible sets in the region.
Only one of them is in the finemapping table, which is in the toploci table as well.
However, the other valid credible set has a significant p-value (1.36E-13).

Conclusion:

It seems we have toploci from every region.
Some credible sets don't make it to the finemapping table.

(Notebook here)

I think I have extensively explored the root causes of this issue and identified the problem. However, this won't be an issue with the next iteration of the pipelines, so I'm closing the ticket.