Integrating and displaying splice QTL (sQTL) data in the Genetics Portal

buniello commented 2 years ago

As a developer/owner, I should pin down the best way to integrate the new splice QTL datasets into the Genetics Portal and display them to users.

The sQTLs contribute to the new V2G pipeline.

[ ] BE team to make available development ES and CH images with sQTLs.

sQTLs in the Gene Page (from Jeremy's ppt)

[ ] Add as rows in gene page colocalisation analysis

sQTLs in the Variant Page (from Jeremy's ppt)

[ ] Add as a column in the assigned genes summary
[ ] Add as a tab in the assigned genes

sQTLs in the Study-Locus page (from Jeremy's ppt)

[ ] Add as rows in heatmap and in table

[ ] Add as rows in the credible set overlap

[ ] Incorporated into L2G scores, but no visible change

Acceptance tests

How do we know the task is complete?

When sQTL data are integrated into the Genetics Portal
When all relevant tables and tabs clearly display sQTL data to users

d0choa commented 2 years ago

@buniello and I made another pass over the spec to fill some gaps.

Here, we have used the production API and queries extracted from the production FE to try to explain the changes that we might need. In order to complete all the UI changes specified above, we need to expose the new data that @JarrodBaker has processed.

Metadata about the studies

On every variant page, there is a query that retrieves metadata about the studies (independently of the variant being queried). This static content is currently served through the API. The required information seems to be loaded from v2g_display_labels.json.

[x] @buniello we will need a new record with the metadata about the sQTLs

API query:

query VariantPageQuery {
  genesForVariantSchema {
    qtls {
      id
      sourceId
      sourceLabel
      sourceDescriptionOverview
      sourceDescriptionBreakdown
      pmid
      tissues {
        id
        name
      }
    }
    intervals {
      sourceId
      sourceLabel
      sourceDescriptionOverview
      sourceDescriptionBreakdown
      pmid
      tissues {
        id
        name
      }
    }
    functionalPredictions {
      id
      sourceId
      sourceLabel
      sourceDescriptionOverview
      sourceDescriptionBreakdown
      pmid
      tissues {
        id
        name
      }
    }
    distances {
      id
      sourceId
      sourceLabel
      sourceDescriptionOverview
      sourceDescriptionBreakdown
      pmid
      tissues {
        id
        name
      }
    }
  }
}

API response:

{
  "data": {
    "genesForVariantSchema": {
      "qtls": [
        {
          "id": "pqtl",
          "sourceId": "pqtl",
          "sourceLabel": "pQTL (Sun, 2018)",
          "sourceDescriptionOverview": "Summary of evidence linking this variant to protein abundance in blood plasma",
          "sourceDescriptionBreakdown": "Evidence linking this variant to protein abundance in Sun *et al.* (2018) pQTL data",
          "pmid": "PMID:29875488",
          "tissues": [
            {
              "id": "FOLKERSEN_2020-UBERON_0001969",
              "name": "Folkersen 2020-uberon 0001969"
            },
...

We need to find and complete this data. It's likely to be an input of the data joining step. It might have been completed already.

Where we think the metadata of the studies is stored: https://github.com/opentargets/genetics-api/blob/master/resources/v2g_display_labels.json
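A quick way to check that a new sQTL record is complete before it goes into the labels file is to compare it against the fields the schema query asks for. This is only a minimal sketch: the field names come from the `qtls` selection in the query above, while `missing_metadata_fields`, the `"sqtl"` id and the draft record are hypothetical and not part of the API.

```python
from __future__ import annotations

# Field names copied from the `qtls` selection of the schema query above.
REQUIRED_FIELDS = (
    "id",
    "sourceId",
    "sourceLabel",
    "sourceDescriptionOverview",
    "sourceDescriptionBreakdown",
    "pmid",
    "tissues",
)


def missing_metadata_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# pQTL record as shown in the API response above (only the first tissue is visible there).
pqtl_record = {
    "id": "pqtl",
    "sourceId": "pqtl",
    "sourceLabel": "pQTL (Sun, 2018)",
    "sourceDescriptionOverview": "Summary of evidence linking this variant to protein abundance in blood plasma",
    "sourceDescriptionBreakdown": "Evidence linking this variant to protein abundance in Sun *et al.* (2018) pQTL data",
    "pmid": "PMID:29875488",
    "tissues": [
        {"id": "FOLKERSEN_2020-UBERON_0001969", "name": "Folkersen 2020-uberon 0001969"}
    ],
}

# Hypothetical, deliberately incomplete draft; the "sqtl" id is an assumption.
sqtl_draft = {"id": "sqtl", "sourceId": "sqtl", "sourceLabel": "sQTL"}

print(missing_metadata_fields(pqtl_record))
print(missing_metadata_fields(sqtl_draft))
```

The complete pQTL record reports no missing fields, while the draft lists the descriptions, `pmid` and `tissues` that still need to be filled in.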

QTL data

The query for the sQTLs seems no different than for other QTLs. We expect data should flow without any extra API changes. For an example with eQTL data see:

query VariantPageQuery {
  genesForVariant(variantId: "1_154453788_C_T") {
    gene {
      id
      symbol
    }
    overallScore
    qtls {
      sourceId
      aggregatedScore
      tissues {
        tissue {
          id
          name
        }
        quantile
        beta
        pval
      }
    }
    intervals {
      sourceId
      aggregatedScore
      tissues {
        tissue {
          id
          name
        }
        quantile
        score
      }
    }
    functionalPredictions {
      sourceId
      aggregatedScore
      tissues {
        tissue {
          id
          name
        }
        maxEffectLabel
        maxEffectScore
      }
    }
    distances {
      typeId
      sourceId
      aggregatedScore
      tissues {
        tissue {
          id
          name
        }
        distance
        score
        quantile
      }
    }
  }
}

In the response to this query, IL6R has eQTL information. We expect the sQTL data to come in a similar format, which would unblock the UI development.

I think this info should be enough to start the process but you might need to ping @JarrodBaker and/or @carcruz to resolve specific issues.
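To verify that sQTL data actually flows through the existing query once the development images are available, something like the following could be run against a `genesForVariant` response. It is a sketch under assumptions: the response shape is inferred from the query above, the `mock_response` values are placeholders rather than real API output, and `"sqtl"` as a `sourceId` is a guess.

```python
from __future__ import annotations


def genes_with_qtl_source(response: dict, source_id: str) -> list[str]:
    """Gene symbols whose `qtls` list has an entry with the given sourceId."""
    rows = response.get("data", {}).get("genesForVariant", [])
    return [
        row["gene"]["symbol"]
        for row in rows
        if any(qtl["sourceId"] == source_id for qtl in row.get("qtls", []))
    ]


# Placeholder response shaped like the query above; ids and scores are made up.
mock_response = {
    "data": {
        "genesForVariant": [
            {
                "gene": {"id": "GENE_ID_1", "symbol": "IL6R"},
                "qtls": [{"sourceId": "eqtl", "aggregatedScore": 0.5, "tissues": []}],
            },
            {
                "gene": {"id": "GENE_ID_2", "symbol": "OTHER_GENE"},
                "qtls": [],
            },
        ]
    }
}

print(genes_with_qtl_source(mock_response, "eqtl"))
print(genes_with_qtl_source(mock_response, "sqtl"))  # assumed sourceId for sQTLs
```

An empty list for the sQTL source on a variant known to have sQTL evidence would suggest the data is not yet exposed.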

buniello commented 2 years ago

Added a new record with the metadata about the sQTLs in v2g_display_labels.json - this includes the text for the tooltip.

buniello commented 2 years ago

Discussing with @remo87:

For a fixed Gene/ENSEMBLID, the sQTL results should be aggregated in the same row for different chr_localisation_clusters (curly bracket) -- similar to aggregating for eQTLs in row n1 below: image (7).png

NB: The API returns a separate result for each Chr_cluster_gene combination - the aggregation happens in the FE.

image (8).png

[ ] The API response above also returns the data for the hover text that needs to be shown on each dot in the heatmap (example below). In addition, we will need to add the chr_cluster junction details, taken from the field:

"phenotypeId": "chr1^54605209^54607134^clu_47022^ENSG00000162390"



![Screenshot 2022-07-27 at 10.09.49.png](https://images.zenhubusercontent.com/5ef1da283f096f8317c9ca44/1a7caa1f-6881-45ba-82b5-5ad8a38c561d)

buniello commented 2 years ago

Latest update on this task: @xyg123 is currently investigating whether the prototype shown above covers all use cases for sQTLs, e.g. does each cluster only map to one gene in our dataset? Can one gene host multiple clusters? If so, how do we display these odd datasets?

buniello commented 2 years ago

From @xyg123: Here are the results for sQTL merging. I think merging should be fine for now, although we should revisit this data when we add additional sources in the future.

Showing: only best/most significant cluster within same junction.
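One possible reading of this rule, sketched for the FE-side aggregation: for each gene and tissue, keep only the top-ranked splice cluster. The row field names (`geneId`, `tissue`, `spliceId`, `h4`) and ranking by highest H4 are assumptions made for illustration; the thread does not fix which metric defines "most significant", so the ranking function is left as a parameter.

```python
from __future__ import annotations

from typing import Callable


def best_cluster_per_gene_tissue(
    rows: list[dict],
    score: Callable[[dict], float] = lambda row: row["h4"],  # assumed ranking metric
) -> list[dict]:
    """Keep the highest-scoring row for every (geneId, tissue) pair."""
    best: dict[tuple[str, str], dict] = {}
    for row in rows:
        key = (row["geneId"], row["tissue"])
        if key not in best or score(row) > score(best[key]):
            best[key] = row
    return list(best.values())


# Illustrative rows; only the first spliceId comes from the thread, the rest is made up.
rows = [
    {"geneId": "ENSG00000162390", "tissue": "tissue_a",
     "spliceId": "chr1^54605209^54607134^clu_47022", "h4": 0.91},
    {"geneId": "ENSG00000162390", "tissue": "tissue_a",
     "spliceId": "made_up_cluster_1", "h4": 0.40},
    {"geneId": "ENSG00000162390", "tissue": "tissue_b",
     "spliceId": "made_up_cluster_2", "h4": 0.75},
]

for row in best_cluster_per_gene_tissue(rows):
    print(row["geneId"], row["tissue"], row["spliceId"], row["h4"])
```

Passing a different `score` function (for example one based on a p-value, negated so that smaller is better) would switch the ranking without touching the grouping logic.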

buniello commented 2 years ago

Discussed with the team: the API will be slightly modified so that the sQTL data can conform to the schema. This means that the current phenotypeId field will be split into two columns:

phenotypeId: ENSGID
spliceId: chr1^54605209^54607134^clu_47022

Hovering text: log2(H4/H3):, H3:, H4:, QTLbeta:, spliceId
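For illustration, the split could look roughly like this. The sketch assumes the gene ID is always the last `^`-separated element and starts with `ENSG`, as in the example value earlier in the thread; `split_phenotype_id` is a hypothetical helper, not existing pipeline code.

```python
from __future__ import annotations


def split_phenotype_id(phenotype_id: str) -> tuple[str, str]:
    """Split 'chr^start^end^cluster^ENSGID' into (gene_id, splice_id)."""
    # rpartition splits on the LAST '^', so the junction coordinates stay together.
    splice_id, separator, gene_id = phenotype_id.rpartition("^")
    if not separator or not gene_id.startswith("ENSG"):
        raise ValueError(f"Unexpected phenotypeId format: {phenotype_id!r}")
    return gene_id, splice_id


gene_id, splice_id = split_phenotype_id(
    "chr1^54605209^54607134^clu_47022^ENSG00000162390"
)
print(gene_id)    # ENSG00000162390
print(splice_id)  # chr1^54605209^54607134^clu_47022
```

Raising on unexpected input is a design choice here so that malformed IDs surface early rather than ending up as a wrong gene row in the tables.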

ireneisdoomed commented 2 years ago

@buniello Just a very minor opinion as a user. What do you think if the tooltip showed the metrics in separate rows instead of having them separated by commas? Btw, for eQTL and pQTLs the metrics are just separated by a space.

buniello commented 2 years ago

@ireneisdoomed this is a good observation! - You suggest having spliceId in a separate row, right? We have commas also for eQTLs and pQTLs, I think? A hovering example from other QTLs is in one of the screenshots above.

ireneisdoomed commented 2 years ago

@buniello No, I was proposing for the hovering text to be:

log2(H4/H3): {value}
H3: {value}
H4: {value}
QTL beta: {value}
spliceId: {value}

My main argument is that the splice ID is a fairly long string. However, I don't know if with a longer hover text we would frequently obstruct the visibility of other circles.
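The two layouts under discussion can be compared with a small formatter. This is just a sketch: `format_sqtl_tooltip` and its parameter names are hypothetical, and the numeric values in the example are made up; only the labels and the example spliceId come from this thread.

```python
def format_sqtl_tooltip(log2_h4_h3, h3, h4, qtl_beta, splice_id, multiline=True):
    """Build the hover text, one metric per row or comma-separated on one line."""
    parts = [
        f"log2(H4/H3): {log2_h4_h3}",
        f"H3: {h3}",
        f"H4: {h4}",
        f"QTL beta: {qtl_beta}",
        f"spliceId: {splice_id}",
    ]
    separator = "\n" if multiline else ", "
    return separator.join(parts)


# Made-up metric values, for layout comparison only.
example = dict(
    log2_h4_h3=3.2,
    h3=0.1,
    h4=0.9,
    qtl_beta=-0.25,
    splice_id="chr1^54605209^54607134^clu_47022",
)

print(format_sqtl_tooltip(**example))
print()
print(format_sqtl_tooltip(**example, multiline=False))
```

Printing both variants side by side makes it easier to judge how much the long spliceId widens the single-line version compared with the taller multi-row one.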

buniello commented 2 years ago

Discussed with @remo87 already:

[ ] all tables displaying sQTL data (locus page/coloc table, gene page/coloc) should also show just the gene "id" without the cluster/splice "Id" for each independent row (representing different tissues). -- The 'splice id' (not hugely meaningful to users anyway) will only be displayed in the heatmap tooltips for each point in the aggregation.

Example below of locus page/gene prioritisation coloc table: Screenshot 2022-08-25 at 09.48.11.png
