Closed buniello closed 1 year ago
TL;DR The issue stems from a design choice on how we define the top loci vs the credible sets. This problem is more acute in FinnGen.
In the Genetics Portal, we create a region in the genome where a particular peak was found. Finemapping also creates the concept of a credible set, which is a more granular region potentially splitting independent signals based on LD. Now comes the interesting part. The process we have defines as top loci (lead) the variant with the lowest posterior probability in the region. However, the credible set lead variant is the variant with the lowest posterior probability in a potentially more granular region, the credible set. The two definitions of lead agree very often, but they don't have to.
The problem described emerged because we define top loci based on the region, but we calculate colocalisation on lead variants from the credible set. This is an overall inconsistency, but it's not dramatic: 2k/600k leads are in the credible set but not in the top loci table. Unless you focus on FinnGen. Because FinnGen uses SuSiE for finemapping, it does not assume one single causal variant per region and splits regions into different credible sets more often. This means that we have a much higher proportion of disagreeing lead variants. Around 1k out of 6k FinnGen lead variants according to the credible sets are ignored in V2D, because they are leads of a credible set but not top loci variants.
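The two definitions of lead can be illustrated with a minimal Python sketch. The variant records, the field layout and the use of p-value as the ranking statistic are assumptions made for illustration only; they are not the pipeline's actual schema or logic.

```python
# Hypothetical variants: (variant_id, region, credible_set_id, p_value).
variants = [
    ("v1", "regionA", 1, 1e-30),
    ("v2", "regionA", 1, 1e-12),
    ("v3", "regionA", 2, 1e-20),
    ("v4", "regionA", 2, 1e-9),
    ("v5", "regionB", 1, 1e-15),
]

def leads_by(key_fn, rows):
    """Return the id of the best-ranked (lowest p-value) variant per group."""
    best = {}
    for row in rows:
        key = key_fn(row)
        if key not in best or row[3] < best[key][3]:
            best[key] = row
    return {key: row[0] for key, row in best.items()}

# Top loci: one lead per region.
top_loci = set(leads_by(lambda r: r[1], variants).values())
# Credible-set leads: one lead per (region, credible set) pair.
credset_leads = set(leads_by(lambda r: (r[1], r[2]), variants).values())

# Credible-set leads that never make it into the top loci table.
missing = sorted(credset_leads - top_loci)
print(missing)  # ['v3']
```

In this toy data, regionA is split into two credible sets, so the lead of the second set (v3) is a credible-set lead without being a top locus. That is the kind of disagreement described above.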
For sure, the two approaches are based on two different assumptions. But it leaves us with two questions to answer before we can move forward with the streamlining of the release process:
(CS - credible set identifier)
@MayaGhoussaini what is your opinion on the above questions?
After the discussion with @MayaGhoussaini there were a few things we agreed upon:
The question was to double-check how prevalent this discrepancy is in our data.
+------------+---------------+-----------+
|study_source|missing_credset|study_count|
+------------+---------------+-----------+
|GWAS_CATALOG| 10| 9|
| FINNGEN| 1585| 320|
+------------+---------------+-----------+
It seems that 1,595 credible sets from 329 studies do not have top loci. As expected, this mostly occurs in FinnGen.
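As a quick check of the totals, the counts in the table can be summed directly. This is only a sketch using the figures shown above; the dictionary layout is an assumption for illustration.

```python
# Counts copied from the table above: source -> (missing_credset, study_count).
counts = {
    "GWAS_CATALOG": (10, 9),
    "FINNGEN": (1585, 320),
}

total_credsets = sum(credsets for credsets, _ in counts.values())
total_studies = sum(studies for _, studies in counts.values())
# Fraction of the missing credible sets that come from FinnGen.
finngen_share = counts["FINNGEN"][0] / total_credsets

print(total_credsets, total_studies)  # 1595 329
print(f"{finngen_share:.1%}")         # 99.4%
```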
Selected study:
gs://finngen-public-data-r6/finemapping/full/finngen_R6_C3_MALE_GENITAL.SUSIE.snp.bgz
This chromosome has a number of credible sets, and a number of them are missing from the top-loci table. Data aggregation steps:
- lead_variant_id to join top-loci table with finemapping table.
- lead_variant_id to join with source data.

+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|region |credible_set_id|lead_id |tag_variant_count|is_in_toploci|cred_set_minimum_p-value|
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
|chr8:123638420-128340365|4 |8_127401060_G_T|3 |false |1.14E-25 |
|chr8:123638420-128340365|1 |8_127091872_A_G|1 |true |6.69E-88 |
|chr8:123638420-128340365|3 |8_127064901_G_A|1 |false |3.33E-41 |
|chr8:123638420-128340365|2 |8_127506309_C_A|10 |false |9.02E-77 |
|chr8:123638420-128340365|7 |8_127528531_C_G|1 |false |2.08E-27 |
|chr8:123638420-128340365|8 |8_127311574_C_T|9 |false |4.57E-25 |
|chr8:123638420-128340365|6 |null |null |null |7.68E-8 |
|chr8:123638420-128340365|-1 |null |null |null |2.53E-88 |
|chr8:123638420-128340365|5 |8_127015709_G_A|7 |false |2.9E-21 |
|chr8:123638420-128340365|9 |null |null |null |2.54E-4 |
|chr8:128340365-128679427|1 |8_128630605_C_A|3 |true |3.87E-11 |
|chr8:128340365-128679427|-1 |null |null |null |1.29E-6 |
|chr8:22149459-25149459 |1 |8_23649459_A_G |28 |true |7.33E-22 |
|chr8:22149459-25149459 |-1 |null |null |null |5.61E-21 |
|chr8:22149459-25149459 |2 |null |null |null |1.36E-13 |
+------------------------+---------------+---------------+-----------------+-------------+------------------------+
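The table can be summarised per region with a short Python sketch. The rows are copied from the table above; the three labels (`in_toploci`, `missing`, `no_lead`) are names introduced here for illustration, where `no_lead` simply marks rows with a null `lead_id`.

```python
from collections import Counter

# (region, credible_set_id, lead_id, is_in_toploci), copied from the table.
R1, R2, R3 = ("chr8:123638420-128340365", "chr8:128340365-128679427",
              "chr8:22149459-25149459")
rows = [
    (R1, 4, "8_127401060_G_T", False), (R1, 1, "8_127091872_A_G", True),
    (R1, 3, "8_127064901_G_A", False), (R1, 2, "8_127506309_C_A", False),
    (R1, 7, "8_127528531_C_G", False), (R1, 8, "8_127311574_C_T", False),
    (R1, 6, None, None), (R1, -1, None, None),
    (R1, 5, "8_127015709_G_A", False), (R1, 9, None, None),
    (R2, 1, "8_128630605_C_A", True), (R2, -1, None, None),
    (R3, 1, "8_23649459_A_G", True), (R3, -1, None, None),
    (R3, 2, None, None),
]

# Hypothetical labels for the three states of is_in_toploci.
LABELS = {True: "in_toploci", False: "missing", None: "no_lead"}

summary = {}
for region, _credset_id, _lead_id, in_toploci in rows:
    summary.setdefault(region, Counter())[LABELS[in_toploci]] += 1

for region, counter in summary.items():
    print(region, dict(counter))
```

For the first region this gives one credible set whose lead is in the top-loci table, six whose lead is missing from it, and three rows without a lead id.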
Let's see the region chr8:123638420-128340365:
-1).
Let's see region chr8:128340365-128679427:
Let's see region chr8:22149459-25149459:
(Notebook here)
I think I have extensively explored the root causes of this issue and could identify the problem. However, this won't be an issue with the next iteration of the pipelines, so I'm closing the ticket.
The table "Associated studies: Colocalisation analysis" lists which studies have evidence of colocalisation with molecular QTLs for a given gene (e.g. APOE).
We have noticed that (in both dev and production environments) the gene prioritisation tab linking out from that table leads to a page with no data for some FINNGEN studies. This happens when:
Working Example -- The 'gene prioritisation' links from FINNGEN_R5_RX_STATIN studies on the APOE page below crash: https://genetics.opentargets.org/gene/ENSG00000130203
Page 981-986 19_44935906_C_G
Coloc results listing FINNGEN_R5_RX_STATIN as the left study and APOE as the right have been queried. 5 results have the same lead variant on each side. This case crashes in the FE when clicking on the gene prioritisation link.
Possible reason: That variant (19_44935906_C_G) is not available in V2D.
Though it is available in FINNGEN R5
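A check along these lines could flag the affected rows. This is a sketch under stated assumptions: the row layout, the second coloc row and the content of `v2d_leads` are hypothetical; only the study id, gene and variant 19_44935906_C_G come from the example above.

```python
# Hypothetical coloc rows: (left_study, left_variant, right_gene, right_variant).
coloc = [
    ("FINNGEN_R5_RX_STATIN", "19_44935906_C_G", "APOE", "19_44935906_C_G"),
    ("HYPOTHETICAL_STUDY", "1_1000_A_T", "APOE", "1_2000_G_C"),
]

# Hypothetical set of lead variants present in V2D.
v2d_leads = {"1_1000_A_T"}

# Rows whose left lead variant has no V2D entry: candidates for the empty page.
broken_links = [row for row in coloc if row[1] not in v2d_leads]

# Rows where both sides share the same lead variant.
same_lead = [row for row in coloc if row[1] == row[3]]

for study, variant, gene, _ in broken_links:
    print(f"{study} -> {gene}: lead {variant} not found in V2D")
```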
Action items (as also discussed with David and on Slack channels)
[ ] Data team to dig into the FINNGEN data at different processing stages to identify why such top loci are not available in V2D.
[ ] BE to create a right variant API endpoint to the query for the Associated studies: Colocalisation analysis on gene page (data is already available), so that
[ ] FE can add a new QTL lead variant column/header to the table (a sub-ticket will also be created to cover this task).
Current query for the Associated studies: Colocalisation analysis on gene page: