opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Incorrect overlapping associations #3191

Closed ireneisdoomed closed 5 months ago

ireneisdoomed commented 5 months ago

**Describe the bug**
There are two significant issues in the dataset, related to the handling of overlapping associations and to the assignment of posterior probabilities. Specifically, the dataset incorrectly identifies overlapping associations where no common variant exists in the locus, and it erroneously assigns a `right_posteriorProbability` value of 1.0 to a variant that is not present in the locus of `rightStudyLocusId`.

**Observed behaviour**
I looked at the credible set dataset and focused on these two associations:

  1. One of them has one variant in its locus; the other has none.
    
    In [34]: subset_cs.df.filter(f.col("studyLocusId") == 512473136292631516).select("locus").show(truncate=False)
    +----------------------------------------------------------------------------------------------------+
    |locus                                                                                               |
    +----------------------------------------------------------------------------------------------------+
    |[{true, true, null, 1.0, 1_156645322_A_G, null, null, null, 0.9999996753603043, 0.9999999999999973}]|
    +----------------------------------------------------------------------------------------------------+

    In [35]: subset_cs.df.filter(f.col("studyLocusId") == -3309244191177514893).select("locus").show(truncate=False)
    +-----+
    |locus|
    +-----+
    +-----+


  2. When we look for overlapping variants, we find `1_156645322_A_G`:

    In [37]: subset_cs.find_overlaps(studies).df.filter(f.col("tagVariantId") == "1_156645322_A_G").show(truncate=False)
    24/01/11 12:00:01 WARN CacheManager: Asked to cache already cached data.
    24/01/11 12:00:01 WARN CacheManager: Asked to cache already cached data.
    +------------------+--------------------+----------+---------------+----------------------------------------------------------+
    |leftStudyLocusId  |rightStudyLocusId   |chromosome|tagVariantId   |statistics                                                |
    +------------------+--------------------+----------+---------------+----------------------------------------------------------+
    |512473136292631516|-3309244191177514893|1         |1_156645322_A_G|{null, 1.0, null, null, null, null, 1.0, null, null, null}|
    +------------------+--------------------+----------+---------------+----------------------------------------------------------+



There are 2 issues here:
- The two associations are reported as overlapping even though their loci have no variant in common.
- `statistics.right_posteriorProbability` is 1.0, a made-up value, because this variant is not in the locus of `rightStudyLocusId`.

**Expected behaviour**
1. Overlapping associations should only be identified when there is a common variant present in both loci.
2. The `statistics.right_posteriorProbability` should accurately reflect the probability of the variant being present in both loci, and should not default to 1.0 when the variant is absent.
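
The two expectations above can be sketched in plain Python. This is only an illustration, not the pipeline's `find_overlaps`: the function name, the `Locus` type, and the row layout are hypothetical. It also assumes, in line with the follow-up comment in this thread, that once a pair shares a variant, variants seen on only one side are kept with a null statistic on the other side.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a credible set: variantId -> posterior probability.
Locus = Dict[str, float]
# Hypothetical row layout: (tagVariantId, left_pp, right_pp).
OverlapRow = Tuple[str, Optional[float], Optional[float]]


def find_overlap_rows(left: Locus, right: Locus) -> List[OverlapRow]:
    """Return overlap rows for a pair of loci (illustrative sketch only)."""
    # A pair overlaps only if at least one variant is present in both loci.
    if not left.keys() & right.keys():
        return []
    rows = []
    for variant in sorted(left.keys() | right.keys()):
        # dict.get returns None for a variant missing on one side, so an
        # absent posterior probability is never replaced by a default value.
        rows.append((variant, left.get(variant), right.get(variant)))
    return rows


# The reported case: one variant on the left, an empty locus on the right.
left_locus = {"1_156645322_A_G": 1.0}
empty_right_locus: Locus = {}
print(find_overlap_rows(left_locus, empty_right_locus))  # []
```

With an empty right-hand locus the pair yields no rows at all, so no `right_posteriorProbability` can be attached to `1_156645322_A_G`.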

ireneisdoomed commented 5 months ago

I am not able to reproduce the issue today, so it must have been an error of mine. The data has been regenerated: the locus for the second association is no longer empty, and `find_overlaps` now behaves as expected. In the reported pair, the overlapping variant is present in both loci and the statistics are correctly assigned. I also checked another pair, 5429145510817404460/5163827416381512276:

The overlaps for this pair contain 36 variants: 34 are common to both loci and so have statistics on both sides, and 2 are present in only one locus. I've added a semantic test for `StudyLocus.find_overlaps` so that changes in the logic are detected from now on (https://github.com/opentargets/genetics_etl_python/pull/407).
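
For readers who want to check overlap output by hand, a consistency check along these lines might help. It is a rough sketch and not the test added in the linked pull request: `check_overlap_rows`, the `Locus` type, and the row layout are hypothetical, and the probabilities in the usage lines are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a credible set: variantId -> posterior probability.
Locus = Dict[str, float]
# Hypothetical row layout: (tagVariantId, left_pp, right_pp).
OverlapRow = Tuple[str, Optional[float], Optional[float]]


def check_overlap_rows(rows: List[OverlapRow], left: Locus, right: Locus) -> List[str]:
    """Return a list of violations; an empty list means the rows are consistent."""
    problems = []
    # Rule 1: rows may exist only if the two loci share at least one variant.
    if rows and not left.keys() & right.keys():
        problems.append("pair reported as overlapping without a shared variant")
    # Rule 2: each side's statistic must match that side's locus, or be None
    # when the variant is absent from it.
    for variant, left_pp, right_pp in rows:
        if left_pp != left.get(variant):
            problems.append(f"{variant}: left posterior probability not backed by the left locus")
        if right_pp != right.get(variant):
            problems.append(f"{variant}: right posterior probability not backed by the right locus")
    return problems


# The originally reported row, checked against an empty right-hand locus.
reported = [("1_156645322_A_G", 1.0, 1.0)]
for problem in check_overlap_rows(reported, {"1_156645322_A_G": 1.0}, {}):
    print(problem)
```

Run against the originally reported row, the check flags both problems described in the issue: an overlap without a shared variant, and a right-hand posterior probability that the right-hand locus does not support.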