opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Missing Genetics Portal evidence on Platform evidence page #1866

Closed DSuveges closed 1 year ago

DSuveges commented 2 years ago

Let’s find associations for CWC22. If we narrow the list by searching for Muscular dystrophy, we see there is an association supported by Genetic associations. However, clicking through to view the supporting evidence leads to a completely empty evidence page: the evidence is not loaded.

This bug is a direct consequence of the data loading/overflow issue reported in #1687 and is caused by very large odds ratio values. This is what the evidence looks like:

| target | disease | OR | OR_low | OR_up | beta | beta_low | beta_up | pv_exp | pv_mantissa | min_or |
|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000163510 | Orphanet_98473 | 3.87053e+33 | 1.71153e+28 | 8.75299e+38 | nan | nan | nan | -35 | 9.639 | 8.75299e+38 |
| ENSG00000144331 | Orphanet_98473 | 3.87053e+33 | 1.71153e+28 | 8.75299e+38 | nan | nan | nan | -35 | 9.639 | 8.75299e+38 |

It is important to note that these odds ratio values are not unreasonably large: they are well within the precision of a Python/Spark double, and they are represented properly in the schema:

 |-- oddsRatio: double (nullable = true)
 |-- oddsRatioConfidenceIntervalLower: double (nullable = true)
 |-- oddsRatioConfidenceIntervalUpper: double (nullable = true)

So this issue is not picked up by the evidence validation or the ETL, and there was no problem in calculating the evidence score either. The order of magnitude of these values seems to be close to some representation limit in Elasticsearch (possibly the 2^31 bound on integers), which suggests there might be an issue with how Elasticsearch is configured. This needs further clarification, but the data itself appears to be correct.
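As a quick sanity check (my illustration, not part of the ETL), the largest value in the rows above fits comfortably within the range of a 64-bit IEEE double, which extends to roughly 1.8e+308:

```python
import sys

# Largest value from the evidence rows above (min_or / OR_up)
min_or = 8.75299e38

# A 64-bit IEEE double tops out around 1.797e+308, so this value is
# nowhere near the representable limit of the schema's `double` type:
assert min_or < sys.float_info.max
print(sys.float_info.max)  # 1.7976931348623157e+308
```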

DSuveges commented 2 years ago

@cmalangone Could you please follow up on this issue? The data seems right and passes the ETL correctly, but it seems the evidence gets lost in Elasticsearch.

cmalangone commented 2 years ago

@DSuveges I'll keep you posted

mbdebian commented 2 years ago

@DSuveges , it looks like this is still an open issue. I was wondering which final approach was chosen: either to tackle it from the data point of view, as opentargets/issues#1687 suggests, or from the backend / frontend (software) point of view.

DSuveges commented 2 years ago

@mbdebian Yes, the issue is still there. We haven't taken any steps to address it at the data level; I think it is reasonable to assume that a number on the order of 1e±50 should be representable. However, it was only my assumption that the problem is an over/underflow in GraphQL. Further investigation is required to validate this hypothesis.

mbdebian commented 2 years ago

Is there any update on this? It looks like this issue is lingering in our backlog, and we may have to just close it.

mbdebian commented 2 years ago

@DSuveges , would you know whether there's any update on this? May I close this issue?

JarrodBaker commented 1 year ago

The issue we have is that we aren't specifying an index schema (mapping) for data ingestion, so Elasticsearch makes a best guess at the shape of the data. As almost all of the entries fit within the range of a 32-bit float, it is creating the field with that type.

We can inspect the index settings with the query `<es>/evidence_datasource_ot_genetics_portal/_mapping/field/odds*`, which shows:

{
  "evidence_datasource_ot_genetics_portal" : {
    "mappings" : {
      "oddsRatio" : {
        "full_name" : "oddsRatio",
        "mapping" : {
          "oddsRatio" : {
            "type" : "float"
          }
        }
      },
      "oddsRatioConfidenceIntervalUpper" : {
        "full_name" : "oddsRatioConfidenceIntervalUpper",
        "mapping" : {
          "oddsRatioConfidenceIntervalUpper" : {
            "type" : "float"
          }
        }
      },
      "oddsRatioConfidenceIntervalLower" : {
        "full_name" : "oddsRatioConfidenceIntervalLower",
        "mapping" : {
          "oddsRatioConfidenceIntervalLower" : {
            "type" : "float"
          }
        }
      }
    }
  }
}

We need the type on these fields to be `double`, as specified in the Elasticsearch numeric field type documentation.

When those fields are configured as `double`, inserting the data from @DSuveges' example works correctly:

PUT test_evidence_gen/
{
  "mappings": {
    "properties": {
      "oddsRatio": {
        "type": "double"
      },
      "oddsRatioConfidenceIntervalUpper": {
        "type": "double"
      },
      "oddsRatioConfidenceIntervalLower": {
        "type": "double"
      }
    }
  }
}

POST /test_evidence_gen/_doc/
{
  "oddsRatio": "3.87053e+33",
  "oddsRatioConfidenceIntervalUpper": "1.71153e+28", 
  "oddsRatioConfidenceIntervalLower": "8.75299e+38"
}

GET test_evidence_gen/_search
{
  "query": {
    "match_all": {}
  }
}

The document is created as expected and is retrievable.

I'm making a PR now @mbdebian to resolve this for the next release.