Closed DSuveges closed 1 year ago
@cmalangone Could you please follow up on this issue? The data seems right and passes the ETL correctly, but it seems the evidence got lost in Elasticsearch.
@DSuveges keep you posted
@DSuveges , it looks like this is still an open issue. I was wondering which final approach was chosen: either to tackle it from the data point of view, as opentargets/issues#1687 suggests, or from the backend / frontend (software) point of view.
@mbdebian Yes, the issue is still there. We haven't taken any steps to address it at the data level. I think it is reasonable to assume that a number on the order of 1e±50 is representable. However, it was only my assumption that the problem is an overflow/underflow in GraphQL. Validating this hypothesis requires further investigation.
Is there any update on this? It looks like this issue is lingering in our backlog, and we may have to just close it.
@DSuveges , would you know whether there's any update on this? May I close this issue?
The issue is that we aren't specifying an index schema for data ingestion, so ES makes a best guess as to the shape of the data. As almost all of the entries fit within the range of a 32-bit float, it creates the field with that type.
We can inspect the index settings with the query <es>/evidence_datasource_ot_genetics_portal/_mapping/field/odds*
which shows:
{
  "evidence_datasource_ot_genetics_portal" : {
    "mappings" : {
      "oddsRatio" : {
        "full_name" : "oddsRatio",
        "mapping" : {
          "oddsRatio" : {
            "type" : "float"
          }
        }
      },
      "oddsRatioConfidenceIntervalUpper" : {
        "full_name" : "oddsRatioConfidenceIntervalUpper",
        "mapping" : {
          "oddsRatioConfidenceIntervalUpper" : {
            "type" : "float"
          }
        }
      },
      "oddsRatioConfidenceIntervalLower" : {
        "full_name" : "oddsRatioConfidenceIntervalLower",
        "mapping" : {
          "oddsRatioConfidenceIntervalLower" : {
            "type" : "float"
          }
        }
      }
    }
  }
}
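For anyone who wants to run the same check across more fields, here is a rough Python sketch that walks a field-mapping response of the shape shown above and lists the fields typed as float. The function name `float_typed_fields` and the trimmed `sample` response are made up for illustration; this is not code from the ETL or the backend.

```python
from typing import List


def float_typed_fields(field_mapping_response: dict) -> List[str]:
    """Return 'index.field' names whose mapped type is 'float'."""
    found = []
    for index_name, index_body in field_mapping_response.items():
        for field_name, field_body in index_body.get("mappings", {}).items():
            full_name = field_body.get("full_name", field_name)
            # The leaf definition is nested under "mapping" -> <leaf name>.
            for leaf_def in field_body.get("mapping", {}).values():
                if leaf_def.get("type") == "float":
                    found.append(f"{index_name}.{full_name}")
    return found


# Trimmed, hypothetical response with one float field and one double field.
sample = {
    "evidence_datasource_ot_genetics_portal": {
        "mappings": {
            "oddsRatio": {
                "full_name": "oddsRatio",
                "mapping": {"oddsRatio": {"type": "float"}},
            },
            "someDoubleField": {
                "full_name": "someDoubleField",
                "mapping": {"someDoubleField": {"type": "double"}},
            },
        }
    }
}

print(float_typed_fields(sample))
```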
We need the type on these fields to be double, as specified in the number documentation. A 32-bit float tops out at roughly 3.4e38, so a value such as 8.75299e+38 does not fit.
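To make the range problem concrete, here is a minimal Python sketch that checks whether a value can be packed into an IEEE 754 single-precision float. The helper name `fits_float32` is made up for illustration, and the check only mirrors the 32-bit range; it does not reproduce what Elasticsearch does internally. The three values are the ones from @DSuveges' example.

```python
import math
import struct


def fits_float32(value: float) -> bool:
    """Return True if value is finite and within the 32-bit float range."""
    if not math.isfinite(value):
        return False
    try:
        # Packing as a 4-byte float raises OverflowError when out of range.
        struct.pack("<f", value)
    except OverflowError:
        return False
    return True


example = {
    "oddsRatio": 3.87053e+33,
    "oddsRatioConfidenceIntervalUpper": 1.71153e+28,
    "oddsRatioConfidenceIntervalLower": 8.75299e+38,
}

for name, value in example.items():
    print(name, fits_float32(value))
```

Only the 8.75299e+38 value falls outside the 32-bit range; all three are far below the double maximum of about 1.8e308.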
When those fields are configured as double, an insert using the data in @DSuveges' example works correctly:
PUT test_evidence_gen/
{
  "mappings": {
    "properties": {
      "oddsRatio": {
        "type": "double"
      },
      "oddsRatioConfidenceIntervalUpper": {
        "type": "double"
      },
      "oddsRatioConfidenceIntervalLower": {
        "type": "double"
      }
    }
  }
}

POST /test_evidence_gen/_doc/
{
  "oddsRatio": "3.87053e+33",
  "oddsRatioConfidenceIntervalUpper": "1.71153e+28",
  "oddsRatioConfidenceIntervalLower": "8.75299e+38"
}

GET test_evidence_gen/_search
{
  "query": {
    "match_all": {}
  }
}
The document is created as expected and is retrievable.
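If it helps, the mapping body from the PUT request above could be generated rather than hand-written. This is only a sketch: the helper `double_mapping` and the constant `ODDS_RATIO_FIELDS` are hypothetical names, and how the real index configuration is wired into the pipeline is not shown in this thread.

```python
import json

# Fields from this issue that need more than 32-bit range.
ODDS_RATIO_FIELDS = [
    "oddsRatio",
    "oddsRatioConfidenceIntervalUpper",
    "oddsRatioConfidenceIntervalLower",
]


def double_mapping(field_names):
    """Build an index-creation body that maps each field as 'double'."""
    return {
        "mappings": {
            "properties": {name: {"type": "double"} for name in field_names}
        }
    }


body = double_mapping(ODDS_RATIO_FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON matches the body of the PUT request used in the test above, so it can be pasted into the console as-is.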
I'm making a PR now @mbdebian to resolve this for the next release.
Let’s find associations for CWC22. If we narrow down the list by searching for Muscular dystrophy, we’ll see there is an association supported by Genetic associations. However, if we click on the field to see the evidence, the evidence page is completely empty. The supporting evidence is not loaded.
This bug is a direct consequence of the data loading/overflow issue reported here: #1687, and is caused by high odds ratio values. This is what the evidence looks like:
It is important to note that these odds ratio values are not too high. They are well within the precision of Python and are represented properly in the schema:
So this issue will not be picked up by the evidence validation or the ETL. There was also no problem in calculating the evidence score. It seems that this order of magnitude is somehow close to the limit of integer representation in Elasticsearch, which is 2^31. This suggests that there might be an issue with how Elasticsearch is configured, but this needs further clarification. The data itself seems to be right.