opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Random arithmetic_exception - "long overflow" errors #5713

Open soltmar opened 1 year ago

soltmar commented 1 year ago

Hi,

I'm randomly receiving the following errors in OpenSearch responses:

"_shards" : {
    "total" : 5,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 4,
    "failures" : [
      {
        "shard" : 3,
        "index" : "my_index",
        "node" : "Ed3vRYdSRdaHyhkzhtuXyQ",
        "reason" : {
          "type" : "arithmetic_exception",
          "reason" : "long overflow"
        }
      }
    ]
  },
...

With the sort request below:

GET my_index/_search
{
        "track_total_hits": true,
        "query":
        {
            "bool":
            {
                "must_not":
                [
                    {
                        "term":
                        {
                            "status_id": 8
                        }
                    }
                ]
            }
        },
        "size": "50",
        "sort":
        {
            "reminder_date": {
              "missing": "_last",
              "order": "asc"
            }
        }
}

To give it a little bit of context:

When I run the query above, OpenSearch sometimes returns no errors and sometimes fails with "arithmetic_exception".

Running the same query gives a different number of results and failing shards almost every time.

Let me know if you need any further info. Thanks

dblock commented 1 year ago

Do you have an error stack in the logs? A complete error response?

soltmar commented 1 year ago

I'm using OpenSearch on AWS. I have error logs enabled (sent to CloudWatch), but there are no logs related to that search query.

Are you aware of any way to get these logs on AWS? Or maybe in the query result itself?

By the way, whenever the "failures" key is present I still get some hits back, but not the full number of them. Also, when missing is set to _first in the sort element everything works fine, so it looks like the problem is related only to the _last flag. I did notice that the sort element on results where reminder_date is null has the value below (not sure if it helps or not):

....
"sort" : [
          -9223372036854775808
        ],
...

When it works, I get this:

{
  "took" : 103,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 74090,
      "relation" : "eq"
    },
...
}

And this when some shards are failing:

{
  "took" : 53,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 3,
    "failures" : [
      {
        "shard" : 1,
        "index" : "my_index",
        "node" : "YCEI0PpITD6h-xlV3udp-g",
        "reason" : {
          "type" : "arithmetic_exception",
          "reason" : "long overflow"
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 29482,
      "relation" : "eq"
    },
    ....
   }
}

soltmar commented 1 year ago

I think I may have something. Since I set allow_partial_search_results=false, I've started to receive logs:

Caused by: java.lang.ArithmeticException: long overflow
    at __PATH__(Math.java:949)
    at __PATH__(Math.java:925)
    at __PATH__(Instant.java:1236)
    at org.opensearch.index.mapper.DateFieldMapper$Resolution$1.convert(DateFieldMapper.java:106)
    at org.opensearch.index.mapper.DateFieldMapper$DateFieldType.parseToLong(DateFieldMapper.java:510)
    at org.opensearch.index.mapper.DateFieldMapper$DateFieldType.isFieldWithinQuery(DateFieldMapper.java:548)
    at org.opensearch.search.sort.FieldSortBuilder.isBottomSortShardDisjoint(FieldSortBuilder.java:481)
    at org.opensearch.search.internal.ShardSearchRequest$RequestRewritable.rewrite(ShardSearchRequest.java:549)
    at org.opensearch.search.internal.ShardSearchRequest$RequestRewritable.rewrite(ShardSearchRequest.java:531)
    at org.opensearch.index.query.Rewriteable.rewrite(Rewriteable.java:83)
    at org.opensearch.search.SearchService.canMatch(SearchService.java:1323)
    at org.opensearch.search.SearchService$2.onResponse(SearchService.java:472)
    ... 121 more

I have also found this on ES: https://github.com/elastic/elasticsearch/issues/52396 (I think it may be a similar problem).

dblock commented 1 year ago

That's definitely something! I would continue narrowing this down to a 100% repro, maybe on a local instance with the same mapping. At the same time, I would try to write a unit test that calls whatever is at DateFieldMapper.java:106 with the value you see in the sort.
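
As a starting point for such a test, here is a minimal standalone sketch that uses only the JDK, not OpenSearch. It rests on an assumption: the JDK frames in the trace are elided, and the sketch guesses that the call at Instant.java:1236 is Instant.toEpochMilli(), which can throw ArithmeticException with the message "long overflow" when the millisecond value does not fit in a long. The class name LongOverflowSketch and the helper overflowMessage are hypothetical names made up for illustration. The sketch does not establish which value actually reaches DateFieldMapper in the failing request.

```java
import java.time.Instant;

// Hypothetical helper class for illustration; not part of OpenSearch.
class LongOverflowSketch {

    /** Returns the exception message if toEpochMilli() overflows, or null if it succeeds. */
    static String overflowMessage(Instant instant) {
        try {
            instant.toEpochMilli();
            return null;
        } catch (ArithmeticException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The sort value reported for documents with a null reminder_date.
        long reportedSortValue = -9223372036854775808L;
        System.out.println("equals Long.MIN_VALUE: " + (reportedSortValue == Long.MIN_VALUE));

        // An ordinary timestamp converts to epoch milliseconds without trouble.
        System.out.println("Instant.EPOCH: " + overflowMessage(Instant.EPOCH));

        // Instants far outside the range of a long in milliseconds overflow.
        System.out.println("Instant.MAX: " + overflowMessage(Instant.MAX));
        System.out.println("Instant.MIN: " + overflowMessage(Instant.MIN));

        // Assumption: whether the reported value itself round-trips may depend
        // on the JDK version, so no expected result is stated here.
        System.out.println("reported value: "
                + overflowMessage(Instant.ofEpochMilli(reportedSortValue)));
    }
}
```

If the extreme instants reproduce the same message as the shard failures, the next step would be to feed the same values through the real DateFieldMapper conversion in an OpenSearch unit test.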