opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.67k stars 1.78k forks source link

[BUG] Slow query logs contain null date range #2054

Open shatp opened 2 years ago

shatp commented 2 years ago

Describe the bug Looking at the slow query logs from OpenSearch, we see that the range -> from/to values are null when queries are run from Quick select of Discover Tab.

To Reproduce Steps to reproduce the behavior:

  1. I replicated the setup with search slow logs with version Amazon ES7.9. [ issue is with all versions]
  2. Created an index with slow logs threshold defined as 0s (this setting is only for testing purposes, not to be configured in production clusters):
    PUT my_index/_settings
    {
    "index.search.slowlog.threshold.query.warn": "0s",
    "index.search.slowlog.threshold.query.info":"0s",
    }
  3. After searching from Discover tab for index pattern "m*", the slow log was logged as follows:
    
    [2022-01-22T00:13:10,910][WARN ][index.search.slowlog.query] [---] [my_index][4] took[643.1micros], took_millis[0], total_hits[1 hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[
    {
    "size": 500,
    "timeout": "60000ms",
    "query": {
        "bool": {
            "must": [
                {
                    "match_all": {
                        "boost": 1
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "from": null,
                            "to": null,
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
4. when I add it as a filter in Discover tab, the value populates. 

"filter": [ { "range": { "@timestamp": { "from": "2022-02-02T16:03:10.792+05:30", "to": "2022-02-03T16:03:10.792+05:30", "include_lower": true, "include_upper": false, "boost": 1 } } },


<img width="1414" alt="DiscoverTab" src="https://user-images.githubusercontent.com/98999784/152470687-73c8f844-c122-41d1-bf0a-72ef78de4305.png">

**Expected behavior**
The timestamp should be non-null for both Quick select and filter ranges in Kibana queries.
assafcoh commented 11 months ago

I am experiencing the same issue on opensearch version 2.9. For example when running this query in opensearch dashboard:

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "tenant_id": 1
          }
        },
        {
          "term": {
            "user_id": "user-34"
          }
        },
        {
          "range": {
            "event_time_utc_ms": {
              "gte": 1694171042000
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "dst_type": {
      "terms": {
        "field": "dst_type"
      }
    },
    "dst_category_name": {
      "terms": {
        "field": "dst_category_name"
      }
    }
  }
}

I see this in the slow logs. Notice the null range filter:

"query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "tenant_id": {
                            "value": 1,
                            "boost": 1
                        }
                    }
                },
                {
                    "term": {
                        "user_id": {
                            "value": "user-34",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "event_time_utc_ms": {
                            "from": null,
                            "to": null,
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    }
msfroh commented 11 months ago

So, the problem seems to arise from the fact that slow logs are written at the shard level and queries can be rewritten at the shard level.

Specifically, there is logic in RangeQueryBuilder shard-level rewrite that checks if the current shard is entirely contained within the range. If so, it replaces the upper and lower bounds with null, which functionally turns it into a "does the field exist?" query: https://github.com/opensearch-project/OpenSearch/blob/966b78dae39fd3538d78f05bdc2b5fbdc1e1301f/server/src/main/java/org/opensearch/index/query/RangeQueryBuilder.java#L486-L493

I can see a few possible options, all of which require changes in OpenSearch (i.e. I see no user workaround):

  1. Save the original query (before rewrite) and write that to the slow logs. (There are other potentially surprising rewrites -- e.g. if a field isn't part of a targeted shard's index mapping, the query is rewritten to match_none.)
  2. Fiddle with the logic specific to RangeQueryBuilder to keep the endpoints, but still allow it to execute as a FieldExistsQuery. (I would probably add a flag to RangeQueryBuilder that only gets set on rewrite, and essentially says "run as a FieldExistsQuery".)
  3. Implement request-level slow logs, as described in https://github.com/opensearch-project/OpenSearch/issues/9642.