opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
121 stars 140 forks source link

Extend size_limit setting in query engine to support unlimited index query. #703

Open penghuo opened 2 years ago

penghuo commented 2 years ago

Problem statements

Currently, the query.size_limit setting configure the maximum amount of documents to be pull from OpenSearch. The default value is: 200. for example, Let's say size_limit = 200, and index has 10K docs.

Proposal

The query.size_limit configure the maximum amount of rows returned by query. The default value is: 200. size_limit must larger than 0. If the query has head(PPL) or limit(SQL). it will override the query.size_limit setting.

Expectation of search query .

Expectation of aggregation query.

dai-chen commented 2 years ago

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

penghuo commented 2 years ago

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

Update the proposal as discussed.

seankao-az commented 2 years ago

Design

OpenSearch Request

Request Operators

Non Aggregation Query

Interface to the OpenSearch engine used by the OpenSearchIndexScan physical plan

  1. OpenSearchQueryRequest The default request operator.
  2. OpenSearchScrollRequest This is used if query size exceeds the index.max_result_window setting. It invokes scroll requests to OpenSearch and fetches results in batches.
Aggregation Query

There's no scroll request for aggregation queries in OpenSearch. For a composite (group by) aggregation query, the response contains a keyAfter field, which can be used in the next request to fetch the next buckets.

Request Builder

OpenSearchRequestBuilder builds OpenSearchQueryRequest or OpenSearchScrollRequest, depending on whether scrolling is needed.

Physical Plan Implementation

  1. Get index.max_result_window for indices.
  2. Initializes OpenSearchIndexScan, which contains OpenSearchRequestBuilder
  3. Visit logical plan with index scan as context, so logical operators visited will accumulate (push down) OpenSearch query and aggregation DSL on index scan. The operations are pushed down to the request builder.

Index Scan Execution

  1. Build the request upon plan.open()
  2. Fetch the results in batches of size maxResultWindow
  3. When plan.close(), clean up the cursor and context in OpenSearch engine if request type is OpenSearchScrollRequest
seankao-az commented 2 years ago

Remaining issues:

  1. Extend query size limit for aggregation query requests
  2. Described as follows

Here we assume

query.size_limit = 200
index.max_result_window = 10000

These work as expected:

But these don't:

The reason being that limit is only pushed down to index scan if they're optimized and merged into a single node. In these two cases the index scan has query size 200 (query.size_limit).

Solution

Option 1

Better logical plan optimization so that the Project logical plan node doesn't block optimization for other plan nodes. Project isn't merged with Relation / Index Scan, and thus stops Limit from merging with Relation / Index Scan

seankao-az commented 2 years ago

One note on the performance. With this feature, there's no limitation on the size of the query result anymore, so it's possible that a single request-response cycle take too long and timeout.

penghuo commented 1 month ago

@seankao-az @dai-chen Want to revisit the definition of plugins.query.size_limit, currently, the definition of plugins.query.size_limit is , The new engine fetches a default size of index from OpenSearch set by this setting, the default value equals to max result window in index level (10000 by default). You can change the value to any value not greater than the max result window value in index level (index.max_result_window). https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#plugins-query-size-limit

In my opinion, there are two issues:

  1. It is unclear how plugins.query.size_limit works.
  2. It should not be tied to the max_result_window.

My proposal is The query.size_limit configuration sets the maximum number of rows returned by a query. The default value is 10,000, and size_limit must be greater than 0. Note: This limit applies regardless of whether the query includes HEAD (PPL) or LIMIT (SQL).

seankao-az commented 4 weeks ago

makes sense to me. so query.size_limit can be any positive number, regardless to max_result_window.

Regarding

It is unclear how plugins.query.size_limit works.

I think we should let plugins.query.size_limit setting only decide the final result size, not size of any intermediate step. Currently source=index | <other commands>, if no other operation is pushed down to DSL, then <other commands> will operate only on the 10000 (query.size_limit) results returned from the scan.