Open penghuo opened 2 years ago
Probably another way of thinking about this: the `size_limit` setting is just for default behavior. If users specify a larger number by `head` or `LIMIT`, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting the size limit to -1 and having a user run a query without a `head` command later?
Update the proposal as discussed.
Interface to the OpenSearch engine used by the `OpenSearchIndexScan` physical plan:
- `OpenSearchQueryRequest`: the default request operator.
- `OpenSearchScrollRequest`: used if the query size exceeds the `index.max_result_window` setting. It invokes scroll requests to OpenSearch and fetches results in batches.

There's no scroll request for aggregation queries in OpenSearch.
For a composite (group by) aggregation query, the response contains an `after_key` field, which can be passed as the `after` parameter of the next request to fetch the next buckets.
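The paging loop for composite aggregations can be sketched as follows. This is a minimal illustration, not the plugin's code: `fake_composite_search` is a hypothetical in-memory stand-in for one aggregation request, and `BUCKETS` is made-up data. Only the pattern (feed the last key of each page into the next request) comes from the description above.

```python
from typing import Optional

# Hypothetical, already-sorted composite buckets: (key, doc_count).
BUCKETS = [({"request": f"/page/{i}"}, i + 1) for i in range(5)]


def fake_composite_search(size: int, after: Optional[dict] = None) -> dict:
    """Stand-in for one aggregation request (assumption, not a real client call)."""
    start = 0
    if after is not None:
        keys = [key for key, _ in BUCKETS]
        start = keys.index(after) + 1  # resume right after the given key
    page = BUCKETS[start:start + size]
    response = {"buckets": [{"key": k, "doc_count": c} for k, c in page]}
    if page:
        # The key of the last bucket on this page marks where to resume.
        response["after_key"] = page[-1][0]
    return response


def fetch_all_buckets(page_size: int) -> list:
    """Request pages until one comes back empty, chaining the after key."""
    buckets, after = [], None
    while True:
        response = fake_composite_search(page_size, after)
        if not response["buckets"]:
            break
        buckets.extend(response["buckets"])
        after = response["after_key"]
    return buckets
```

Calling `fetch_all_buckets(2)` walks the five sample buckets in three non-empty pages, then stops on the empty fourth page.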
`OpenSearchRequestBuilder` builds `OpenSearchQueryRequest` or `OpenSearchScrollRequest`, depending on whether scrolling is needed.
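The choice the builder makes can be sketched roughly like this. The function names are hypothetical, and the assumption that each scroll batch is at most `index.max_result_window` documents is mine; the thread only says that results are fetched in batches.

```python
def choose_request(query_size: int, max_result_window: int) -> str:
    """Pick the request type: scroll only when the size exceeds the window."""
    if query_size > max_result_window:
        return "OpenSearchScrollRequest"
    return "OpenSearchQueryRequest"


def batch_sizes(query_size: int, max_result_window: int) -> list:
    """Split a query size into batches of at most max_result_window rows.

    Assumption for illustration: one batch never exceeds the window.
    """
    sizes = []
    remaining = query_size
    while remaining > 0:
        batch = min(remaining, max_result_window)
        sizes.append(batch)
        remaining -= batch
    return sizes
```

With `index.max_result_window = 10000`, a size of 200 stays a plain query request, while a size of 11000 becomes a scroll request that this sketch splits into two batches.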
`index.max_result_window` for indices.
`OpenSearchIndexScan`, which contains `OpenSearchRequestBuilder`
`plan.open()`
`maxResultWindow`
`plan.close()`: clean up the cursor and context in the OpenSearch engine if the request type is `OpenSearchScrollRequest`
Remaining issues:
Here we assume
query.size_limit = 200
index.max_result_window = 10000
These work as expected:
- `source=index` returns 200 rows
- `source=index | head 1` returns 1 row
- `source=index | head 300` returns 300 rows
- `source=index | head 11000` returns 11000 rows using scroll
- `source=index | fields a,b` returns 200 rows
- `source=index | fields a,b | head 1` returns 1 row

But these don't:
- `source=index | fields a,b | head 300` returns 200 rows
- `source=index | fields a,b | head 11000` returns 200 rows

The reason is that the limit is only pushed down to the index scan if the two are optimized and merged into a single node. In these two cases the index scan keeps query size 200 (`query.size_limit`).
A possible fix is better logical plan optimization, so that the Project logical plan node doesn't block optimization of other plan nodes. Currently Project isn't merged with Relation / Index Scan, and thus stops Limit from merging with Relation / Index Scan.
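A toy model of the pushdown behavior described above may help. This is not the plugin's optimizer: `Node`, `scan_size`, and `rows_returned` are hypothetical names, and the plan is simplified to a list ordered from the scan upward. The `look_through_project` flag stands for the improved rule, in which Project no longer blocks Limit.

```python
from dataclasses import dataclass
from typing import List, Optional

SIZE_LIMIT = 200  # query.size_limit assumed in the examples above


@dataclass
class Node:
    kind: str                    # "scan", "project", or "limit"
    limit: Optional[int] = None  # only set for "limit" nodes


def scan_size(plan: List[Node], look_through_project: bool) -> int:
    """Return the query size the index scan ends up with.

    A limit is pushed into the scan only if it sits directly on top of it.
    With look_through_project=True, Project nodes are skipped over first.
    """
    assert plan and plan[0].kind == "scan"
    for node in plan[1:]:
        if node.kind == "limit":
            return node.limit
        if node.kind == "project" and look_through_project:
            continue
        break  # any other node blocks the pushdown
    return SIZE_LIMIT


def rows_returned(plan: List[Node], look_through_project: bool = False) -> int:
    """Rows the user sees: the scan size, truncated by any limit above it."""
    rows = scan_size(plan, look_through_project)
    for node in plan[1:]:
        if node.kind == "limit":
            rows = min(rows, node.limit)
    return rows


scan, project = Node("scan"), Node("project")
fields_head_300 = [scan, project, Node("limit", 300)]
```

In this model `rows_returned(fields_head_300)` yields 200, matching the failing case, while `rows_returned(fields_head_300, look_through_project=True)` yields 300.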
One note on performance: with this feature, there's no longer any limit on the size of the query result, so it's possible that a single request-response cycle takes too long and times out.
@seankao-az @dai-chen I want to revisit the definition of `plugins.query.size_limit`. Currently, it is defined as: "The new engine fetches a default size of index from OpenSearch set by this setting, the default value equals to max result window in index level (10000 by default). You can change the value to any value not greater than the max result window value in index level (`index.max_result_window`)." See https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#plugins-query-size-limit
In my opinion, there are two issues:
My proposal is: the `query.size_limit` configuration sets the maximum number of rows returned by a query. The default value is 10,000, and `size_limit` must be greater than 0. Note: this limit applies regardless of whether the query includes HEAD (PPL) or LIMIT (SQL).
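Read literally, this proposal makes `size_limit` a hard cap that an explicit HEAD or LIMIT cannot exceed. A minimal sketch of that reading follows; the function name `capped_rows` is hypothetical and the 10,000 default is taken from the proposal text.

```python
from typing import Optional


def capped_rows(size_limit: int = 10_000, head: Optional[int] = None) -> int:
    """Maximum rows returned when size_limit always applies.

    head is the value of HEAD (PPL) or LIMIT (SQL), or None if absent.
    """
    if size_limit <= 0:
        raise ValueError("size_limit must be greater than 0")
    if head is None:
        return size_limit
    return min(head, size_limit)
```

Under this reading, `capped_rows(head=11000)` is still 10000, while `capped_rows(head=1)` is 1.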
Makes sense to me. So `query.size_limit` can be any positive number, regardless of `max_result_window`.
Regarding "It is unclear how plugins.query.size_limit works":
I think we should let the `plugins.query.size_limit` setting decide only the final result size, not the size of any intermediate step.
Currently, for `source=index | <other commands>`, if no other operation is pushed down to DSL, then `<other commands>` will operate only on the 10000 (`query.size_limit`) results returned from the scan.
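The difference between limiting at the scan and limiting the final result can be shown on a small in-memory list. This is only an illustration: the two function names are hypothetical, a list of integers stands in for the index, and a filter stands in for `<other commands>` that were not pushed down.

```python
from typing import Callable, List


def limit_at_scan(docs: List[int], keep: Callable[[int], bool], size_limit: int) -> List[int]:
    """Behavior described above: cap the scan first, then run later commands."""
    return [d for d in docs[:size_limit] if keep(d)]


def limit_at_end(docs: List[int], keep: Callable[[int], bool], size_limit: int) -> List[int]:
    """Suggested behavior: run later commands on all rows, cap the final result."""
    return [d for d in docs if keep(d)][:size_limit]


docs = list(range(20))          # stand-in for an index with 20 documents


def is_large(d: int) -> bool:   # stand-in for a command that is not pushed down
    return d >= 15
```

With `size_limit = 10`, `limit_at_scan(docs, is_large, 10)` finds nothing because every match lies beyond the first 10 scanned rows, whereas `limit_at_end(docs, is_large, 10)` returns all five matches.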
Problem statements
Currently, the `query.size_limit` setting configures the maximum number of documents to be pulled from OpenSearch. The default value is 200. For example, let's say `size_limit = 200` and the index has 10K docs.
source=index
source=index | head 1
source=index | head 11000
Proposal
The `query.size_limit` setting configures the maximum number of rows returned by a query. The default value is 200, and `size_limit` must be larger than 0. If the query has `head` (PPL) or `LIMIT` (SQL), it overrides the `query.size_limit` setting.
Expectation of search query.
source=index
source=index | head 1
source=index | head 11000
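The override rule in this proposal can be sketched as below. The expected row counts for the three queries are not listed here, so the numbers are derived from the proposal wording and from the example of `size_limit = 200` with a 10K-document index, not quoted from the issue. The function name `rows_with_override` is hypothetical.

```python
from typing import Optional


def rows_with_override(doc_count: int, size_limit: int = 200,
                       head: Optional[int] = None) -> int:
    """Rows returned when an explicit head/LIMIT overrides size_limit.

    The result can never exceed the number of documents in the index.
    """
    if size_limit <= 0:
        raise ValueError("size_limit must be larger than 0")
    limit = size_limit if head is None else head
    return min(limit, doc_count)
```

For a 10K-document index this gives 200 rows for `source=index`, 1 row for `head 1`, and all 10000 documents for `head 11000`.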
Expectation of aggregation query.
source=index | stats request, count(*) by request
source=index | stats request, count(*) by request | head 11000