opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
118 stars 138 forks source link

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

Open qianheng-aws opened 2 months ago

qianheng-aws commented 2 months ago

Is your feature request related to a problem? Currently in a logical plan of OpenSearch, the LogicalEval is a blocker for many push-down optimization rules like TableScanPushDown.PUSH_DOWN_PROJECT, TableScanPushDown.PUSH_DOWN_SORT, TableScanPushDown.PUSH_DOWN_LIMIT and etc. Not only does it reduce the performance of execution, it also lead to wrong results if we cannot do pushdown optimization.

E.g. Below 2 PPL are semantic equal, but the latter one with eval retrieves all fields from OpenSearch(which is unnecessary), while not enough documents due to the limitation of settings of plugins.query.size_limit.

# no eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":200,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

--------------------------------------------------------

# with eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | eval FlightMin = FlightTimeMin |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 10,
          "offset": 0
        },
        "children": [
          {
            "name": "EvalOperator",
            "description": {
              "expressions": {
                "FlightMin": "FlightTimeMin"
              }
            },
            "children": [
              {
                "name": "OpenSearchIndexScan",
                "description": {
                  "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":300,"timeout":"1m"}, searchDone=false)"""
                },
                "children": []
              }
            ]
          }
        ]
      }
    ]
  }
}

What solution would you like? Since eval operator only adds or updates fields without any benefit on skipping or project data, and it could block other optimization, we should push down sort/limit/project operator through eval to evaluate eval's output as late as possible.

1. Limit

For limit, we could push down it under eval directly since it doesn't have any expressions.

2. Sort

For sort, for each sort field, we should analyze whether it's produced by eval.

2.1 All sort fields are not produced by eval, which means they can all be resolved by the Environment before eval, so we can push down it under eval directly

2.2 One or more fields are produced by eval, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

3. Project

When we say pushing down a project under eval, it actually inserts another project before eval instead of pushing down the origin project itself.

We should consruct the new projectList by append all eval reference expressions to the original projectList, and kick off all reference expressions produced by eval. For example, if we have plan project(a, b, d) - eval(b=c+2, a=b+1), we should optimize it to project(a, b, c) - eval(a=b+1, b=c+2)) - project(c, d). Thus, we can do project in advance or push down project(c, d) into TableScan if possible, and ensure they are semantic equal.

Sub-Tasks

What alternatives have you considered?

Do you have any additional context?

LantaoJin commented 2 months ago

Not all scenarios for operators sort/limit/project in a query are able to push down through eval. Could you explain more details of your design in this description? @qianheng-aws