[RFC] Push Down Sort/Limit/Project through LogicalEval

Is your feature request related to a problem? Currently in a logical plan of OpenSearch, the LogicalEval is a blocker for many push-down optimization rules like TableScanPushDown.PUSH_DOWN_PROJECT, TableScanPushDown.PUSH_DOWN_SORT, TableScanPushDown.PUSH_DOWN_LIMIT and etc. Not only does it reduce the performance of execution, it also lead to wrong results if we cannot do pushdown optimization.

E.g. Below 2 PPL are semantic equal, but the latter one with eval retrieves all fields from OpenSearch(which is unnecessary), while not enough documents due to the limitation of settings of plugins.query.size_limit.

# no eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":200,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

--------------------------------------------------------

# with eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | eval FlightMin = FlightTimeMin |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 10,
          "offset": 0
        },
        "children": [
          {
            "name": "EvalOperator",
            "description": {
              "expressions": {
                "FlightMin": "FlightTimeMin"
              }
            },
            "children": [
              {
                "name": "OpenSearchIndexScan",
                "description": {
                  "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":300,"timeout":"1m"}, searchDone=false)"""
                },
                "children": []
              }
            ]
          }
        ]
      }
    ]
  }
}

What solution would you like? Since eval operator only adds or updates fields without any benefit on skipping or project data, and it could block other optimization, we should push down sort/limit/project operator through eval to evaluate eval's output as late as possible.

1. Limit

For limit, we could push down it under eval directly since it doesn't have any expressions.

2. Sort

For sort, for each sort field, we should analyze whether it's produced by eval.

2.1 All sort fields are not produced by eval, which means they can all be resolved by the Environment before eval, so we can push down it under eval directly

2.2 One or more fields are produced by eval, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

2.2.1 the field is produced by equation to ReferenceExpression which is just simple rename like eval(sortField=originField), we can also push down sort by do some field replacement for these fields. Of course, there is some more complex cases like eval(newField=originField, sortField=newField), we should do replacement iteratively from right to left, or we just narrow down this case to not let it push down.
2.2.2 the field is produced by function, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

3. Project

When we say pushing down a project under eval, it actually inserts another project before eval instead of pushing down the origin project itself.

We should consruct the new projectList by append all eval reference expressions to the original projectList, and kick off all reference expressions produced by eval. For example, if we have plan project(a, b, d) - eval(b=c+2, a=b+1), we should optimize it to project(a, b, c) - eval(a=b+1, b=c+2)) - project(c, d). Thus, we can do project in advance or push down project(c, d) into TableScan if possible, and ensure they are semantic equal.

Sub-Tasks

[x] #2903
[ ] #2904
[ ] #2905

What alternatives have you considered?

Do you have any additional context?

opensearch-project / sql