opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
120 stars 139 forks source link

[BUG] IS NOT NULL with alias throws SemanticCheckException #292

Open ankurgupta4891 opened 2 years ago

ankurgupta4891 commented 2 years ago

Describe the bug Using IS NOT NULL with alias throws SemanticCheckException.

To Reproduce Sample query:

SELECT dest_city_name AS dc
FROM all_flights
WHERE dc IS NOT NULL

Expected behavior Should be same as query:

SELECT dest_city_name AS dc
FROM all_flights
WHERE dest_city_name IS NOT NULL
penghuo commented 2 years ago

Thanks for reporting issue. Currently we doesn't support use alias in where clause. One workaround should be

SELECT dest_city_name AS dc
FROM all_flights
WHERE dest_city_name IS NOT NULL
penghuo commented 2 years ago

@ankurgupta4891 In https://github.com/opendistro-for-elasticsearch/sql/issues/1145, you mentioned, "I tried to give simple version of our use case but we need to get alias support with IS NOT NULL"

Could you share your use case? Are u using tools to auto generate query?

penghuo commented 2 years ago

confirmed offline. Query is auto generated.

Next Step

We need to evaluate the tech solution. The challenge is that SQL statements is analyzed in predefined order of, FROM -> WHERE -> SELECT, alias defined in SELECT which is not visible to WHERE clause.

jingyali-apr commented 2 years ago

Hey team, I notice that IS NOT NULL/IS NULL with alias in ORDER BY also throws exception: "This query is not explainable." (Using Query Workbench kibana plugin in ES 7.10)

SELECT dest_state_nm AS city FROM all_flights_integ_tests ORDER BY city IS NOT NULL

Could you help and check as well? Thanks

dai-chen commented 2 years ago

Hey team, I notice that IS NOT NULL/IS NULL with alias in ORDER BY also throws exception: "This query is not explainable." (Using Query Workbench kibana plugin in ES 7.10)

SELECT dest_state_nm AS city FROM all_flights_integ_tests ORDER BY city IS NOT NULL

Could you help and check as well? Thanks

@jingyali-apr Thanks for reporting the issue! Could you share your index mapping and some test data for investigation? And it would be helpful if you can find error stacktrace in ES log.

jingyali-apr commented 2 years ago

Hey team, I notice that IS NOT NULL/IS NULL with alias in ORDER BY also throws exception: "This query is not explainable." (Using Query Workbench kibana plugin in ES 7.10)

SELECT dest_state_nm AS city FROM all_flights_integ_tests ORDER BY city IS NOT NULL

Could you help and check as well? Thanks

@jingyali-apr Thanks for reporting the issue! Could you share your index mapping and some test data for investigation? And it would be helpful if you can find error stacktrace in ES log.

Hey @dai-chen, please check below samples.

Successful case without IS NOT NULL

POST _opendistro/_sql/
{
"query": "SELECT dest_state_nm AS state FROM all_flights_integ_tests GROUP BY state ORDER BY state limit 5"
}

{
  "schema": [
    {
      "name": "dest_state_nm",
      "alias": "state",
      "type": "keyword"
    }
  ],
  "datarows": [
    [
      "Alabama"
    ],
    [
      "Alaska"
    ],
    [
      "California"
    ],
    [
      "Colorado"
    ],
    [
      "Connecticut"
    ]
  ],
  "total": 5,
  "size": 5,
  "status": 200
}

Failed case with IS NOT NULL

POST _opendistro/_sql/
{
"query": "SELECT dest_state_nm AS state FROM all_flights_integ_tests GROUP BY state ORDER BY state IS NOT NULL limit 5"
}

{
  "error": {
    "reason": "Invalid SQL query",
    "details": "can't resolve Symbol(namespace=FIELD_NAME, name=state) in type env",
    "type": "SemanticCheckException"
  },
  "status": 400
}

Successful case without using alias

POST _opendistro/_sql/
{
"query": "SELECT dest_state_nm AS state FROM all_flights_integ_tests GROUP BY state ORDER BY dest_state_nm IS NOT NULL limit 5"
}

{
  "schema": [
    {
      "name": "dest_state_nm",
      "alias": "state",
      "type": "keyword"
    }
  ],
  "datarows": [
    [
      "Alabama"
    ],
    [
      "Wisconsin"
    ],
    [
      "West Virginia"
    ],
    [
      "Virginia"
    ],
    [
      "Vermont"
    ]
  ],
  "total": 5,
  "size": 5,
  "status": 200
}

Partial test data:

POST _opendistro/_sql/
{
"query": "SELECT dest_state_nm FROM all_flights_integ_tests limit 50"
}

{
  "schema": [
    {
      "name": "dest_state_nm",
      "type": "keyword"
    }
  ],
  "datarows": [
    [
      "Alaska"
    ],
    [
      "South Carolina"
    ],
    [
      "Michigan"
    ],
    [
      "Texas"
    ],
    [
      "Texas"
    ],
    [
      "New York"
    ],
    [
      "Michigan"
    ],
    [
      "Massachusetts"
    ],
    [
      "Texas"
    ],
    [
      "Hawaii"
    ],
    [
      "Hawaii"
    ],
    [
      "Alabama"
    ],
    [
      "Texas"
    ],
    [
      "Georgia"
    ],
    [
      "Ohio"
    ],
    [
      "Michigan"
    ],
    [
      "Texas"
    ],
    [
      "Ohio"
    ],
    [
      "New York"
    ],
    [
      "Virginia"
    ],
    [
      "Virginia"
    ],
    [
      "Michigan"
    ],
    [
      "Florida"
    ],
    [
      "Wisconsin"
    ],
    [
      "Texas"
    ],
    [
      "Mississippi"
    ],
    [
      "New York"
    ],
    [
      "Georgia"
    ],
    [
      "Texas"
    ],
    [
      "New Jersey"
    ],
    [
      "Texas"
    ],
    [
      "Louisiana"
    ],
    [
      "New York"
    ],
    [
      "Alabama"
    ],
    [
      "Texas"
    ],
    [
      "Kentucky"
    ],
    [
      "Texas"
    ],
    [
      "Georgia"
    ],
    [
      "Georgia"
    ],
    [
      "Texas"
    ],
    [
      "Georgia"
    ],
    [
      "Tennessee"
    ],
    [
      "New Jersey"
    ],
    [
      "Georgia"
    ],
    [
      "Georgia"
    ],
    [
      "Georgia"
    ],
    [
      "Georgia"
    ],
    [
      "Maryland"
    ],
    [
      "Colorado"
    ],
    [
      "Tennessee"
    ]
  ],
  "total": 50,
  "size": 50,
  "status": 200
}

Thanks

dai-chen commented 2 years ago

@jingyali-apr Could you try to use the original field name in ORDER BY clause as below?

POST _opendistro/_sql
{
"query" : """
  SELECT dest_state_nm AS state FROM all_flights_integ_tests GROUP BY state ORDER BY dest_state_nm IS NOT NULL
 """
}
jingyali-apr commented 2 years ago

@jingyali-apr Could you try to use the original field name in ORDER BY clause as below?

POST _opendistro/_sql
{
"query" : """
  SELECT dest_state_nm AS state FROM all_flights_integ_tests GROUP BY state ORDER BY dest_state_nm IS NOT NULL
 """
}

Yeah, I tried this and it worked.

See my Successful case without using alias in above comment.

dai-chen commented 2 years ago

@jingyali-apr Thanks for confirming! I may have found where things go wrong: https://github.com/opensearch-project/sql/blob/907019cc8f7ce05c34ed453a75e4d897f55b57c5/sql/src/main/java/org/opensearch/sql/sql/parser/AstSortBuilder.java#L53

When we try to replace alias in ORDER BY, we only replace ordinal or alias directly. In our case, state is NOT NULL is an expression which causes alias replacement fail. Will debug to confirm and open issue/PR to fix. However, just to let you know, the fix would be only available in latest OpenSearch version (probably 2.3 next). Thanks!

Yury-Fridlyand commented 2 years ago

https://stackoverflow.com/a/942592 You can only use column aliases in GROUP BY, ORDER BY, or HAVING clauses.

Standard SQL doesn't allow you to refer to a column alias in a WHERE clause. This restriction is imposed because when the WHERE code is executed, the column value may not yet be determined.

Copied from MySQL documentation

jpluimers commented 1 year ago

Copied from MySQL documentation

Note that was the [Wayback/Archive] MySQL :: MySQL 5.1 Reference Manual :: B.5.5.4 Problems with Column Aliases documentation until December 2009. After that it got rephrased (including "doesn't allow" to "disallow", "alias" to "aliases", "code" to "clause) and still is in the current form at [Wayback/Archive] MySQL :: MySQL 8.0 Reference Manual :: B.3.4.4 Problems with Column Aliases:

Standard SQL disallows references to column aliases in a WHERE clause. This restriction is imposed because when the WHERE clause is evaluated, the column value may not yet have been determined.