GumpacG closed this issue 2 weeks ago.
We had a team discussion outside of GH; I'll list the ideas we came up with, along with notes for each.

**`array<type>` and `type`**

TL;DR Not applicable. Imagine you have this fix done and you repeat the experiment given above. While `dbg` has one doc, SQL returns type `long`; but when a second doc is posted to the index, SQL returns `array<long>`; and when you query for the second doc (or both), SQL returns `array`. The user didn't change the index/mapping/field/column type/etc., yet the data type changed, which is unacceptable.

**`array`**

TL;DR Not applicable. A user would lose information about the data type of a column/field.
**Query parameter**

It could be part of the URL, e.g. `localhost:9200/_plugins/_sql?param=value`, or part of the JSON body, e.g.:

```json
{
  "query": "select * from dbg",
  "param": "value"
}
```

In that case the user would be able to specify whether they want a loose response (like legacy produces now) or a strict one. TBD what to do with array values in strict mode: replace them with nulls? Don't return them?
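If the parameter route is taken, the strict behavior could be sketched like this. A minimal sketch in Python; the parameter name and the null-replacement choice are assumptions here (both are left TBD above):

```python
# Hypothetical sketch of the proposed loose/strict response parameter.
# Loose mode keeps the legacy behavior (arrays pass through); strict mode
# assumes arrays are replaced with null -- one of the TBD options above.

def project_value(value, strict: bool):
    """Return a cell as the SQL response would serialize it."""
    if isinstance(value, list):
        return None if strict else value
    return value

rows = [5, [3, 4]]  # the two docs from the dbg example
assert [project_value(v, strict=False) for v in rows] == [5, [3, 4]]
assert [project_value(v, strict=True) for v in rows] == [5, None]
```

This keeps the declared column type (`long`) truthful in strict mode at the cost of dropping array data.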
**`CAST`**

Enforce strict mode. A user should do a cast to get array values; without a `CAST`, the behavior is TBD (omitted? `null`ed?).

Example:

```sql
SELECT CAST(myNum AS ARRAY), myNum FROM dbg;
```

Cast to array would have to be implemented, though.
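For illustration, the `CAST(... AS ARRAY)` semantics could be as simple as "wrap scalars, pass arrays through", so the column always has one consistent shape. A sketch; the actual semantics are not decided in the proposal:

```python
# Hypothetical CAST(x AS ARRAY) semantics: every value becomes a list,
# so a column mixing 5 and [3, 4] serializes uniformly.

def cast_as_array(value):
    return value if isinstance(value, list) else [value]

assert cast_as_array(5) == [5]
assert cast_as_array([3, 4]) == [3, 4]
```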
**PartiQL**

Ref link: https://partiql.org/tutorial.html#variables-can-range-over-data-with-different-types

Enforce strict mode. A user should specify how to interpret values using a combination of `CASE` and `IS` clauses. TBD what to do with a value if no instructions are given (omit? `null`? error?).

```sql
SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
            ELSE myNum END AS myNum
FROM dbg;
```

or

```sql
SELECT num AS myNum FROM dbg AS d, d.myNum[0] AS num
```
Ideas are welcome!
1. In any case, a user should be able to tell that there is a complex value in the index. Consider updating the response for a `DESCRIBE` query.
2. If a complex value is produced by a function (e.g. `nested`) and can't be inserted into the response (for example, a `CAST` or `CASE` is missing), an error should be raised. It should state unambiguously what is missing in the query.
@Yury-Fridlyand Thanks for sharing the notes! Have we also considered the idea used in Presto/Trino: https://github.com/opensearch-project/sql/issues/442#issuecomment-1246068445 ?

Following that, the response would look like:
```json
{
  "schema": [
    {
      "name": "myNum",
      "type": "long",
      "array": true
    }
  ],
  "total": 2,
  "datarows": [
    [5],
    [[3, 4]]
  ],
  "size": 2,
  "status": 200
}
```
Why not?

It could be implemented along with the `CAST`/PartiQL approach. Keep in mind it is a breaking change.
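A sketch of how a client could consume the column-level `array` flag without losing the element type. The normalization of scalars into one-element lists is an assumption for illustration, not part of the proposal:

```python
# Hypothetical client-side handling of the proposed schema: the "array"
# flag keeps the element type ("long") intact, and the client wraps bare
# scalars so every row in an array column has the same shape.

schema = [{"name": "myNum", "type": "long", "array": True}]
datarows = [[5], [[3, 4]]]  # response body from the example above

def normalize(schema, datarows):
    out = []
    for row in datarows:
        out.append([
            (cell if isinstance(cell, list) else [cell]) if col.get("array") else cell
            for col, cell in zip(schema, row)
        ])
    return out

assert normalize(schema, datarows) == [[[5]], [[3, 4]]]
```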
This also concerns fields being passed in to functions, since only the last value in the array is used. Also note the incorrect type in the example below.

For example:

Query: `SELECT upper(name), name FROM string_array`

Output:
```json
{
  "schema": [
    { "name": "upper(name)", "type": "double" },
    { "name": "name", "type": "keyword" }
  ],
  "total": 5,
  "datarows": [
    ["JOE", "joe"],
    ["STEVE", "steve"],
    ["BOB", "bob"],
    ["ANDY", "andy"],
    ["ADAM", ["david", "adam"]]
  ],
  "size": 5,
  "status": 200
}
```
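A minimal sketch of why this is lossy, assuming the engine collapses a multi-valued field to its last element before applying the scalar function (which matches the `"ADAM"` row in the output above):

```python
# Hypothetical model of the observed behavior: a scalar function applied
# to a multi-valued field picks one element (the last), silently dropping
# the rest.

def scalar_upper(value):
    if isinstance(value, list):
        value = value[-1]  # assumed: only the last array value is used
    return value.upper()

assert scalar_upper("joe") == "JOE"
assert scalar_upper(["david", "adam"]) == "ADAM"  # "david" is lost
```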
This is a current blocker for implementing array support in the JDBC driver, so it would be a great feature to have. Any status update on priority?
I'm going to research possible solutions for this and share them for discussion. Meanwhile, I found an even trickier sample, which we should consider as well.

```json
POST dbg/_doc
{
  "obj": [[1, 2], [3, 4], 5]
}
```
Search result:

```json
{
  "took": 361,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "dbg",
        "_id": "uDBnoYgB5NyEnr3HyanK",
        "_score": 1.0,
        "_source": {
          "obj": [
            [1, 2],
            [3, 4],
            5
          ]
        }
      }
    ]
  }
}
```
Mapping:

```json
{
  "dbg": {
    "aliases": {},
    "mappings": {
      "properties": {
        "obj": { "type": "long" }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1686168574104",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "uuid": "LN-lYUQFSsi9MVpa5VT-Zg",
        "version": { "created": "136297827" },
        "provided_name": "dbg"
      }
    }
  }
}
```
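The mapping shows why this sample is tricky: there is no array type in the mapping at all, and nested arrays are flattened into one multi-valued `long` field, so the original `[[1, 2], [3, 4], 5]` structure is unrecoverable from the mapping alone. Roughly:

```python
# Sketch of how a nested array collapses under the "obj": {"type": "long"}
# mapping: only the flat element values survive; the inner list structure
# is not represented anywhere in the index metadata.

def flatten(value):
    if isinstance(value, list):
        return [x for item in value for x in flatten(item)]
    return [value]

assert flatten([[1, 2], [3, 4], 5]) == [1, 2, 3, 4, 5]
```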
Please proceed with discussion in https://github.com/opensearch-project/sql/discussions/1733.
Supporting all of the above use cases will take multiple tries, and each should be dealt with separately. We can split the use cases into individual issues.

To solve primitive/array values expanding into multiple rows, we should try to use metadata (cues) to determine whether the data should be treated as a primitive object or as an array (we cannot easily do both). Doing something like what Presto/Trino supports in the index mapping (https://trino.io/docs/current/connector/elasticsearch.html#array-types) would be simple. It would indicate that the mapped symbol should treat the data record as an array (or as an array of one if the data is not defined as an array).
We should PoC this and see if it works to solve https://github.com/opensearch-project/sql/discussions/1733#discussioncomment-6157898.
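For reference, the linked Trino connector drives this from an `_meta` attribute in the index mapping. Adapted to the `dbg` example it would look roughly like this; the exact `_meta` layout is an assumption based on the linked docs, and an OpenSearch PoC would presumably use its own key:

```json
PUT dbg/_mapping
{
  "_meta": {
    "trino": {
      "myNum": {
        "isArray": true
      }
    }
  }
}
```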
**What is the bug?**

The new engine does not return values in an array, while the legacy engine returns all values in a row as an array. Implementing the same support as V1 does isn't the right way, because the legacy engine produces inconsistent values.

**How can one reproduce the bug?**

Steps to reproduce the behavior:

```shell
curl -XDELETE 'http://localhost:9200/dbg'
curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": 5}'
```

Run `select * from dbg`. Not bad so far.

```shell
curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": [3, 4]}'
curl -X GET "localhost:9200/dbg?pretty"
curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg"}'
```

(if you have only the second doc in the index)

```shell
curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg", "fetch_size": 20}'
```
**What is the expected behavior?**

TBD

**Why is the legacy response incorrect?**

It declares the data type as `long`, but returns both a number and an array of numbers. Imagine a user has a parser for the response: what should the parser do with such values? You can try our JDBC driver as an example of a customer application.
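To make the parser's dilemma concrete, here is a sketch of a typed client in the spirit of the JDBC driver (hypothetical code, not the driver's actual implementation) that trusts the declared `long` column type:

```python
# A typed client has no valid move when a column declared "long"
# delivers a JSON array: coercing, truncating, or erroring all either
# lose data or break the application.

def read_long(cell):
    if not isinstance(cell, int):
        raise TypeError(f"column declared long, got {type(cell).__name__}")
    return cell

datarows = [[5], [[3, 4]]]  # legacy response for the dbg index
assert read_long(datarows[0][0]) == 5  # first row parses fine

try:
    read_long(datarows[1][0])  # second row cannot be parsed as long
    raised = False
except TypeError:
    raised = True
assert raised
```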
**What is your host/environment?**

`main` @ 6108ca1a1