opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0

[DISCUSSION] Properly support array values in new engine #1300

Closed GumpacG closed 2 weeks ago

GumpacG commented 1 year ago

What is the bug?

The new engine does not return values in an array, while the legacy engine returns all values in a row as an array. Implementing the same support as V1 isn't the right way to go, because the legacy engine produces inconsistent values.

How can one reproduce the bug?

Steps to reproduce the behavior:

  1. Start OpenSearch server
  2. Delete the index if it was created before: curl -XDELETE 'http://localhost:9200/dbg'
  3. Create a simple index with automatic mapping: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": 5}'
  4. Query data: select * from dbg. Not bad so far.
  5. Add new doc: curl -X POST "localhost:9200/dbg/_doc/?pretty" -H 'Content-Type: application/json' -d '{"myNum": [3, 4]}'
  6. Check mapping: curl -X GET "localhost:9200/dbg?pretty"
    "mappings" : {
    "properties" : {
    "myNum" : {
      "type" : "long"
    }
    }
    }
  7. Query in the new engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg"}'
    {
      "schema": [
        {
          "name": "myNum",
          "type": "long"
        }
      ],
      "datarows": [
        [
          5
        ],
        [
          3
        ]
      ],
      "total": 2,
      "size": 2,
      "status": 200
    }

    (if you have only the second doc in the index)

    "schema": [
    {
      "name": "myNum",
      "type": "long"
    }
    ],
    "datarows": [
    [
      3
    ],
    [
      4
    ]
    ]
  8. Query in the legacy engine: curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg", "fetch_size": 20}'
    {
      "schema": [
        {
          "name": "myNum",
          "type": "long"
        }
      ],
      "total": 2,
      "datarows": [
        [
          5
        ],
        [
          [
            3,
            4
          ]
        ]
      ],
      "size": 2,
      "status": 200
    }
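
Note that the mapping in step 6 did not change when the array document was added: OpenSearch has no dedicated array type, so any field may hold one value or many, and the mapping alone cannot tell either engine which shape to expect. A quick way to confirm is to fetch the raw documents, where _source still carries the array even though the mapping says long:

    curl -s -XGET 'http://localhost:9200/dbg/_search?pretty'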

What is the expected behavior?

TBD

Why is the legacy response incorrect?

It declares the data type as long, but returns both plain numbers and arrays of numbers. Imagine a user has a parser for the response: what should the parser do with such values? You can try our JDBC driver as an example of a client application.

What is your host/environment?

main @ 6108ca1a1

Yury-Fridlyand commented 1 year ago

We had a team discussion outside of GitHub; I'll list the ideas we came up with, with notes for each.

Return all values as array when there is a mix of array<type> and type

TL;DR Not applicable. Imagine this fix were done and you repeated the experiment given above.

  1. When index dbg has one doc, SQL returns type long, but once a second doc is posted to the index, SQL returns an array.
  2. When you query for the first doc, SQL returns type long, but when you query for the second (or both), SQL returns an array.

The user didn't change the index/mapping/field/column type/etc., but the data type changed; that is unacceptable.

Always return type as array

TL;DR Not applicable.

A user would lose information about the data type of a column/field.
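
For illustration, a response under this idea might look like the following sketch (the array type name here is hypothetical); note that the element type long no longer appears anywhere in the schema:

{
  "schema": [
    {
      "name": "myNum",
      "type": "array"
    }
  ],
  "datarows": [
    [
      [
        5
      ]
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}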

Add new REST argument to parametrize the query

It could be a part of the URL, e.g. localhost:9200/_plugins/_sql?param=value, or a part of the JSON body, e.g.

{
  "query": "select * from dbg",
  "param": "value"
}

In that case the user would be able to specify whether they want a loose response (as the legacy engine returns now) or a strict one. TBD: what to do with array values in that case? Replace them with nulls? Don't return them?
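
For example, a strict request could look like this (the parameter name "strict" is hypothetical, used only for illustration):

curl -s -XPOST http://localhost:9200/_plugins/_sql -H 'Content-Type: application/json' -d '{"query": "select * from dbg", "strict": true}'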

Put all responsibility on user: CAST

Enforce strict mode. A user should apply a cast to get array values. Without a cast they should be TBD (omitted? nulled?). Example:

SELECT CAST(myNum AS ARRAY), myNum FROM dbg;

A cast to array would have to be implemented, though.
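
For illustration, assuming strict mode nulls un-cast array values (one of the TBD options above), the example query might produce a response like this sketch (the array schema type is also an assumption):

{
  "schema": [
    {
      "name": "CAST(myNum AS ARRAY)",
      "type": "array"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "datarows": [
    [
      [
        5
      ],
      5
    ],
    [
      [
        3,
        4
      ],
      null
    ]
  ],
  "total": 2,
  "size": 2,
  "status": 200
}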

Put all responsibility on user: PartiQL

Ref link: https://partiql.org/tutorial.html#variables-can-range-over-data-with-different-types

Enforce strict mode. A user should specify how to interpret values using a combination of CASE and IS clauses. TBD: what to do with a value if no instructions are given (omit? null? error?).

SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
       ELSE myNum END AS myNum
FROM dbg;

or

SELECT num AS myNum FROM dbg AS d, d.myNum[0] AS num

To be continued...

Ideas are welcome!

Notes

1. In any case, a user should be able to tell that there is a complex value in the index. Consider updating the response for a DESCRIBE query.
2. If a complex value is produced by a function (e.g. nested) and can't be inserted into the response (for example, a CAST or CASE is missing), an error should be raised. It should state unambiguously what is missing in the query.

dai-chen commented 1 year ago

@Yury-Fridlyand Thanks for sharing the notes! Have we also considered the idea in Presto/Trino: https://github.com/opensearch-project/sql/issues/442#issuecomment-1246068445 ?

Yury-Fridlyand commented 1 year ago

Following that, the response would look like:

{
  "schema": [
    {
      "name": "myNum",
      "type": "long",
      "array": true
    }
  ],
  "total": 2,
  "datarows": [
    [
      5
    ],
    [
      [
        3,
        4
      ]
    ]
  ],
  "size": 2,
  "status": 200
}

Why not? It could be implemented along with the CAST/PartiQL approach. Keep in mind that it is a breaking change.

GumpacG commented 1 year ago

This also concerns fields being passed into functions, since only the last value in the array is used. Also, note the incorrect type (double for upper(name)) in the example below.

Query: SELECT upper(name), name FROM string_array

Output:


{
    "schema": [
        {
            "name": "upper(name)",
            "type": "double"
        },
        {
            "name": "name",
            "type": "keyword"
        }
    ],
    "total": 5,
    "datarows": [
        [
            "JOE",
            "joe"
        ],
        [
            "STEVE",
            "steve"
        ],
        [
            "BOB",
            "bob"
        ],
        [
            "ANDY",
            "andy"
        ],
        [
            "ADAM",
            [
                "david",
                "adam"
            ]
        ]
    ],
    "size": 5,
    "status": 200
}

akuzin1 commented 1 year ago

This is currently a blocker for implementing array support in the JDBC driver, so it would be a great feature to have. Any status update on its priority?

Yury-Fridlyand commented 1 year ago

I'm going to research possible solutions for this and share them for discussion. Meanwhile, I found a trickier sample, which we should consider as well.

POST dbg/_doc
{
  "obj": [[1, 2], [3, 4], 5]
}

search result:

{
  "took" : 361,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dbg",
        "_id" : "uDBnoYgB5NyEnr3HyanK",
        "_score" : 1.0,
        "_source" : {
          "obj" : [
            [
              1,
              2
            ],
            [
              3,
              4
            ],
            5
          ]
        }
      }
    ]
  }
}

mapping:

{
  "dbg" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "obj" : {
          "type" : "long"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1686168574104",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "LN-lYUQFSsi9MVpa5VT-Zg",
        "version" : {
          "created" : "136297827"
        },
        "provided_name" : "dbg"
      }
    }
  }
}

Yury-Fridlyand commented 1 year ago

Please continue the discussion in https://github.com/opensearch-project/sql/discussions/1733.

acarbonetto commented 1 year ago

Supporting all of the above use cases will take multiple attempts, and each should be dealt with separately. We can split the use cases into individual issues.

To solve primitive/array values expanding into multiple rows, we should try to use metadata (cues) to determine whether the data should be treated as a primitive object or as an array (we cannot easily do both). Doing something like what Presto/Trino supports in the index mapping (https://trino.io/docs/current/connector/elasticsearch.html#array-types) would be simple: it would indicate that the mapped symbol should treat the data record as an array (or as an array of 1 if the data is not defined as an array).
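
For reference, the Trino connector linked above reads an isArray flag from the _meta section of the index mapping. A comparable cue for the dbg index from the repro steps could be set like this (a sketch adapted from the Trino docs; the exact convention for this plugin is still to be decided):

curl -XPUT 'http://localhost:9200/dbg/_mapping' -H 'Content-Type: application/json' -d '
{
  "_meta": {
    "trino": {
      "myNum": {
        "isArray": true
      }
    }
  }
}'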

We should PoC this and see if it works to solve https://github.com/opensearch-project/sql/discussions/1733#discussioncomment-6157898.