trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.31k stars 2.97k forks source link

Add support to cast type VARBINARY as JSON #10814

Open cccs-joel opened 2 years ago

cccs-joel commented 2 years ago

Summary:

As mentioned at JSON doc, some types are currently not supported with the function CAST AS JSON. This might be due to the fact that the JSON format natively doesn't support binary data. Our use case is to be able to cast a complex column (row, arrays...) as json with some values being unsupported types.

Description:

SELECT CAST(
CAST(TIMESTAMP '2022-01-01' AS TIMESTAMP)
AS JSON)

does work even if the TIMESTAMP is not in the supported type list

SELECT CAST(
CAST(TIMESTAMP '2022-01-01' AS TIMESTAMP WITH TIME ZONE)
AS JSON)

does not work with error 'Cannot cast timestamp(3) with time zone to json'.

SELECT CAST(
X'65683F'
AS JSON)

does not work with error 'Cannot cast varbinary to json'.

As Spark SQL and potentially other engines do, Trino could convert these values. For example, the following Spark SQL example

SELECT TO_JSON(
struct(X'65683F')
) AS x

is fine and returns something like {"col1":"ZWg/"}.

Can the cast for timestamp with time zone as json be fixed?

Would Trino accept to add support to cast binaries to json?

kasiafi commented 2 years ago

Hi @cccs-joel, we have recently added a couple of functions to process JSON. You can use them to read JSON data from varbinary. Please note that these functions won't return JSON type, but they return character string or binary string representation. JSON type is not ANSI-compliant, and with the new functions, we followed the spec carefully.

Let me show how it works. The varbinary representation of your input is:

trino:tiny> SELECT CAST ('{"col1":"ZWg/"}' AS varbinary);
                    _col0
----------------------------------------------
 7b 22 63 6f 6c 31 22 3a 22 5a 57 67 2f 22 7d
(1 row)

I'll use the json_query function to read that varbinary, and return back the text:

trino:tiny> SELECT JSON_QUERY(X' 7b 22 63 6f 6c 31 22 3a 22 5a 57 67 2f 22 7d', 'strict $');
      _col0
-----------------
 {"col1":"ZWg/"}
(1 row)

The second argument strict $ is the JSON path. Here I used it only to return the input object ($), but it is a very powerful tool to dig into your JSON data.

The additional bonus is that the json_query function parses and validates the JSON. It also formats it to the canonical form, by removing whitespace.

If you're interested in json_query and other JSON functions we added, or in the capabilities of processing with JSON path, here's the relevant docs section: https://trino.io/docs/current/functions/json.html

cccs-joel commented 2 years ago

Hi @kasiafi . Thanks for your reply. It's actually the other way around: we want to convert a Varbinary or timestamp to JSON. These two types are not currently supported by the CAST(x as JSON)

wenrui-meng-ai commented 2 months ago

Bump up this issue.

We also have need to cast the varbinary to json. Most of our table in lakehouse are ingested from proto object. In proto, byte array converted to varbinary. I made this PR to convert the column to json so it can be deserialize the json back to original proto object with Base64 encoding

Here is my PR

Is there any objection to support casting varbinary to json?