Cast nested JSON field into string get wrong result

sunshineo commented 6 years ago

Data: {"test":{"test1":"test1","test2":2}} The content of "test" is dynamic. Here it has 2 fields, one string, one number. Table definition: CREATE EXTERNAL TABLE IF NOT EXISTS test ( customAttributes string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) LOCATION 's3://***';

However when I SELECT, I got {"test1":"test1","test2":"2"} Notice the 2 became a string instead of a number. Thank you very much

rcongiu commented 4 years ago

The SerDe parses everything as string, since it cannot know if a number is an int, a float , a bigint or a double. So it delays it, keeps whatever it finds in the json as string, then relies on the schema passed at the moment of the serde initialization to convert it to the correct type. For this reason, when doing dynamic discovery, it is passed as string, which is also the safest way. The SerDe is not in the business of trying to guess what type your field may be, since it's a very hard problem to solve in a distributed environment. Closing since this behavior is expected and is correct.

sbrandtb commented 2 years ago

@rcongiu

I disagree - a bit. If we were talking about recognising that some particular string field should be a timestamp or date - yes, I agree. But look at following, a bit more complete example:

{"the_name": "a", "the_data": {"null_list": [null], "int_list": [1980, 12, 12], "scalar_bool": true, "scalar_null": null}}

CREATE EXTERNAL TABLE the_table (
  the_name string,
  the_data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'where-ever'

SELECT * FROM the_table

+--------+--------------------------------------------------------------------------------------------+
|the_name|the_data                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|a       |{"scalar_bool":true,"int_list":["1980","12","12"],"scalar_null":"null","null_list":["null"]}|
+--------+--------------------------------------------------------------------------------------------+

Please note that

explicit.null does not change the behaviour for null in my example

What we can see:

Basic literals like int and null are cast to string
But, and here is the place where I disagree with you: bool stays bool, which is super weird.

I do not think that we expect the serde to detect datetimes or other weird casts. Just that basic literals like ints, floats, null and bools stay ints, floats, null and bools.

My use case is handling of dynamic schemas in keys, where loading columns as strings and then using json_extract in PrestoDB's or Spark's side is very useful. As of now, it works, but is very tedious given that we need to cast and use NULLIF(..., "null")

Related to #208 I think.

rpovilaityte commented 1 year ago

sbrandtb I totally agree, especially about the NULL behavior.. Did you find any other way instead of nullifying (NULLIF(..)) the values?

sbrandtb commented 1 year ago

@rpovilaityte Not directly. Our workaround is to use the fact that the SerDe is very forgiving to our advantage. The SerDe is basically fine with cases:

Field defined, but missing from JSON: returns null
Field defined, present in JSON, is null: returns null
Field defined, wrong data type, but can be implicitly cast: cast type returned

The only case that this SerDe can not handle is a defined field with data type that can not be cast, i.e. in my example if we would define int_list as a struct.

So, ultimately, what we are doing is to always define the superset of all expected schemas in our queries and we luckily did never run into a situation where a particular field has such differing data types that we would have to use dynamic extraction with json_extract().

The only disadvantage is that if a record violates the schema, then you might end with being unable to parse the particular file entirely. In our case, we do schema validation before data enters our data lake, so that issue is 99% taken care of.

rcongiu / Hive-JSON-Serde

Cast nested JSON field into string get wrong result #204