rcongiu / Hive-JSON-Serde

Read - Write JSON SerDe for Apache Hive.
Other
733 stars 391 forks source link

Cast nested JSON field into string get wrong result #204

Closed sunshineo closed 4 years ago

sunshineo commented 6 years ago

Data: {"test":{"test1":"test1","test2":2}} The content of "test" is dynamic. Here it has 2 fields, one string, one number. Table definition: CREATE EXTERNAL TABLE IF NOT EXISTS test ( customAttributes string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) LOCATION 's3://***';

However when I SELECT, I got {"test1":"test1","test2":"2"} Notice the 2 became a string instead of a number. Thank you very much

rcongiu commented 4 years ago

The SerDe parses everything as string, since it cannot know if a number is an int, a float , a bigint or a double. So it delays it, keeps whatever it finds in the json as string, then relies on the schema passed at the moment of the serde initialization to convert it to the correct type. For this reason, when doing dynamic discovery, it is passed as string, which is also the safest way. The SerDe is not in the business of trying to guess what type your field may be, since it's a very hard problem to solve in a distributed environment. Closing since this behavior is expected and is correct.

sbrandtb commented 2 years ago

@rcongiu

I disagree - a bit. If we were talking about recognising that some particular string field should be a timestamp or date - yes, I agree. But look at following, a bit more complete example:

{"the_name": "a", "the_data": {"null_list": [null], "int_list": [1980, 12, 12], "scalar_bool": true, "scalar_null": null}}
CREATE EXTERNAL TABLE the_table (
  the_name string,
  the_data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'where-ever'

SELECT * FROM the_table
+--------+--------------------------------------------------------------------------------------------+
|the_name|the_data                                                                                    |
+--------+--------------------------------------------------------------------------------------------+
|a       |{"scalar_bool":true,"int_list":["1980","12","12"],"scalar_null":"null","null_list":["null"]}|
+--------+--------------------------------------------------------------------------------------------+

Please note that

What we can see:

I do not think that we expect the serde to detect datetimes or other weird casts. Just that basic literals like ints, floats, null and bools stay ints, floats, null and bools.

My use case is handling of dynamic schemas in keys, where loading columns as strings and then using json_extract in PrestoDB's or Spark's side is very useful. As of now, it works, but is very tedious given that we need to cast and use NULLIF(..., "null")

Related to #208 I think.

rpovilaityte commented 1 year ago

sbrandtb I totally agree, especially about the NULL behavior.. Did you find any other way instead of nullifying (NULLIF(..)) the values?

sbrandtb commented 1 year ago

@rpovilaityte Not directly. Our workaround is to use the fact that the SerDe is very forgiving to our advantage. The SerDe is basically fine with cases:

The only case that this SerDe can not handle is a defined field with data type that can not be cast, i.e. in my example if we would define int_list as a struct.

So, ultimately, what we are doing is to always define the superset of all expected schemas in our queries and we luckily did never run into a situation where a particular field has such differing data types that we would have to use dynamic extraction with json_extract().

The only disadvantage is that if a record violates the schema, then you might end with being unable to parse the particular file entirely. In our case, we do schema validation before data enters our data lake, so that issue is 99% taken care of.