rcongiu / Hive-JSON-Serde

Read - Write JSON SerDe for Apache Hive.

Unable to infer the decimal and cast to int #211

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hello Developers,

I have a JSON file with the following content:

```json
{"mynumber":4.5}
{"mynumber":2}
```

I created the following table in EMR Hive, using the OpenX JSON SerDe to parse this file:

```sql
create external table test (mynumber int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location 'file:///home/hadoop/myjsondata';
```

The table was successfully created and its definition looks as follows:

```
hive> show create table test;
OK
CREATE EXTERNAL TABLE test(
  mynumber int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'file:/home/hadoop/myjsondata'
TBLPROPERTIES (
  'transient_lastDdlTime'='1541113574')
Time taken: 0.722 seconds, Fetched: 12 row(s)
```

Now when I try to query this table, I get the following error:

```
hive> select * from test;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NumberFormatException: For input string: "4.5"
Time taken: 1.815 seconds
```

My expectation is that 4.5 is a valid double and hence should be cast to an integer, yielding 4 for the first record.

However, the SerDe interpreted 4.5 as a string and could not cast that string to an integer. When I create a new Hive table with mynumber declared as double, it works fine.

Could you please let me know what I'm missing here?

rcongiu commented 5 years ago

Yes, 4.5 is a valid double, but you are creating the table as int.

Try with:

```sql
create external table test (mynumber double)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location 'file:///home/hadoop/myjsondata';
```

It's trying to parse 4.5 as an int and rightly failing since it's not an int.
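The failure can be reproduced outside Hive. This is not the SerDe's actual code path, just a minimal plain-Java sketch of the same behavior: parsing "4.5" directly as an int throws exactly the `NumberFormatException` seen in the query, while parsing it as a double and then narrowing succeeds.

```java
public class ParseDemo {
    public static void main(String[] args) {
        // Parsing "4.5" directly as an int fails, matching the
        // java.lang.NumberFormatException: For input string: "4.5"
        // reported by the Hive query.
        try {
            Integer.parseInt("4.5");
        } catch (NumberFormatException e) {
            System.out.println("int parse failed: " + e.getMessage());
        }

        // Parsing as a double works; an explicit narrowing cast
        // then truncates 4.5 down to 4.
        double d = Double.parseDouble("4.5");
        int truncated = (int) d;
        System.out.println("double=" + d + " truncated=" + truncated);
    }
}
```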

ghost commented 5 years ago

Hello @rcongiu ,

Thank you very much for the super quick response! As I said earlier, when I change the type to 'double' in the DDL, it works fine.

When I create the table with the native Hive JSON SerDe and declare the "mynumber" column as int, the double values are properly cast to int.

Can't I expect the same behavior with openx JSON serde?

```sql
create external table test_hivejsonserde (mynumber int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
location 'file:///home/hadoop/myjsondata';
```

```
hive> show create table test_hivejsonserde;
OK
CREATE EXTERNAL TABLE test_hivejsonserde(
  mynumber int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'file:/home/hadoop/myjsondata'
TBLPROPERTIES (
  'transient_lastDdlTime'='1541179985')
Time taken: 0.136 seconds, Fetched: 12 row(s)
```

```
hive> select * from test_hivejsonserde;
OK
4
2
Time taken: 0.133 seconds, Fetched: 2 row(s)
```

rcongiu commented 5 years ago

I disagree that that's what should happen. In the design of the SerDe we favor correctness over convenience. The worst kind of bugs happen when something is silently changed without warning; in this case, you'd have floating-point numbers silently cast to something else. I think the better solution is to either fix the data at the source so it produces integers, or to create the table as double and have the user cast to int. The table should be defined as the data is; if you really want an automatic cast, you can always write a view on top of the table.
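The view-on-top approach suggested above can be sketched in HiveQL. The table and view names here are illustrative, not from the thread:

```sql
-- Define the table to match the data as it actually is (doubles).
CREATE EXTERNAL TABLE test_raw (mynumber double)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'file:///home/hadoop/myjsondata';

-- Make the narrowing explicit in a view, so nothing is cast silently.
CREATE VIEW test_int AS
SELECT CAST(mynumber AS int) AS mynumber
FROM test_raw;
```

Queries that need integers read from `test_int`, while the raw table still reflects the source data faithfully.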

ghost commented 5 years ago

Hey @rcongiu ,

Thanks for confirming the expected behaviour. I'll try to correct either the schema or the source data.