rcongiu / Hive-JSON-Serde

Read/Write JSON SerDe for Apache Hive.

[add]: Added try/catches to primitive parsing methods #127

Open sztanko opened 8 years ago

sztanko commented 8 years ago

Handle invalid primitives in JSON.

A problem we run into quite often:

dfs -rm -f -r /tmp/err_test;
dfs -mkdir /tmp/err_test;
drop table if exists err_test;
drop table if exists err_string_test;
--add jar /tmp/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
create external table err_test(a int) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location '/tmp/err_test/';
-- plain-text table over the same location, used to write raw JSON lines
create external table err_string_test(a string) location '/tmp/err_test/';
insert into err_string_test VALUES('{"a" :"not_an_int"}'),('{"a" :"1"}');
select * from err_string_test order by a LIMIT 10; -- this goes through
select * from err_test order by a LIMIT 10;        -- this fails

Looking at the exception stack trace:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:226)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 ]
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:226)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 ]
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NumberFormatException: For input string: "not_an_int"
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:403)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
    ... 9 more
Caused by: java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:682)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:631)
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.toHiveKey(ReduceSinkOperator.java:507)
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:355)
    ... 15 more

and the relevant code, from MapOperator.process():

for (MapOpCtx current : currentCtxs) {
  Object row = null;
  try {
    row = current.readRow(value, context);
    if (!current.forward(row)) {
      childrenDone++;
    }
  } catch (Exception e) {
    // TODO: policy on deserialization errors
    String message = toErrorMessage(value, row, current.rowObjectInspector);
    if (row == null) {
      deserialize_error_count.set(deserialize_error_count.get() + 1);
      throw new HiveException("Hive Runtime Error while processing writable " + message, e);
    }
    throw new HiveException("Hive Runtime Error while processing row " + message, e);
  }
}

My understanding is that there is currently no way to suppress these exceptions via configuration options (e.g. mapreduce.job.skiprecords). The only viable approach seems to be catching these problems with try/catch inside the methods of ParsePrimitiveUtils. I understand this is probably the ugliest solution, and a better pattern might be to pre-validate the JSON, but unfortunately that isn't feasible in our case.
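
A minimal sketch of the kind of guard this patch proposes (illustrative only, not necessarily the submitted code; the fallback value of 0 is a known drawback, discussed below):

public static int parseInt(String s) {
  try {
    return Integer.parseInt(s);
  } catch (NumberFormatException e) {
    // Swallow the parse failure so a single malformed value no longer
    // kills the whole task. Primitive int return type: no null available.
    return 0;
  }
}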

I would be pleased if you merged this PR; however, I understand that it is not ideal and you might not accept it. In any case, I am open to your suggestions.

rcongiu commented 8 years ago

I am looking at the patch and I think it's a good idea, but my main concern is that in case of an error it returns a default value of 0; I think it should be NULL. Say you have a field like salary, and the parsing of "$100,000" fails because of the dollar sign: you'd end up with a record silently converted from 100000 to 0, so anybody taking AVG or MIN on that field will be affected. Let me think about it...
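
To illustrate with a hypothetical payroll table where "$100,000" silently became 0 (SQL aggregates skip NULLs but count zeros):

select avg(salary) from payroll;
-- with the 0 default over rows (0, 50000): returns 25000
-- if the bad row were NULL instead:        returns 50000, since AVG ignores NULLs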

sztanko commented 8 years ago

I absolutely agree with you, and that was my initial intention; however, these are primitives, not objects, so returning null is not possible...
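
Roughly, the constraint looks like this (a simplified sketch, not the exact code of JavaStringIntObjectInspector):

// Hive's IntObjectInspector contract declares a primitive return type,
// so an unparseable string cannot be signalled with null from here:
// the only options are a default value or an exception.
public int get(Object o) {
  return ParsePrimitiveUtils.parseInt(o.toString());
}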

rcongiu commented 8 years ago

Yeah, I am looking at the code and it's a cascade of changes... to make it an option I'd have to change the object inspector caching, plus all the object inspectors... in Scala it would be easy to just pass a function, but it's not that easy in Java.

sztanko commented 8 years ago

I will try to do it myself, time permitting.

rcongiu commented 8 years ago

I am actually working on it too :)

sztanko commented 8 years ago

Oh cool, thank you :)

rcongiu commented 8 years ago

So I spent a good amount of time looking at this, and I think it can't easily be done. Here is why: right now the JSON parser will not convert strings to numbers, because it doesn't know ahead of time which numeric type to parse to (int, long, etc.), so it carries the string around and the JavaStringxxxObjectInspector does the conversion. The problem arises when there's a NULL: after parsing, a NULL could be an actual NULL, or it could be an unparseable string. Hive checks whether the object passed in is NULL and, since it isn't, calls the get() method on the object inspector because it thinks that's safe, but the object inspector then fails. The right way to do it is to pass the table schema down to the JSON parser and have it do the conversion there, so we don't have to worry about it later.
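
A rough sketch of that direction (illustrative only, not the library's API; parseField is a hypothetical helper, and PrimitiveCategory is Hive's PrimitiveObjectInspector.PrimitiveCategory): converting while the value is still a boxed Object makes NULL a legal result again.

import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;

// If the target Hive type is known while parsing, the conversion can
// happen here, where an unparseable field can legitimately become NULL.
static Object parseField(String raw, PrimitiveCategory target) {
  try {
    switch (target) {
      case INT:    return Integer.parseInt(raw);
      case LONG:   return Long.parseLong(raw);
      case DOUBLE: return Double.parseDouble(raw);
      default:     return raw;
    }
  } catch (NumberFormatException e) {
    return null; // boxed Object: null is a valid result at this stage
  }
}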