sztanko opened this issue 8 years ago
I am looking at the patch and I think it's a good idea, but the main concern I have is that in case of an error it returns a default value of 0; I think it should be NULL. Let's say you have a field like salary, and the parsing of "$100,000" fails because of the dollar sign... you'd have a record that has been silently converted from 100000 to 0, so anybody taking AVG or MIN on that field will be affected. Let me think about it....
I absolutely agree with you, and that was my initial intention; however, these are primitives, not objects, so returning null is not possible...
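For context, here is a minimal sketch of that constraint, using a hypothetical interface that only mirrors the shape of Hive's primitive ObjectInspector getters (which return Java primitives); the names are made up, not the project's code:

```java
// Hypothetical interface -- a sketch of why NULL cannot be returned at this layer.
interface HypotheticalIntInspector {
    int get(Object o); // primitive return type: null is not expressible here
}

class LenientIntInspector implements HypotheticalIntInspector {
    @Override
    public int get(Object o) {
        try {
            return Integer.parseInt(o.toString());
        } catch (NumberFormatException e) {
            return 0; // forced to pick some concrete default, hence the silent 0
        }
    }
}
```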
Yeah, I am looking at the code and it's a cascade of changes... to make it an option I'd have to change the ObjectInspector caching, plus all the ObjectInspectors... in Scala it would be easy to just pass a function, but it's not that easy in Java.
I will try to do it myself, time permitting.
I am actually working on it too :)
Oh cool, thank you :)
So I spent a good amount of time looking at this and I think it can't be easily done. This is why: right now the JSON parser will not convert strings to numbers, because it doesn't know ahead of time which numeric type to parse them to - int, long, etc. - so it carries around a string, and the JavaStringxxxObjectInspector does the conversion. The problems arise when there's a NULL: because of parsing, a NULL could be an actual NULL, or it could be an unparseable string. Unfortunately Hive only checks whether the object passed in is NULL; an unparseable string is not NULL, so Hive calls the get() method on the ObjectInspector thinking it's safe, and the ObjectInspector fails. The right way to do it is to have the table schema passed down to the JSON parser and have that do the conversion, so we don't have to worry about it later.
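A rough sketch of that failure mode, with hypothetical stand-ins for the value the JSON parser carries around and for the inspector's string-to-number conversion (this is not the project's actual code):

```java
// Hypothetical stand-ins for the parser's carried value and the inspector's conversion.
class UnparseableStringSketch {
    static final Object parsedField = "$100,000"; // the parser kept the raw string

    static int inspectorGet(Object o) {
        // stands in for the JavaStringxxxObjectInspector-style conversion
        return Integer.parseInt(o.toString());
    }

    public static void main(String[] args) {
        if (parsedField != null) {             // Hive-style NULL check passes: the value is a string, not null
            int v = inspectorGet(parsedField); // ...and the conversion throws NumberFormatException here
            System.out.println(v);
        }
    }
}
```

This is why pushing the table schema down into the JSON parser helps: the string-to-number decision (and the choice to yield NULL) would happen before Hive's null check rather than after it.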
Handle invalid primitives in json:
Problem we experience quite often:
Looking at the exception stack trace and the relevant code (MapOperator.java:process), my understanding is that currently there is no way to catch these exceptions via configuration options (e.g. mapreduce.job.skiprecords). The only thing that seems viable is to catch these kinds of problems with a try/catch inside the methods of ParsePrimitiveUtils. I understand this is probably the ugliest solution, and a better pattern might be to pre-validate the JSON, but unfortunately that isn't feasible in our case.
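For illustration, a hedged sketch of the kind of try/catch guard described above; the class name, method name, and the 0 fallback are assumptions for this sketch, not necessarily what ParsePrimitiveUtils actually contains:

```java
// Assumed names -- a sketch of the try/catch guard, not the actual ParsePrimitiveUtils code.
final class LenientParse {
    private LenientParse() {}

    static long parseLongLenient(String s) {
        try {
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            return 0L; // swallow the bad value instead of failing the whole task
        }
    }
}
```

The fallback to 0 is exactly the silent-default behavior flagged at the top of this thread, which is why this is called the ugliest solution rather than the ideal one.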
I would be glad if you merged this PR; however, I understand that this is not the ideal solution and you might not accept it. In any case, I am open to your suggestions regarding this.