Closed ChuckConnell closed 7 years ago
You should then declare the column as "String" instead of int. The SerDe will be able to read the numbers into strings, and you can then CAST them in Hive.
But what if I have a million correct rows in the Hive table, and just one int value is a string? It doesn't make sense to always declare all ints as strings, just in case one value is bad.
Chuck
I guess a better solution would be to check if the value is a string, and use parseInt/parseFloat etc.
That can be done, although it will require some time because of how SerDes work. I will need to extend all the primitive numeric ObjectInspectors (smallint, int, float, etc.) and have them parse the string if it is one.
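The lenient parsing described above might look roughly like this (a hypothetical helper for illustration, not the SerDe's actual ObjectInspector code): if the JSON value arrived as a string, try to parse it as the declared numeric type, and fall back to null for garbage like "ten".

```java
// Hypothetical sketch of lenient int handling; not the SerDe's real code.
public class LenientIntParser {

    // Returns the int value if possible, or null for unparseable data.
    public static Integer getInt(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        if (value instanceof String) {
            try {
                return Integer.parseInt(((String) value).trim());
            } catch (NumberFormatException e) {
                return null; // bad data like "ten" would surface as NULL
            }
        }
        return null;
    }
}
```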
I was just thinking that it should return NULL if any datatype is wrong, which is what the SerDe properties setting said it would do. I don't think the SerDe is obligated to cast my (incorrect) "ten" into 10.
No, but if you have "12" it probably should. Still, overriding the primitive object inspector is not a one-line fix...
OK, up to you. I don't mind getting a NULL if I have bad data.
I want to ask something: I have nested JSON data, and the SerDe does not read an int value that is nested inside.
{"id":123,"location":"LA","person":{"name":"enes","surname":"yucer","salary":6000}}
I don't get the salary value, but I do get id.
CREATE EXTERNAL TABLE USER (
  id INT,
  person STRUCT<name:STRING, surname:STRING, salary:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/enes/user'
Is there any suggestion? Thanks.
Same here, ignore.malformed.json is not working. The columns are of type string.
java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Illegal escape
Thanks!
I have the same opinion as spoom: it seems to not be working as expected. I will take a look. I'm using Hive 0.9.0.
--bump--
All columns are of type string, but with ignore.malformed.json set to true it doesn't seem to be working correctly. Example:
create external table t1 (
  f1 string,
  f2 string
)
partitioned by (y string, m string, d string, h string)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ("ignore.malformed.json" = "true")
location '${hiveconf:INPUT}';
some data such as this exists:
"f2":"ABCD\_123\_456\_789"
After querying, it dies on this content:
...
Caused by: org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Illegal escape. at 164 [character 165 line 1]
    at org.openx.data.jsonserde.JsonSerDe.onMalformedJson(JsonSerDe.java:382)
    at org.openx.data.jsonserde.JsonSerDe.deserialize(JsonSerDe.java:167)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:628)
    ... 9 more
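For context, the "Illegal escape" comes from the `\_` sequence in the data: the JSON grammar only allows a backslash to be followed by a small set of characters (`"`, `\`, `/`, `b`, `f`, `n`, `r`, `t`, or `u` plus four hex digits). A minimal check of that rule (a hypothetical helper for illustration, not part of the SerDe):

```java
// Checks whether a character is legal after a backslash in a JSON string,
// per RFC 8259. '_' is not, which is why "ABCD\_123_456_789" is rejected.
public class JsonEscapeCheck {
    public static boolean isValidEscape(char c) {
        return "\"\\/bfnrtu".indexOf(c) >= 0;
    }
}
```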
The rathboma fork doesn't present this issue; it worked for me :) https://github.com/rathboma/Hive-JSON-Serde-1/network
We have been using the JSON SerDe mostly and have resolved a lot of casting issues during deserialization:
1. Resolved cases where the data values are being cast from primitive types to Java object types, using parse etc.
2. We have also resolved the ALTER TABLE REPLACE schema issue by not caching the schema, and seen that changing the SerDe and modifying the schema corrupts the older partitions. But with ADD/CHANGE COLUMN it now works fine.
3. Recently we encountered an issue where we had an array at the root level of the table. This had to be handled in the JsonStructObjectInspector, where potentially you can get a JSONArrayObject (and not only a List object) and the return type should always be JSONObject. With this we also eliminated a redundant cast step.
We wanted to know: what is the best way to identify the set of all objects that we can expect and would need a cast, as every now and then we get a different ClassCastException?
Your response will be appreciated.
Thanks
For the casting, the issues should already all be resolved in the latest release, as I changed how primitive types are parsed, delaying parsing.
For handling arrays, I have to think about the best strategy; maybe a special setting to tell the SerDe to expect them.
Thanks for your response.
For handling JSONArray we made this change and it works! Added this to getStructFieldData:

if (data instanceof org.openx.data.jsonserde.json.JSONArray) {
    return getStructFieldDataFromList(
            ((org.openx.data.jsonserde.json.JSONArray) data).getAsArrayList(),
            fieldRef);
}
Changed this method to not cast again to JSONObject (the cast looks redundant):
public List<Object> getStructFieldsDataAsList(Object o) {
    values.clear();
    for (int i = 0; i < fields.size(); i++) {
        values.add(getStructFieldData(o, fields.get(i)));
    }
    return values;
}
For the primitive cast we have done something similar, by creating different OIs that perform parsing. Question: are there any other data types that are expected to be cast (maybe Map or another Collection)?
That's a good idea, I'll see if I can add it tonight or this weekend to the codebase.
"Good judgment comes from experience.
Data Engineer - OpenX.org Pasadena, CA Skype: sardodazione Y! IM: rcongiu
Hi Roberto, just checking again. Question: are there any other data types that are expected to be cast (maybe Map, any other Collection, or others)? Your response would be appreciated.
Thanks!
I am not sure what you mean, but most times the problem with casting is that the data is not in the form Hive expects, so if you declare a column as a string and you receive an array, you'd get a ClassCastException.
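That failure mode can be reproduced in a few lines (a standalone illustration, not SerDe code): a checked cast from a scalar to a collection type throws at runtime, which is what surfaces as the ClassCastException in the query.

```java
import java.util.List;

// Illustrates the ClassCastException described above: Hive effectively
// casts the deserialized object to the declared column type.
public class CastDemo {
    public static boolean canUseAsList(Object data) {
        try {
            List<?> list = (List<?>) data; // throws if data is a scalar
            return list != null;
        } catch (ClassCastException e) {
            return false; // scalar value where an array column was declared
        }
    }
}
```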
OK. I guess I was asking ignore.malformed.json to hide this problem. But I see that the JSON in my original post is actually correct (not malformed) although it is inconsistent. So this is not really a reasonable request. I withdraw the bug report.
So, reading this thread from the beginning, I take it the problem is that with mixed data it would still crash (like the string "ten" instead of an int). I'll have a look at ignore.malformed.json. I changed the way numbers are handled, and now I realize it may have broken ignore.malformed.json (since the number conversion now happens after the actual JSON parsing is done).
Just wanted to clarify "receive an array" (my question is how to identify which type will be returned/expected: a JSONArray, a List (ArrayList), any other collection, or an Object)?
I've mostly worked around this problem by introducing json schema validation before input into Hive, but I would still greatly appreciate an option to allow incorrect data types to return null instead of causing a NumberFormatException and crashing the query.
Are you still getting this with the latest code? I did change the code to handle numbers differently; that is, the JSON parser will leave them as strings, and the conversion to the actual type happens at the end, so you shouldn't be getting the NumberFormatException anymore. If you are, can you provide an example?
Yes, I am getting this with the latest release (1.1.9.2). In particular, I've got an object that looks something like this:
{
"subobject" : {
"field" : "ab6c3ab"
}
}
The serde looks something like this:
CREATE EXTERNAL TABLE table__serde_view (
subobject struct<
field:int
>
)
PARTITIONED BY (field1 STRING, field2 INT, field3 INT, field4 INT, field5 INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Thanks, will look at it later.
This is the stack trace I'm receiving when using 1.1.9.3-SNAPSHOT (HEAD).
Caused by: java.lang.NumberFormatException: For input string: "Matchmaking"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.parseInt(Integer.java:527)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:46)
    at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:204)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:546)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:618)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:136)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
    ... 9 more
Thanks, I found where the issue is; I just have to decide how to fix it.
Any word on this?
Still working on this. Unfortunately it requires quite a bit of code change, as I need to make ObjectInspectors either fail or not, depending on the 'ignore' setting. I'll see if I can finish it this weekend. R.
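The behavior described here, ObjectInspectors that either fail or return NULL depending on the 'ignore' setting, could be sketched roughly like this (a hypothetical outline, not the actual implementation):

```java
// Hypothetical sketch: parsing that is strict or lenient depending on
// an ignore-malformed flag, as described for the ObjectInspectors.
public class LenientOI {
    private final boolean ignoreMalformed;

    public LenientOI(boolean ignoreMalformed) {
        this.ignoreMalformed = ignoreMalformed;
    }

    public Integer parse(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            if (ignoreMalformed) {
                return null; // ignore.malformed.json=true -> NULL column
            }
            throw e; // strict mode keeps the current failing behavior
        }
    }
}
```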
Hey, this is the problem I am facing using 1.1.9.2:
Caused by: java.lang.NumberFormatException: For input string: "?a(?"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at org.openx.data.jsonserde.json.JSONTokener.nextString(JSONTokener.java:284)
at org.openx.data.jsonserde.json.JSONTokener.nextValue(JSONTokener.java:366)
at org.openx.data.jsonserde.json.JSONObject.
So I am querying a huge volume of data which is dumped to Hadoop via a web application, so it may have garbage values in files; but when I am querying, it should never fail.
First problem: if I have multiple partitions, I won't know in which file the data is corrupted. Second problem: when I am querying such a huge volume (like 1 TB), I do not care about a little garbage data.
So you should either add an option to ignore all errors while processing a single line, or catch all exceptions.
Bumping the thread -- still not able to ignore malformed JSON in version 1.3.
I believe I am having a somewhat similar problem. I have a collection of JSON records where one field is misbehaving, and I am not sure of the best way to handle it in Hive. So far, I have had great luck with this SerDe.
The authors field is sometimes a single string and sometimes an array (thanks, data provider). I typed it as an array.
I read this and a few other issues but don't see exactly how I am meant to handle this. Obviously a relatively novice Hive user here, though quite familiar with JSON.
Mmm... it should be relatively easy to change it so that if you have an array field and you get a scalar, it converts it to a 1-element array. I think that would solve your problem.
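The scalar-to-array promotion mentioned here is conceptually simple. A sketch using plain Java lists (an assumption for illustration; the SerDe works with its own JSONArray type):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of promoting a scalar to a 1-element array, as described above.
public class ScalarPromotion {
    @SuppressWarnings("unchecked")
    public static List<Object> promote(Object value) {
        if (value instanceof List) {
            return (List<Object>) value; // already an array, pass through
        }
        List<Object> wrapped = new ArrayList<>();
        wrapped.add(value); // wrap the scalar in a 1-element list
        return wrapped;
    }
}
```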
Ok...I actually pushed a change that should work for you...it's in the devel branch and should work fine (I added a unit test too). See https://github.com/rcongiu/Hive-JSON-Serde#promoting-a-scalar-to-an-array and please let me know if it works for you.
Thank you Roberto; great stuff.
I have to wait for someone to load the jar for me so it may be until Monday when i can confirm this fixes my issue as i suspect it does.
Cord
Actually,
I have installed it on a test host and am fairly certain I did that correctly. Attached is a sample dataset that fails, and here is my schema. This is publicly available data.
Does this work for you? Or am I doing something really wrong?
Executing this query fails starting at limit 6, which is CB-142
select document_number, authors from pubs limit 6;
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONArray
Cord
ADD JAR …. hive/lib/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE pubs (
document_number STRING,
document_title STRING,
series STRING,
inv_status STRING,
product_formats STRING,
product_date SMALLINT,
long_abstract STRING,
short_abstract STRING,
categories ARRAY
Check that - it would seem I did not properly reload the SerDe library. This resolved my issue. Thank you, Roberto.
yay! You're welcome.
I think the OP's issue is fixed; this can be closed?
Yes
A value with a wrong datatype causes the generated MR job to crash. ignore.malformed.json does not seem to fix it.
Here is the sample data, mixed2.json
Here is the sample Hive script, mixed2.hive. The first query (on f1) works. The other queries (on * and f2) crash. It would be nice to see NULL or something else. The get_json_object() function actually returns the bad string, so it prints "ten"!