jrask opened this issue 11 years ago
Jrask -- I am also working with this SerDe, and would very much like to get it running. There is a typo in your post though. You said
But when I write hive> select payload.params from fresspayload where payload.params.session.id.value == "xxxx";
but you did not say what happens in this case...??
One issue to keep in mind is that there is not (I am fairly sure) a one-to-one correspondence between JSON datatypes and Hive datatypes. So it is sometimes tricky to match types.
Jrask -- I am frustrated trying to get this SerDe to run at all. I know many other people have, though, so I must have some small error in my setup. Please tell me:
Thank you, Chuck
Maybe I was unclear, I have updated the original issue.
I have built the jar myself from trunk. I am running cdh4.0.1, so it seems it is hive-0.8.1. I am using MRv1, not YARN.
Does your JSON data actually span multiple lines ? Every JSON record should be in a single line.
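[Editor's note: a minimal Python sketch of compacting pretty-printed JSON onto one record per line, which is what the SerDe expects. The helper name and inlined record are illustrative, not from the thread.]

```python
import json

def to_single_lines(records):
    """Serialize each record compactly on its own line (one JSON record per line)."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# A pretty-printed record, as you might paste it for readability:
pretty = """
{
  "lname": "smith",
  "fname": "bob"
}
"""
record = json.loads(pretty)
print(to_single_lines([record]))  # → {"lname":"smith","fname":"bob"}
```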
No, it was just for readability I pasted it in formatted. It is a single line (except that the formatting disappeared....)
Jrask -- The following is valid JSON, but there is no direct correspondence in Hive. Hive arrays must all be of one type.
{ "lname": "smith", "fname": "bob", "mixed-array": [2, 3, {"tag1": "a string" } ] }
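[Editor's note: since Hive arrays must be homogeneous, a record like the one above cannot be mapped directly. A rough Python sketch (helper name is hypothetical) that flags arrays whose elements are not all of one JSON type:]

```python
import json

def mixed_arrays(obj, path="$"):
    """Yield JSONPath-like locations of arrays with heterogeneous element types."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from mixed_arrays(v, f"{path}.{k}")
    elif isinstance(obj, list):
        types = {type(e).__name__ for e in obj}
        if len(types) > 1:  # more than one element type -> no Hive array equivalent
            yield path
        for i, e in enumerate(obj):
            yield from mixed_arrays(e, f"{path}[{i}]")

record = json.loads('{"lname":"smith","fname":"bob","mixed-array":[2,3,{"tag1":"a string"}]}')
print(list(mixed_arrays(record)))  # → ['$.mixed-array']
```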
Well... I do not have any mixed arrays. My guess is that there might be a problem with nested structs. I have also seen Hive errors in the tasktracker logs, depending on how I create my tables.
But when these errors occur I get an error in the MR job and no result back so this is not really part of this discussion.
2012-09-11 11:23:23,454 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201209041503_0080_m_000000_3 :
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","jsonrpc":"2.0","params":{"session":{"id":{"type":"msisdn","value":"xxx"},"type":"app"}}}}
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","jsonrpc":"2.0","params":{"session":{"id":{"type":"msisdn","value":"xxx"},"type":"app"}}}}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
    ... 8 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:569)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
It all works for me now (thanks to rcongiu straightening out my jar directory). I am using nested structs and an array within a struct. Here is my sample data:
{"lname" : "smith", "fname" : "bob", "array1" : [6,5,4], "obj1" : {"field1":"hello", "field2":"goodbye", "array2":[10,11,12]}} {"lname" : "jones", "fname" : "bob", "array1" : [1,2,3], "obj1" : {"field1":"this", "field2":"that", "array2":[13,14,15]}} {"lname" : "white", "fname" : "bill", "array1" : [9,8,7], "obj1" : {"field1":"yin", "field2":"yang", "array2":[16,17,18]}} {"lname" : "black", "fname" : "betty", "array1" : [10,11], "obj1" : {"field1":"bread", "field2":"butter", "array2":[19,21,22]}}
with the following table description:
lname   string                                                  from deserializer
fname   string                                                  from deserializer
array1  array<int>                                              from deserializer
obj1    struct<field1:string,field2:string,array2:array<int>>   from deserializer
I can execute the query:
select fname,lname from t4 where obj1.array2[1]=17;
with the correct result:
bill white
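[Editor's note: as a sanity check outside Hive, the same predicate can be evaluated over the four sample records with plain Python. The records are inlined here; the Hive table name t4 is the one from the post.]

```python
import json

lines = [
    '{"lname":"smith","fname":"bob","array1":[6,5,4],"obj1":{"field1":"hello","field2":"goodbye","array2":[10,11,12]}}',
    '{"lname":"jones","fname":"bob","array1":[1,2,3],"obj1":{"field1":"this","field2":"that","array2":[13,14,15]}}',
    '{"lname":"white","fname":"bill","array1":[9,8,7],"obj1":{"field1":"yin","field2":"yang","array2":[16,17,18]}}',
    '{"lname":"black","fname":"betty","array1":[10,11],"obj1":{"field1":"bread","field2":"butter","array2":[19,21,22]}}',
]
rows = [json.loads(l) for l in lines]

# Equivalent of: select fname, lname from t4 where obj1.array2[1] = 17
hits = [(r["fname"], r["lname"]) for r in rows if r["obj1"]["array2"][1] == 17]
print(hits)  # → [('bill', 'white')]
```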
Any tips on how I can get my example to work?
This worked for me (didn't reproduce the whole structure but should get you started):
Create table testjson2( payload struct <jsonrpc:string, id:string, params:struct< device_info:struct<model:string, device_id:string>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
hive> select payload.jsonrpc, payload.params.device_info.model from testjson2;
OK
2.0    HTC Desire
Thanks, but is that really the correct CREATE statement? There is no model in the table definition but there is in the select, and it ends with two > but no matching < ?
Create table testjson2( payload struct <jsonrpc:string, id:string, params:struct< device_info:struct<model:string, device_id:string>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
mmm... github filters the > and <
Create table testjson2( payload struct <jsonrpc:string,
id:string,
params:struct<
device_info:struct<model:string, device_id:string>
>
> )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
I would simplify, and build it up in stages. Strip out most of the structures, get to a small sample that works, then gradually add more parts.
Trust me, I have broken it down many times, and each time it fails when I try to add session. I have also tried with part of the session, and even renaming session and type in case they were reserved words.
BUT, I checked my original post and the sql is f*cked up. This is the real create statement.
create table payload (
payload struct<id:string,jsonrpc:string,
params:
struct<device_info:
struct<model:string>,
session:
struct<id:
struct<type:string,value:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
Hi again rcongiu,
Not sure if there is anything wrong with my installation, but when I create the table as you suggested, I get the error that I presented above when I do "select payload.params from testjson2".
Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
But if I remove the device_id part and use only the parts below, it works as expected. (It also works if I use only device_id instead of model.) Can you successfully run the same select statement?
Create table testjson2( payload struct <jsonrpc:string,
id:string,
params:struct<
device_info:struct<model:string>
>
> )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
Ok, I have done some more tests and verified two problems (in my environment at least):
1: I am unable to have more than one value inside a struct.
OK:
struct_name:struct<name:type>
MR job fails, stacktrace in tasktracker logs:
struct_name:struct<name:type,name2:type>
2012-09-12 08:17:23,241 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201209041503_0109_m_000000_2 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
... 8 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:569)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
... 9 more
2: I am unable to successfully use more than one struct inside a struct. In my earlier examples, I can successfully replace device_id with session, so the error occurs when there is more than one struct inside the params struct.
Searching with WHERE session.id = seems to work properly, but the result is not returned in the response.
There is no problem creating the tables or loading the data; the problem occurs when running queries, e.g. select payload from testjson2.
Have you verified that these two scenarios work for you? I really want to know if there is something wrong with my setup or if there is a bug somewhere...
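[Editor's note: the two scenarios are easy to regenerate as minimal single-line test records. The snippet below is a sketch; the field names mirror the rows quoted in the stack traces, and the file names are hypothetical.]

```python
import json

# Scenario 1: a struct with two scalar fields (session has both id and value)
scenario1 = {"payload": {"jsonrpc": "2.0",
                         "params": {"session": {"id": "someid", "value": "somevalue"}}}}

# Scenario 2: two structs nested inside the same struct (device_info and session under params)
scenario2 = {"payload": {"params": {"device_info": {"model": "HTC Desire"},
                                    "session": {"id": {"type": "msisdn", "value": "xxx"}}}}}

for name, rec in [("scenario1.json", scenario1), ("scenario2.json", scenario2)]:
    line = json.dumps(rec, separators=(",", ":"))  # one record per line, as the SerDe expects
    print(name, line)
```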
I will test both of these cases. It is important to me to have faith in this solution, so I don't mind doing some more testing. I will try to get to this today.
Chuck
Dumb question... what trick are you guys using to put a literal block in the post, so github does not strip angle brackets?
Thanks! In your comment box, on the right it says "Comments are parsed with GitHub Flavored Markdown"; there you can find all you need. The Preview tab (next to Write) comes in handy.
I wrote an expanded test, with nested structures. It works correctly. Details...
My JSON file is structs.json
{"lname":"smith", "fname":"bob", "address":{"street":"55 maple", "city":"boston", "state":"ma"}, "contact":{"phones":{"home":"123-555-1234", "office":"123-555-7890"}, "emails":{"home":"bob@home.org", "office":"bob@office.org"}}}
{"lname":"jones", "fname":"bob", "address":{"street":"44 maple", "city":"medford", "state":"ma"}, "contact":{"phones":{"home":"123-444-1234", "office":"123-444-7890"}, "emails":{"home":"jones@home.org", "office":"jones@office.org"}}}
{"lname":"greene", "fname":"bill", "address":{"street":"33 maple", "city":"revere", "state":"ma"}, "contact":{"phones":{"home":"123-333-1234", "office":"123-333-7890"}, "emails":{"home":"bill@home.org", "office":"bill@office.org"}}}
My Hive script is structs.hive
drop table t9;
create table t9
(lname string,
fname string,
address struct <street:string, city:string, state:string>,
contact struct <
phones:struct <home:string, office:string>,
emails:struct <home:string, office:string>
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile;
load data inpath '/tmp/structs.json' overwrite into table t9;
describe t9;
select address.street from t9 where fname='bob';
select fname, lname, contact.phones.office from t9 where address.state='ma';
select fname,lname from t9 where contact.emails.home='jones@home.org';
I started Hive with the command line:
hive --auxpath /usr/lib/hive/extra_libs -f structs.hive >structs.out
The output is:
Deleted /user/hive/warehouse/t9
lname string from deserializer
fname string from deserializer
address struct<street:string,city:string,state:string> from deserializer
contact struct<phones:struct<home:string,office:string>,emails:struct<home:string,office:string>> from deserializer
55 maple
44 maple
bob smith 123-555-7890
bob jones 123-444-7890
bill greene 123-333-7890
bob jones
which is correct.
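[Editor's note: the same three selects can be replayed over the structs.json records with plain Python to confirm the expected rows. The records are inlined below; this is a cross-check, not part of the original test.]

```python
import json

lines = [
    '{"lname":"smith","fname":"bob","address":{"street":"55 maple","city":"boston","state":"ma"},"contact":{"phones":{"home":"123-555-1234","office":"123-555-7890"},"emails":{"home":"bob@home.org","office":"bob@office.org"}}}',
    '{"lname":"jones","fname":"bob","address":{"street":"44 maple","city":"medford","state":"ma"},"contact":{"phones":{"home":"123-444-1234","office":"123-444-7890"},"emails":{"home":"jones@home.org","office":"jones@office.org"}}}',
    '{"lname":"greene","fname":"bill","address":{"street":"33 maple","city":"revere","state":"ma"},"contact":{"phones":{"home":"123-333-1234","office":"123-333-7890"},"emails":{"home":"bill@home.org","office":"bill@office.org"}}}',
]
rows = [json.loads(l) for l in lines]

# select address.street from t9 where fname='bob';
streets = [r["address"]["street"] for r in rows if r["fname"] == "bob"]
# select fname, lname, contact.phones.office from t9 where address.state='ma';
offices = [(r["fname"], r["lname"], r["contact"]["phones"]["office"])
           for r in rows if r["address"]["state"] == "ma"]
# select fname, lname from t9 where contact.emails.home='jones@home.org';
jones = [(r["fname"], r["lname"]) for r in rows
         if r["contact"]["emails"]["home"] == "jones@home.org"]

print(streets)  # → ['55 maple', '44 maple']
print(jones)    # → [('bob', 'jones')]
```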
Thanks! Your example works fine, but I am still unable to run mine. I will try to use your example and slowly convert it to my structure. I will get back with my results.
I found that writing the Hive table definition for nested-nested structures was tricky. For plain JSON it is easy: you can just use a JSON viewer to play with the file until it is correct. But afaik there is no convenient "Hive table syntax checker" to help you with the table definition.
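[Editor's note: lacking such a checker, a crude balance check on the angle brackets of a Hive type string catches the most common mistake in nested struct definitions. This is a rough sketch, not a full parser; the function name is hypothetical.]

```python
def brackets_balanced(type_str):
    """Return True if every '<' in a Hive type string has a matching '>'."""
    depth = 0
    for ch in type_str:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:  # closing bracket with no opener
                return False
    return depth == 0

ok = "struct<phones:struct<home:string,office:string>>"
bad = "struct<jsonrpc:string,params:struct<device_info:struct<model:string>"
print(brackets_balanced(ok), brackets_balanced(bad))  # → True False
```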
Hi, I am using this project to see if I can use it to query many large nested documents. So far I have only tried with small documents and I have a question.
Schema
create table fresspayload (
payload struct<id:string,jsonrpc:string,
params:
struct<device_info:struct<model:string>,
session:
struct<id:
struct<type:string,value:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
Json document:
{ "payload": { "jsonrpc": "2.0", "id": "0cd0f2f8-31b6-4038-9864-1bcc01dee188", "params": { "device_info": { "model": "HTC Desire", "device_id": "xxxx" }, "session": { "id": { "type": "msisdn", "value": "xxxxx" }, "type": "app" }, "received_at": "2012-08-28T00:07:21.858Z" }, "method": "connect" }, "protocol_version": "1.1", "route": { "source": { "id": { "type": "msisdn", "value": "xxxx" }, "type": "app" } } }
Once imported I run:
hive> select payload.params from fresspayload;
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
{"device_info":{"model":"HTC Desire"},"session":null}
Time taken: 30.232 seconds
It seems like "session" is null
But when I write a WHERE clause with the session id it returns a hit, see below.
hive> select payload.params from fresspayload where payload.params.session.id.value == "xxxx";
Total MapReduce CPU Time Spent: 2 seconds 230 msec
OK
{"device_info":{"model":"HTC Desire"},"session":null}
Time taken: 30.117 seconds
But when I run with a nonexisting id I do not get any matches.
hive> select payload.params from fresspayload where payload.params.session.id.value = "non_existing";
MapReduce Total cumulative CPU time: 1 seconds 910 msec
Ended Job = job_201209041503_0096
MapReduce Jobs Launched:
Job 0: Map: 1  Accumulative CPU: 1.91 sec  HDFS Read: 0  HDFS Write: 0  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 910 msec
OK
Since the WHERE clause matches, something must be working... do you have any input on this?
Kind regards /Johan Rask
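[Editor's note: one way to confirm that the null is introduced at query/serialization time rather than in the file is to read the same record with plain Python; the record below is the one-line form of the document quoted in the issue.]

```python
import json

line = ('{"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188",'
        '"params":{"device_info":{"model":"HTC Desire","device_id":"xxxx"},'
        '"session":{"id":{"type":"msisdn","value":"xxxxx"},"type":"app"}}}}')
rec = json.loads(line)

# The nested value the WHERE clause matches on is present in the raw data,
# even though Hive returns "session":null in the projection:
print(rec["payload"]["params"]["session"]["id"]["value"])  # → xxxxx
```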