rcongiu / Hive-JSON-Serde

Read - Write JSON SerDe for Apache Hive.
Other
732 stars 390 forks source link

Where clause matches but result is null #10

Open jrask opened 11 years ago

jrask commented 11 years ago

Hi, I am using this project to see if I can use it to query many large nested documents. So far I have only tried with small documents and I have a question.

Schema

create table fresspayload ( payload struct<id:string,jsonrpc:string, params:struct< device_info:struct, session:struct< id:struct<type:string,value:string

) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;

Json document:

{ "payload": { "jsonrpc": "2.0", "id": "0cd0f2f8-31b6-4038-9864-1bcc01dee188", "params": { "device_info": { "model": "HTC Desire", "device_id": "xxxx" }, "session": { "id": { "type": "msisdn", "value": "xxxxx" }, "type": "app" }, "received_at": "2012-08-28T00:07:21.858Z" }, "method": "connect" }, "protocol_version": "1.1", "route": { "source": { "id": { "type": "msisdn", "value": "xxxx" }, "type": "app" } } }

Once imported I run:

hive> select payload.params from fresspayload; Total MapReduce CPU Time Spent: 2 seconds 70 msec OK {"device_info":{"model":"HTC Desire"},"session":null} Time taken: 30.232 seconds

It seems like "session" is null

But when I write a Where clause with the session id it returns a hit, see below. hive> select payload.params from fresspayload where payload.params.session.id.value == "xxxx"; Total MapReduce CPU Time Spent: 2 seconds 230 msec OK {"device_info":{"model":"HTC Desire"},"session":null} Time taken: 30.117 seconds

But when I run with nonexisting id I do not get any matches. hive> select payload.params from fresspayload where payload.params.session.id.value = "non_existing"; MapReduce Total cumulative CPU time: 1 seconds 910 msec Ended Job = job_201209041503_0096 MapReduce Jobs Launched: Job 0: Map: 1 Accumulative CPU: 1.91 sec HDFS Read: 0 HDFS Write: 0 SUCESS Total MapReduce CPU Time Spent: 1 seconds 910 msec OK

Since the WHERE clause matches there must be something right with this... do you have any input concerning this?

Kind regards /Johan Rask

ChuckConnell commented 11 years ago

Jrask -- I am also working with this SerDe, and would very much like to get it running. There is a typo in your post though. You said

But when I write hive> select payload.params from fresspayload where payload.params.session.id.value == "xxxx";

but you did not say what happens in this case...??

One issue to keep in mind is that there is not (I am fairly sure) a one-to-one correpondence between JSON datatypes and Hive datatypes. So it is sometimes tricky to match types.

ChuckConnell commented 11 years ago

Jrask -- I am frustrated trying to get this Serde to run at all. I know many other people have though, so I must have some small error in my setup. Please tell me:

Thank you, Chuck

jrask commented 11 years ago

Maybe I was unclear, I have updated the original issue.

I have built the jar myself from trunk I am running cdh4.0.1 so it seems like it is hive-0.8.1 I am using MRv1, not YARN.

rcongiu commented 11 years ago

Does your JSON data actually span multiple lines ? Every JSON record should be in a single line.

jrask commented 11 years ago

No, it was just for readability I pasted it in formatted. It is a single line (except that the formatting disappeared....)

ChuckConnell commented 11 years ago

Jrask -- The following is valid JSON, but there is no direct correspondence in Hive. Hive arrays must all be of one type.

{ "lname": "smith", "fname": "bob", "mixed-array": [2, 3, {"tag1": "a string" } ] }

jrask commented 11 years ago

Well... I do not have any mixed arrays. My guess is that there might be a problem with nested structs? I have also seen hive errors in the tasktracker logs depending on how I create my tables.

But when these errors occur I get an error in the MR job and no result back so this is not really part of this discussion.

2012-09-11 11:23:23,454 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201209041503_0080_m_000000_3 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","jsonrpc":"2.0","params":{"session":{"id":{"type":"msisdn","value":"xxx"},"type":"app"}}}} at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","jsonrpc":"2.0","params":{"session":{"id":{"type":"msisdn","value":"xxx"},"type":"app"}}}} at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143) ... 8 more Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.rangeCheck(ArrayList.java:604) at java.util.ArrayList.get(ArrayList.java:382) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:569) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)

ChuckConnell commented 11 years ago

It all works for me now (thanks to rcongiu straightening out my jar directory). I am using nested structs and an array within a struct. Here is my sample data:

{"lname" : "smith", "fname" : "bob", "array1" : [6,5,4], "obj1" : {"field1":"hello", "field2":"goodbye", "array2":[10,11,12]}} {"lname" : "jones", "fname" : "bob", "array1" : [1,2,3], "obj1" : {"field1":"this", "field2":"that", "array2":[13,14,15]}} {"lname" : "white", "fname" : "bill", "array1" : [9,8,7], "obj1" : {"field1":"yin", "field2":"yang", "array2":[16,17,18]}} {"lname" : "black", "fname" : "betty", "array1" : [10,11], "obj1" : {"field1":"bread", "field2":"butter", "array2":[19,21,22]}}

with the following table description:

lname string from deserializer fname string from deserializer array1 array from deserializer obj1 struct<field1:string, field2:string, array2:array> from deserializer

I can execute the query:

select fname,lname from t4 where obj1.array2[1]=17;

with the correct result:

bill white

ChuckConnell commented 11 years ago

(The table description above gets broken when reading the comment. It is correct in edit mode.)

jrask commented 11 years ago

Any tips on how I can get my example to work?

rcongiu commented 11 years ago

This worked for me (didn't reproduce the whole structure but should get you started):

Create table testjson2( payload struct <jsonrpc:string, id:string, params:struct< device_info:struct<model:string, device_id:string>

) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;

hive> select payload.jsonrpc, payload.params.device_info.model from testjson2;

OK 2.0 HTC Desire

jrask commented 11 years ago

Thanks, but Is that really the correct CREATE statement? There is no model inte the table but in the select and it ends with two > but no < ?

rcongiu commented 11 years ago

Create table testjson2( payload struct <jsonrpc:string, id:string, params:struct< device_info:struct<model:string, device_id:string>

) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;

rcongiu commented 11 years ago

mmm... github filters the > and <

rcongiu commented 11 years ago

 Create table testjson2(  payload struct <jsonrpc:string,
                                          id:string,
                                          params:struct<
                                                device_info:struct<model:string, device_id:string>
                                                >
                                           > )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
ChuckConnell commented 11 years ago

I would simplify, and build it up in stages. Strip out most of the structures, get to a small sample that works, then gradually add more parts.

From: jrask [mailto:notifications@github.com] Sent: Tuesday, September 11, 2012 3:48 PM To: rcongiu/Hive-JSON-Serde Cc: Connell, Chuck Subject: Re: [Hive-JSON-Serde] Where clause matches but result is null (#10)

Any tips on how I can get my example to work?

— Reply to this email directly or view it on GitHubhttps://github.com/rcongiu/Hive-JSON-Serde/issues/10#issuecomment-8470275.

jrask commented 11 years ago

Trust me, I have broken it down many times, and each time it fails when I try to add session. I have also tried with part of the session and even renaming session and type in case they where reserved words.

BUT, I checked my original post and the sql is f*cked up. This is the real create statement.

create table payload ( 
     payload struct<id:string,jsonrpc:string,
         params:
               struct<device_info:
                     struct<model:string>,
               session:
                     struct<id:
                          struct<type:string,value:string>>>>) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
jrask commented 11 years ago

Hi again rcongiu,

Not sure if there is anything wrong with my installation but when I create the table as you suggest I get the error that I presented above when I do "select payload.params from testjson2".

Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.rangeCheck(ArrayList.java:604) at java.util.ArrayList.get(ArrayList.java:382) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)

But if I remove the device_id part and only use the parts below it works as expected. (It also works if I use only device_id instead of model) Can you succesfully do the same select statement?

Create table testjson2(  payload struct <jsonrpc:string,
                                          id:string,
                                          params:struct<
                                                device_info:struct<model:string>
                                                >
                                           > )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
jrask commented 11 years ago

Ok, I have done some more tests and verified two problems (in my environment at least):

1: I am unable to have more than one value inside a struct

OK

 struct_name:struct<name:type>

MR job fails, stacktrace in tasktracker logs

 struct_name:struct<name:type,name2,type>

2012-09-12 08:17:23,241 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201209041503_0109_m_000000_2 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}}
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
    ... 8 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:569)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
    ... 9 more

2: I am unable to succesfully use more than one struct inside a struct In my earlier examples, I can successfully replace device_id with session so the error occurs when there is more than one struct inside the params struct.

What happens is that it seems to work properly when searching WHERE session.id = but the result is not returned in the response.

There is no problem creating the tables or inserting the data, the problems occurs when you are doing the queries, select payload from testjson2

Have you verified that these two scenarios works for you. I really want to know if there is something wrong with my setup or if there is a bug somewhere...

ChuckConnell commented 11 years ago

I will test both of these cases. It is important to me to have faith in this solution, so I don't mind doing some more testing. I will try to get to this today.

Chuck


From: jrask [notifications@github.com] Sent: Wednesday, September 12, 2012 4:34 AM To: rcongiu/Hive-JSON-Serde Cc: Connell, Chuck Subject: Re: [Hive-JSON-Serde] Where clause matches but result is null (#10)

Ok, I have done some more tests and verified two problems (in my environment at least):

1: I am unable to have more than one value inside a struct

OK

struct_name:structname:type

MR job fails, stacktrace in tasktracker logs

struct_name:structname:type,name2,type

2012-09-12 08:17:23,241 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201209041503_0109_m_000000_2 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}} at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"payload":{"jsonrpc":"2.0","id":"0cd0f2f8-31b6-4038-9864-1bcc01dee188","params":{"session":{"id":"someid","value":"somevalue"}}}} at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550) at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143) ... 8 more Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 at java.util.ArrayList.rangeCheck(ArrayList.java:604) at java.util.ArrayList.get(ArrayList.java:382) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:569) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531) ... 9 more

2: I am unable to succesfully use more than one struct inside a struct In my earlier examples, I can successfully replace device_id with session so the error occurs when there is more than one struct inside the params struct.

What happens is that it seems to work properly when searching WHERE session.id = but the result is not returned in the response.

There is no problem creating the tables or inserting the data, the problems occurs when you are doing the queries, select payload from testjson2

Have you verified that these two scenarios works for you. I really want to know if there is something wrong with my setup or if there is a bug somewhere...

— Reply to this email directly or view it on GitHubhttps://github.com/rcongiu/Hive-JSON-Serde/issues/10#issuecomment-8485775.

ChuckConnell commented 11 years ago

Dumb question... what trick are you guys using to put a literal block in the post, so github does not strip angle brackets?

jrask commented 11 years ago

Thanks! in your comment box, on the right you have "Comments are parsed with GutHub Flavoured markdown", there you can find all you need. The preview (next to Write) above starts coming in handy.

ChuckConnell commented 11 years ago

I wrote an expanded test, with nested structures. It works correctly. Details...

My JSON file is structs.json

{"lname":"smith", "fname":"bob", "address":{"street":"55 maple", "city":"boston", "state":"ma"}, "contact":{"phones":{"home":"123-555-1234", "office":"123-555-7890"}, "emails":{"home":"bob@home.org", "office":"bob@office.org"}}}
{"lname":"jones", "fname":"bob", "address":{"street":"44 maple", "city":"medford", "state":"ma"}, "contact":{"phones":{"home":"123-444-1234", "office":"123-444-7890"}, "emails":{"home":"jones@home.org", "office":"jones@office.org"}}}
{"lname":"greene", "fname":"bill", "address":{"street":"33 maple", "city":"revere", "state":"ma"}, "contact":{"phones":{"home":"123-333-1234", "office":"123-333-7890"}, "emails":{"home":"bill@home.org", "office":"bill@office.org"}}}

My Hive script is structs.hive

drop table t9;

create table t9
(lname string, 
fname string,
address struct <street:string, city:string, state:string>,
contact struct <
    phones:struct <home:string, office:string>,
    emails:struct <home:string, office:string>
    >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile;

load data inpath '/tmp/structs.json' overwrite into table t9;

describe t9;

select address.street from t9 where fname='bob';

select fname, lname, contact.phones.office from t9 where address.state='ma';

select fname,lname from t9 where contact.emails.home='jones@home.org';

I started Hive with the command line:

hive --auxpath /usr/lib/hive/extra_libs -f structs.hive >structs.out

The output is:

Deleted /user/hive/warehouse/t9

lname   string  from deserializer
fname   string  from deserializer
address struct<street:string,city:string,state:string>  from deserializer
contact struct<phones:struct<home:string,office:string>,emails:struct<home:string,office:string>>   from deserializer

55 maple
44 maple

bob smith   123-555-7890
bob jones   123-444-7890
bill    greene  123-333-7890

bob jones

which is correct.

jrask commented 11 years ago

Thanks! Your example work fine but I am still unable to run mine. I will try to use you example and slowly convert it to my structure. I will get back with my results.

ChuckConnell commented 11 years ago

I found that writing the Hive table definition for nested-nested structures was tricky. For plain JSON it is easy, you can just use a JSON viewer to play with the file until it is correct. But afaik there is no convenient "Hive table syntax checker" to help you with the table define.


From: jrask [notifications@github.com] Sent: Thursday, September 13, 2012 2:48 AM To: rcongiu/Hive-JSON-Serde Cc: Connell, Chuck Subject: Re: [Hive-JSON-Serde] Where clause matches but result is null (#10)

Thanks! Your example work fine but I am still unable to run mine. I will try to use you example and slowly convert it to my structure. I will get back with my results.

— Reply to this email directly or view it on GitHubhttps://github.com/rcongiu/Hive-JSON-Serde/issues/10#issuecomment-8519631.