infochimps-labs / wonderdog

Bulk loading for elastic search
http://infochimps.com
Apache License 2.0

Wonderdog fails in Pig 0.10? #6

Closed rjurney closed 11 years ago

rjurney commented 12 years ago

---------- Forwarded message ----------
From: Russell Jurney russell.jurney@gmail.com
Date: Fri, Jun 22, 2012 at 4:05 PM
Subject: Weird problem in Pig 0.10 with STOR'ing JSON and then LOADing it as PigStorage chararray
To: user@pig.apache.org

This script has worked in the past:

/* Load Avro emails */
emails = load '/me/tmp/emails_big' using AvroStorage();
emails = filter emails by message_id IS NOT NULL;

/* JSONify the emails for ElasticSearch */
store emails into '/tmp/emails.json' using JsonStorage();

/* LOAD JSON as single field for storage in ElasticSearch with Wonderdog */
json_emails = load '/tmp/emails.json' using PigStorage() AS (json_record:chararray);
store json_emails into 'es://email/email?id=message_id&json=true&size=1000' using ElasticSearch();

Now I get this error:

grunt> json_emails = load '/tmp/emails.json' AS (json_record:chararray);
2012-06-22 15:45:34,136 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "json_record:chararray", right is "message_id:chararray,thread_id:chararray,in_reply_to:chararray,subject:chararray,body:chararray,date:chararray,froms:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},ccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},bccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},reply_tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)}"
2012-06-22 15:45:34,136 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "json_record:chararray", right is "message_id:chararray,thread_id:chararray,in_reply_to:chararray,subject:chararray,body:chararray,date:chararray,froms:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},ccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},bccs:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)},reply_tos:bag{ARRAY_ELEM:tuple(real_name:chararray,address:chararray)}"
    at org.apache.pig.newplan.logical.relational.LogicalSchema.merge(LogicalSchema.java:760)
    at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:114)
    at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
    at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:219)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
    at org.apache.pig.PigServer$Graph.compile(PigServer.java:1635)
    at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1566)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:490)
    at org.apache.pig.Main.main(Main.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I tried copying the file from /tmp/emails.json to /tmp/json_emails and loading it from there, but that didn't work. I tried calling PigStorage() explicitly, and that didn't work either.

How am I supposed to pull this off?

I figured it out:

grunt> rm /tmp/emails.json/.pig_header
grunt> rm /tmp/emails.json/.pig_schema

Then I can load my JSON as chararray. Interesting problem.
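The error above suggests that PigStorage picked up the schema stored next to the data and tried to merge it with the AS clause. A possible alternative to deleting the hidden files, sketched here as an assumption and not tested against this setup, is to ask PigStorage to ignore the stored schema with its '-noschema' option. The tab delimiter is an assumption too; it relies on the JSON lines containing no raw tab characters.

```pig
/* Sketch only: load each JSON line as a single chararray while leaving
   .pig_schema in place. Assumes this Pig version supports '-noschema'
   and that no JSON record contains a raw tab character. */
json_emails = LOAD '/tmp/emails.json'
              USING PigStorage('\t', '-noschema')
              AS (json_record:chararray);
```

If the option is not available in the Pig version in use, removing the two hidden files as shown above remains the workaround from this thread.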

Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

rjurney commented 12 years ago

Fixed by https://github.com/infochimps-labs/wonderdog/pull/8