jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

Schema mismatch when running INSERT OVERWRITE from one Avro-backed table to another #14

Open · elibingham opened this issue 12 years ago

elibingham commented 12 years ago

Hi there,

I have two tables defined roughly as:

CREATE EXTERNAL TABLE a
  ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES ('schema.url'='hdfs://schema/a.avsc')
  STORED AS
    INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
  LOCATION '/foo/a';

CREATE EXTERNAL TABLE b
  ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
  WITH SERDEPROPERTIES ('schema.literal'='{ ... }')
  STORED AS
    INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
    OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
  LOCATION '/foo/b';

When I do:

INSERT OVERWRITE TABLE b SELECT a, b, c FROM a;

(where b's schema is equivalent to the selected columns a, b, c)

This fails with an AvroTypeException like this:

2012-02-21 11:31:58,587 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201202091933_0059_m_0000001 :
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:520)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
    ... 8 more
Caused by: org.apache.avro.AvroTypeException: Found { "type" : "record", "name" : "b", "fields" : [ { … /* schema of b */ … } ] }, expecting { "type" : "record", "name" : "a", "fields" : [ { … /* schema of a */ … } ], ... }
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at com.linkedin.haivvreo.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:79)
    at com.linkedin.haivvreo.AvroDeserializer.deserialize(AvroDeserializer.java:121)
    at com.linkedin.haivvreo.AvroSerDe.deserialize(AvroSerDe.java:81)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:504)
    ... 9 more

Any help would be appreciated.

jghoman commented 12 years ago

Which version of haivvreo are you running? Is it against the Avro 1.4 or 1.6 branch?

elibingham commented 12 years ago

Jakob,

Hive 0.7.1, Haivvreo 1.0.7, Avro 1.5.4 (CDH3)

Thanks for any assistance.

-Eli

jghoman commented 12 years ago

You're most likely running into an issue that was fixed on the Avro 1.4 branch: eca88ef764a1c48f7a50c935c59269ff645cab9c. I'm planning on merging these changes to trunk this weekend.

elibingham commented 12 years ago

Jakob,

Ah, I see that branch is 7 changes ahead of master. Thanks for the heads-up. Can you let me know when the changes are merged so I can try out the new code?

Also, will master work with Avro 1.5.4 or will I need to move to 1.6.x?

-Eli

jghoman commented 12 years ago

It should work with 1.6. That's one of the things I'm checking. The other commits are minor. The plan is to merge this code into the Hive source, since I don't have enough time to really care for this code right now.

elibingham commented 12 years ago

Any idea when this fix will be pushed? I'm getting bitten by this bug in another context (can't join together two tables).

elibingham commented 12 years ago

FYI, I cloned the repo, merged the 1.4 branch into master, and built it myself to test out this patch. The problem still seems to exist on seemingly simple joins from a to b with non-matching schemas.

thirstycrow commented 12 years ago

I ran into the same problem and solved it by commenting out the following line in AvroGenericRecordReader.java, which sets the target table's schema as the reader schema of the GenericDatumReader:

if(latest != null) gdr.setExpected(latest);
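
For context, the surrounding reader setup looks roughly like the sketch below (a paraphrase, not the exact haivvreo source; the open method and its signature are illustrative). The workaround leaves everything in place except the setExpected call:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

class ReaderSetupSketch {
  // 'latest' is the table's declared schema, resolved elsewhere.
  DataFileReader<GenericRecord> open(JobConf job, FileSplit split, Schema latest)
      throws IOException {
    GenericDatumReader<GenericRecord> gdr = new GenericDatumReader<GenericRecord>();

    // The disabled line: forcing the table's schema onto the reader is what
    // lets a split from one table get decoded against another table's schema.
    // if (latest != null) gdr.setExpected(latest);

    // Without setExpected, each file is decoded with the writer schema stored
    // in its own header; evolution is handled later by the SchemaReEncoder.
    return new DataFileReader<GenericRecord>(new FsInput(split.getPath(), job), gdr);
  }
}
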
elibingham commented 12 years ago

@thirstycrow Nice one - just tried this fix in my own fork at https://github.com/elibingham/haivvreo/commit/134c68914e4cd3fa8e47a48b3d0dc5bc609527a4 - worked like a charm.

cwsteinbach commented 12 years ago

@jghoman: is this patch ready to go in or are there some remaining issues?

elibingham commented 12 years ago

@cwsteinbach If you'd like, you can clone my fork and build from there - it includes both @jghoman's and @thirstycrow's patches in master, and I've been using it in my development environment for a while without problems.

jghoman commented 12 years ago

The remaining issue is that I've not seen this in our clusters, so I want to look into it further, but haven't had time. I've been promised time next week to catch up on Haivvreo issues (including the port to ASF), so I will likely take a look then.

jghoman commented 12 years ago

OK, I finally have some time to focus on Haivvreo. The issue at hand: that line was added as part of 'be smart about reading in the latest version of the datum so we don't have to re-encode later.' Removing it means that if there are any schema-evolution changes, we'll have to (try to) re-encode. But you're right that it's getting picked up during writing as well, which is a regression. Now the trick is to figure out how to avoid the re-encode without screwing up writing. Just commenting out this line isn't enough, because the benefits of avoiding re-encoding are significant. Will take a look.
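
One conceivable shape for such a fix (purely a sketch of the idea, not the actual commit) is to keep the optimization only when the table's schema describes the same record the file was written with:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

class ExpectedSchemaGuard {
  // Hypothetical guard: keep the re-encode-avoiding optimization for genuine
  // reads of this table, but fall back to the writer's schema when the split
  // was written as a different record (e.g. the other table in a join).
  static void maybeSetExpected(GenericDatumReader<GenericRecord> gdr,
                               Schema latest, Schema writer) {
    if (latest != null && latest.getFullName().equals(writer.getFullName())) {
      gdr.setExpected(latest);
    }
  }
}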

jghoman commented 12 years ago

@elibingham I'm not sure why you were seeing this on Hive 0.7. We just went to Hive 0.8 and saw it. In 0.8 at least, Hive has changed what it passes around in the properties; the code thought it was in a non-MR job all the time and loaded the schema from there, which is what was causing the failure. I've committed a change (5407836a551540df13a94f50d357d741c8ddfd89) that should fix this. I've verified it against our 0.8.1 and 0.7.1 clusters. I'll try it against the insert-overwrite scenario you originally described today as well and report back.
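
The kind of check at issue looks roughly like this (an illustration only; whether the fix actually consults mapred.task.id is my assumption, not confirmed by the commit):

import org.apache.hadoop.mapred.JobConf;

class MrJobCheckSketch {
  // "mapred.task.id" is only populated inside real map/reduce task attempts,
  // so its absence suggests an in-process query (e.g. a simple "select *"
  // served directly by the CLI) where the job-conf schema is safe to trust.
  static boolean insideMrTask(JobConf job) {
    return job.get("mapred.task.id") != null;
  }
}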

I've also turned the logging when we re-encode up from info to warning, because at this point there really is no reason we should be doing so. If we do, I'd like to know about it so we can fix it.

Take it for a spin and see if it works for you. And again, sorry for the delay in being able to look at this.

elibingham commented 12 years ago

Thanks for looking into this @jghoman. I'll give it a whirl, but I'm fairly certain it's not going to work on our cluster, which is Hive 0.7.1-cdh3u2 based. Possibly the CDH3 version of 0.7.1 has some unexpected patches in it? Anyway, the commit only appears to be in the hive8 branch - do you expect that this jar will work with a 0.7.1 cluster? Thanks!

elibingham commented 12 years ago

Oh, also, when building the HEAD of the hive8 branch, I got a unit test failure:

Results :

Tests in error:
  determineSchemaCanReadSchemaFromHDFS(com.linkedin.haivvreo.TestHaivvreoUtils): java.io.IOException: java.lang.NoClassDefFoundError: javax/ws/rs/core/StreamingOutput

Tests run: 56, Failures: 0, Errors: 1, Skipped: 0

jghoman commented 12 years ago

It's also in the avro14 branch: https://github.com/jghoman/haivvreo/commits/avro14. I merged it to the hive8 branch as well, and tested it against our Hive 0.7.1 cluster today too. I'll check on that unit test failure.

jghoman commented 12 years ago

I fixed the test failure on the hive8 branch. For some reason hadoop 1.0.2 has a dependency on jersey that's not included in its published POM.

jghoman commented 12 years ago

@elibingham With the latest commits (bringing the avro14 branch up to 1.0.10), I've got both our 0.7.1 and 0.8.1 clusters correctly running an insert overwrite equivalent to your query. Definitely check it out and see if it works for you. Tomorrow I'll be merging the master and 1.4 branches (or starting to...) so we don't have this extra layer of annoyance going on.

elibingham commented 12 years ago

@jghoman Great. Our cluster is on Avro 1.6, so I think I need to wait until you merge to master, which is Avro 1.5/1.6 based?

jghoman commented 12 years ago

@elibingham I've pushed the changes to master, but I don't have a cluster with Avro 1.5 to test it on, so you'll have to be the guinea pig...

elibingham commented 12 years ago

@jghoman I built and tested with the HEAD of master and I'm still seeing the problem (vs. with my patched jar). I've sanitized my CLI input and the Hadoop logs, and hopefully this will give you a better idea of what's going on. The interesting part to me in the Hadoop logs is that it looks like it's deserializing data from BAR but expecting the schema of FOO in the actual job. I've obscured some of the underlying data details and left out some irrelevant parts of the logs; let me know if you need anything else.

Sanitized Hive CLI input:

hive> select a.i, a.j, a.k, a.l, a.m, b.n
    > from foo a join bar b on a.k = b.k limit 100;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202091933_0242, Tracking URL = http://namenode:50030/jobdetails.jsp?jobid=job_201202091933_0242
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=namenode:8021 -kill job_201202091933_0242
2012-04-24 15:23:29,258 Stage-1 map = 0%, reduce = 0%
2012-04-24 15:23:39,306 Stage-1 map = 1%, reduce = 0%
2012-04-24 15:23:45,336 Stage-1 map = 4%, reduce = 0%
2012-04-24 15:23:48,353 Stage-1 map = 6%, reduce = 0%
2012-04-24 15:23:50,364 Stage-1 map = 7%, reduce = 0%
2012-04-24 15:23:51,369 Stage-1 map = 8%, reduce = 0%
2012-04-24 15:23:53,380 Stage-1 map = 9%, reduce = 0%
2012-04-24 15:23:54,385 Stage-1 map = 12%, reduce = 0%
2012-04-24 15:23:57,398 Stage-1 map = 14%, reduce = 0%
2012-04-24 15:23:59,409 Stage-1 map = 15%, reduce = 0%
2012-04-24 15:24:00,418 Stage-1 map = 17%, reduce = 0%
2012-04-24 15:24:03,431 Stage-1 map = 19%, reduce = 0%
2012-04-24 15:24:05,441 Stage-1 map = 20%, reduce = 0%
2012-04-24 15:24:06,447 Stage-1 map = 21%, reduce = 0%
2012-04-24 15:24:09,463 Stage-1 map = 24%, reduce = 0%
2012-04-24 15:24:11,473 Stage-1 map = 25%, reduce = 0%
2012-04-24 15:24:12,478 Stage-1 map = 27%, reduce = 0%
2012-04-24 15:24:14,489 Stage-1 map = 28%, reduce = 0%
2012-04-24 15:24:15,494 Stage-1 map = 29%, reduce = 0%
2012-04-24 15:24:18,507 Stage-1 map = 30%, reduce = 0%
2012-04-24 15:24:20,516 Stage-1 map = 31%, reduce = 0%
2012-04-24 15:24:21,521 Stage-1 map = 32%, reduce = 2%
2012-04-24 15:24:23,530 Stage-1 map = 33%, reduce = 2%
2012-04-24 15:24:24,535 Stage-1 map = 34%, reduce = 4%
2012-04-24 15:24:27,550 Stage-1 map = 35%, reduce = 4%
2012-04-24 15:24:29,560 Stage-1 map = 36%, reduce = 4%
2012-04-24 15:24:32,575 Stage-1 map = 37%, reduce = 4%
2012-04-24 15:24:33,581 Stage-1 map = 38%, reduce = 4%
2012-04-24 15:24:35,590 Stage-1 map = 39%, reduce = 4%
2012-04-24 15:24:36,595 Stage-1 map = 40%, reduce = 4%
2012-04-24 15:24:38,603 Stage-1 map = 41%, reduce = 4%
2012-04-24 15:24:41,617 Stage-1 map = 43%, reduce = 4%
2012-04-24 15:24:44,631 Stage-1 map = 44%, reduce = 4%
2012-04-24 15:24:45,635 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201202091933_0242 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
hive>

Sanitized Hadoop log from job:

2012-04-24 16:08:09,962 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[..]
2012-04-24 16:08:10,085 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-04-24 16:08:10,477 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/bar/20120221.avro.snappy:67108864+67108864 with a partition.
2012-04-24 16:08:10,706 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2
2012-04-24 16:08:10,712 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2012-04-24 16:08:10,760 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2012-04-24 16:08:10,760 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2012-04-24 16:08:10,766 INFO ExecMapper: maximum memory = 200998912
[..]
2012-04-24 16:08:10,881 INFO ExecMapper:
Id =8 Id =2 Id =3 Id = 2 null<\Parent> <\RS> <\Children> Id = 8 null<\Parent> <\TS> <\Children> <\MAP>
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Initializing Self 8 MAP
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 0 TS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 0 TS initialized
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing children of 0 TS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing child 1 RS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing Self 1 RS
2012-04-24 16:08:10,884 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Using tag = 0
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Operator 1 RS initialized
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initialization Done 1 RS
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initialization Done 0 TS
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 2 TS initialized
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing children of 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing child 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing Self 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Using tag = 1
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Operator 3 RS initialized
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initialization Done 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initialization Done 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Initialization Done 8 MAP
2012-04-24 16:08:10,936 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing path /data/bar/bar_20120221.avro.snappy
2012-04-24 16:08:10,936 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias b for file hdfs://namenode/data/bar
2012-04-24 16:08:10,940 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 1 rows
2012-04-24 16:08:10,940 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 1 rows
2012-04-24 16:08:10,942 INFO ExecMapper: ExecMapper: processing 1 rows: used memory = 118554320
2012-04-24 16:08:10,958 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 10 rows
2012-04-24 16:08:10,958 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 10 rows
2012-04-24 16:08:10,958 INFO ExecMapper: ExecMapper: processing 10 rows: used memory = 119482744
2012-04-24 16:08:11,103 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 100 rows
2012-04-24 16:08:11,103 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 100 rows
2012-04-24 16:08:11,103 INFO ExecMapper: ExecMapper: processing 100 rows: used memory = 122267936
2012-04-24 16:08:11,490 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 1000 rows
2012-04-24 16:08:11,490 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 1000 rows
2012-04-24 16:08:11,491 INFO ExecMapper: ExecMapper: processing 1000 rows: used memory = 153848656
2012-04-24 16:08:12,369 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 10000 rows
2012-04-24 16:08:12,369 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 10000 rows
2012-04-24 16:08:12,369 INFO ExecMapper: ExecMapper: processing 10000 rows: used memory = 155906600
2012-04-24 16:08:17,641 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 100000 rows
2012-04-24 16:08:17,641 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 100000 rows
2012-04-24 16:08:17,641 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 174357384
2012-04-24 16:08:25,784 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/bar/bar_20120221.avro.snappy:201326592+67108864 with a partition.
2012-04-24 16:08:25,784 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Found the avro schema in the job: { SCHEMA OF FOO }
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarded 237917 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarded 237917 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 Close done
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 Close done
2012-04-24 16:08:25,802 INFO ExecMapper: ExecMapper: processed 237917 rows: used memory = 139110360
2012-04-24 16:08:25,807 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-04-24 16:08:25,811 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.avro.AvroTypeException: Found { SCHEMA OF FOO }, expecting { SCHEMA OF BAR }
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
    at com.linkedin.haivvreo.AvroGenericRecordReader.next(AvroGenericRecordReader.java:119)
    at com.linkedin.haivvreo.AvroGenericRecordReader.next(AvroGenericRecordReader.java:41)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
    at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:193)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-04-24 16:08:25,814 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
jghoman commented 12 years ago

You're getting hit with the Avro re-encoding stuff that's being discussed in issue #8. Try with the latest pushes. Beyond that, it's difficult to debug this. The code can't find a partition to match the schema to (which happens in issue #8) and then proceeds to find the stashed schema in the jobconf, as if it were being run from a select * statement. The correct thing would be for it to match the partition. You're running CDH Hive, right? I have no idea what changes they made, and this bit of code (figuring out whether or not we're in an MR job) changes between versions.
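
The lookup order being described is roughly the following (a sketch; the helper and the property key are placeholders, not the actual haivvreo identifiers):

import org.apache.avro.Schema;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

class SchemaLookupSketch {
  // Placeholder for the partition-matching step; in the failing logs above
  // this finds nothing ("Unable to match filesplit ... with a partition").
  static Schema matchSplitToPartition(FileSplit split, JobConf job) {
    return null;
  }

  static Schema schemaFor(FileSplit split, JobConf job) {
    // Step 1: try to match the file split to one of the table's partitions.
    Schema s = matchSplitToPartition(split, job);
    if (s != null) {
      return s;
    }
    // Step 2, the problematic fallback: use the schema stashed in the job
    // conf. That is right for a non-MR "select *", but during a join it can
    // be the other table's schema. ("avro.stashed.schema" is a made-up key.)
    String literal = job.get("avro.stashed.schema");
    return literal == null ? null : Schema.parse(literal);
  }
}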

IllyaYalovyy commented 12 years ago

Hello,

It looks like this issue is not resolved yet. Our environment is as follows: CDH3U3, avro-1.5.4 and avro-mapred-1.5.4, haivvreo-1.0.7.

We are trying to execute the following query:

insert overwrite table LZ.TABLE1 select * from ETLDATA.TABLE2;

It consistently fails with the following exception:

Caused by: org.apache.avro.AvroRuntimeException: Not a record: ["null",{"type":"record","name":"_struct_d1e13","namespace":"com.expedia.edw.gco.navigator","fields":[{"name":"FirstName","type":["null","string"]},{"name":"FullName","type":["null","string"]},{"name":"TPID","type":["null","int"]},{"name":"TUID","type":["null","int"]}]}]
    at org.apache.avro.Schema.getFields(Schema.java:227)
    at com.linkedin.haivvreo.AvroSerializer.serializeStruct(AvroSerializer.java:104)
    at com.linkedin.haivvreo.AvroSerializer.serialize(AvroSerializer.java:97)
    at com.linkedin.haivvreo.AvroSerializer.serialize(AvroSerializer.java:65)
    at com.linkedin.haivvreo.AvroSerDe.serialize(AvroSerDe.java:75)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:553)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:78)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:527)
    ... 9 more

Both tables point to the same Avro schema.

Is there any way we can fix this? It is a show-stopper for our project.

Thanks, Illya

jghoman commented 12 years ago

Sorry to hear that. I'm out of ideas on my end, since I can't replicate it. The way to avoid it is to turn off the re-encoding optimization that was mentioned before. It means reading individual evolved records will be slower, though. I'm finishing up the Haivvreo->Hive patch as I type, so hopefully someone in the Hive community may be able to reproduce this and we can fix it there (although that may be unnecessary, because the newer version of Hive has table properties, which obviates the need for all the dancing we're doing).

jghoman commented 12 years ago

@IllyaYalovyy Actually, yours is different. It looks like what it's encountering isn't even an Avro record. I'd recommend opening a new issue...
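
For what it's worth, the message itself comes straight from Avro: Schema.getFields() throws when called on a union rather than on the record inside it, which matches the Schema.getFields frame in the trace above. A minimal reproduction (assuming Avro 1.5.x, where Schema.parse is still the current API):

import org.apache.avro.Schema;

public class NotARecordDemo {
  public static void main(String[] args) {
    // A nullable struct is represented in Avro as a union of null and a record.
    Schema union = Schema.parse(
        "[\"null\",{\"type\":\"record\",\"name\":\"S\",\"fields\":"
        + "[{\"name\":\"x\",\"type\":\"string\"}]}]");

    // Calling getFields() on the union itself, instead of on its record
    // branch, throws org.apache.avro.AvroRuntimeException: Not a record: ...
    union.getFields();
  }
}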

IllyaYalovyy commented 12 years ago

Created a new ticket with a detailed description and steps to reproduce.