jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

AvroSerdeException when importing data to avro table #28

Open ramasLTU opened 11 years ago

ramasLTU commented 11 years ago

Hi, I am having problems with a simple task:

  1. Create a Hive table (stored as a text file compressed with bz2).
  2. Import that table into a partitioned (and compressed) Avro table.

Here is a short tale of me hitting the wall. Maybe you can identify which turn I missed.

  1. Create the text table:

CREATE EXTERNAL TABLE sample (
  number int,
  text string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/root/sample/';

  2. Create a text file with a couple of rows, one record per line, with the number and the text separated by a tab:

1 row1
2 row2
3 row3

  3. Compress that file with bz2, upload it, and check that the table returns rows when selected: SELECT * FROM sample; This works like a charm for me.
  4. Create a partitioned Avro table:

CREATE TABLE sample_avro
PARTITIONED BY (number int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "namespace": "my.sample",
    "name": "sample_avro",
    "type": "record",
    "fields": [ { "name": "text", "type": "string" } ]
  }'
);

  5. Import the data into the table:

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
INSERT INTO TABLE sample_avro PARTITION (number) SELECT text, number FROM sample;

This is the moment when bad things happen. In the log I can see:

13/06/04 09:52:28 INFO exec.MoveTask: Partition is: {number=null}
13/06/04 09:52:28 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem
org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema
    at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:66)
    at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrReturnErrorSchema(AvroSerdeUtils.java:87)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:59)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:249)
    at org.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:251)
    at org.apache.hadoop.hive.ql.metadata.Partition.initialize(Partition.java:217)
    at org.apache.hadoop.hive.ql.metadata.Partition.<init>(Partition.java:107)
    at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1500)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1195)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1271)
    at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:259)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1374)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1160)
    at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState.execute(BeeswaxServiceImpl.java:344)
    at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState$1$1.run(BeeswaxServiceImpl.java:609)
    at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState$1$1.run(BeeswaxServiceImpl.java:598)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:337)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1388)
    at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState$1.run(BeeswaxServiceImpl.java:598)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

When selecting from sample_avro I get a similar exception.
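
For anyone who wants to reproduce the setup, the sample input file from the steps above could be generated roughly as follows. This is only a minimal sketch: the helper names `write_sample` and `read_sample` and the file name `sample.txt.bz2` are made up for illustration, and the tab delimiter is taken from the `FIELDS TERMINATED BY '\t'` clause of the text table.

```python
import bz2
import os
import tempfile


def write_sample(path, rows):
    """Write (number, text) rows as tab-separated lines into a bz2 file."""
    # newline="\n" keeps line endings consistent with LINES TERMINATED BY '\n'.
    with bz2.open(path, "wt", encoding="utf-8", newline="\n") as f:
        for number, text in rows:
            f.write(f"{number}\t{text}\n")


def read_sample(path):
    """Read the compressed file back and split each line on the tab."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f]


# Hypothetical local path; the file still has to be uploaded to HDFS afterwards.
rows = [(1, "row1"), (2, "row2"), (3, "row3")]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.bz2")
write_sample(sample_path, rows)
print(read_sample(sample_path))
```

The resulting file would then be uploaded to the table location (for example with `hadoop fs -put`) before running the SELECT check.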

wagnermarkd commented 11 years ago

Hi,

I assume you're running Hive 0.11. Unfortunately, there's a bug in 0.11 that breaks partitioned Avro tables: the table properties for the schema don't get passed to the SerDe properly, which leads to the exception about the missing avro.schema.* properties. There's a JIRA open to track the issue: https://issues.apache.org/jira/browse/HIVE-3953. This will be fixed in the next release.

Thanks, Mark
