jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

How to read binary encoded data #31

Closed vkuznet closed 8 years ago

vkuznet commented 8 years ago

Hi, we're trying to use Hive over external data that we store on HDFS. The data are in Avro format, but their content was binary encoded. We used Python software to do that, and it happily provides a way to encode our data in binary format. Now, when I set up a Hive table and pointed it at our schema, it complains with the following exception:

Failed with exception java.io.IOException:java.io.IOException: Not a data file.
16/02/02 17:53:39 [main]: ERROR CliDriver: Failed with exception java.io.IOException:java.io.IOException: Not a data file.
java.io.IOException: java.io.IOException: Not a data file.
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:756)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Not a data file.
        at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
        at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
        at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:81)
        at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
        at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:667)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:323)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:445)
        ... 15 more

If I read the exception right, it tries to use the AvroGenericRecordReader class to read the data. Therefore I want to know whether it is possible to instruct Hive to use a binary Avro encoder to read data stored in binary Avro format. I'm not a Java expert and would appreciate as much detail as possible, so that we can think about how to address this problem. Thanks, Valentin.

jghoman commented 8 years ago

Not sure I understand. Haivvreo expects the data to be binary encoded (as opposed to JSON encoded), so it should already be doing what you're expecting. What file exactly is this trying to read? Also, keep in mind that Haivvreo's pretty old and all bug fixes are now in Hive's AvroSerde.

vkuznet commented 8 years ago

Jakob, forgive me if I ask stupid questions. I'm new to all these tools. We used the Python avro module, which provides binary encoding. We write our files using this module, and the files in Avro format are stored in HDFS. Now we are trying to use a Hive external table to query our files. This is where I got the exception. The table is created using AvroSerDe, and yet we got the exception. Valentin.

jghoman commented 8 years ago

Nope, not stupid at all, just trying to understand. What you're describing should work. What DDL did you use to create the table and what files are in the directories? This exception is what one would get if there were non-Avro encoded files in the directory.
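For context, an Avro object container file begins with the four magic bytes `Obj` followed by the byte 1, and the `DataFileStream.initialize` frame in the stack trace appears to be where that header is checked. A minimal sketch of the same check, in Python with only the standard library; the function name `looks_like_avro_container` is hypothetical, and the files are assumed to have been copied to local disk first:

```python
# Magic bytes that open every Avro object container file: "Obj" plus version byte 1.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(path):
    """Return True if the file at `path` starts with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC


# Example use (the local paths are placeholders):
# for p in ["1.avro", "2.avro"]:
#     print(p, looks_like_avro_container(p))
```

A file that fails this check was not written as a container file at all (or was damaged at its very start); a file that passes can still be corrupted further in.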

vkuznet commented 8 years ago

OK, here is how I created the table:

create database cms_wm_archive location '/user/valya/test/hive/arch.db';
use cms_wm_archive;

create external table wmarchive
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe"
STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"
location "/user/valya/test/hive/data/"
TBLPROPERTIES("avro.schema.url"="hdfs:///user/valya/test/hive/fwjr_processing.avsc");

and my HDFS content is the following:

hadoop fs -ls /user/valya/test/hive/
Found 3 items
drwxr-xr-x   - valya supergroup          0 2016-02-02 16:49 /user/valya/test/hive/arch.db
drwxr-xr-x   - valya supergroup          0 2016-02-02 20:15 /user/valya/test/hive/data
-rw-r--r--   3 valya supergroup     114367 2016-01-25 19:51 /user/valya/test/hive/fwjr_processing.avsc

The data directory contains Avro files:

hadoop fs -ls /user/valya/test/hive/data
Found 2 items
-rw-r--r--   3 valya supergroup       7179 2016-02-02 20:15 /user/valya/test/hive/data/1.avro
-rw-r--r--   3 valya supergroup       7179 2016-02-02 20:15 /user/valya/test/hive/data/2.avro

jghoman commented 8 years ago

That all looks fine. The last thing to check is if your avro files are correct. Can you try reading them directly through Java, just to make sure they were encoded correctly? Here's a quick sample on how to do that.
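One way to run a similar sanity check without Java is to parse the container header by hand. The sketch below is Python, standard library only, and follows the Avro object container layout (magic bytes, a metadata map of string keys to byte values, then a 16-byte sync marker). The function names are illustrative rather than part of any Avro library, and it validates only the header, not the data blocks.

```python
AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long: a zigzag-encoded, variable-length integer."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("unexpected end of file while reading a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then that many bytes."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length in header")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated value in header")
    return data


def read_container_header(stream):
    """Return (metadata dict, sync marker); raise ValueError on a bad header."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("missing Avro magic bytes; not a container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    if len(sync) != 16:
        raise ValueError("truncated header: sync marker missing")
    return meta, sync
```

Calling `read_container_header(open(path, "rb"))` on a local copy of a file should yield an `avro.schema` entry in the metadata if the header is intact; a `ValueError` points at a file that is not a well-formed container.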

vkuznet commented 8 years ago

Jakob, thanks for the pointers. I was able to identify the problem: it turned out to be corrupted Avro files, and indeed reading them from Java helped to see that. I'm closing the ticket now, since everything is working. Best, Valentin.

jghoman commented 8 years ago

Awesome! Glad to help.