jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

Forcing hive.output.file.extension=.avro in AvroSerDe can cause problems when hive.exec.compress.output=true #24

Open bewang-tech opened 12 years ago


In AvroSerDe, hive.output.file.extension is forced to ".avro".

if(configuration == null) {
  LOG.info("Configuration null, not inserting schema");
} else {
  // force output files to have a .avro extension
  configuration.set("hive.output.file.extension", ".avro");
  configuration.set(HAIVVREO_SCHEMA, schema.toString(false));
}
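One possible mitigation (a sketch only, not a fix taken from this repo) would be to skip forcing the extension when output compression is enabled, so compressed intermediate files keep a name their codec can be inferred from. In this hypothetical version a plain Map<String,String> stands in for Hadoop's Configuration, and the HAIVVREO_SCHEMA key name is assumed:

```java
import java.util.HashMap;
import java.util.Map;

public class AvroExtensionFix {
    // Stand-in for the real HAIVVREO_SCHEMA constant (assumed key name).
    static final String HAIVVREO_SCHEMA = "haivvreo.schema";

    // Hypothetical variant of the snippet above: only force the .avro
    // extension when hive.exec.compress.output is off, so that compressed
    // text output keeps an extension the reader can map to a codec.
    static void configure(Map<String, String> conf, String schemaJson) {
        boolean compressOut = Boolean.parseBoolean(
                conf.getOrDefault("hive.exec.compress.output", "false"));
        if (!compressOut) {
            conf.put("hive.output.file.extension", ".avro");
        }
        conf.put(HAIVVREO_SCHEMA, schemaJson);
    }
}
```

With compression on, the extension is left alone and only the schema is inserted; with compression off, behavior matches the current code.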

If I query an Avro-backed table, or join an Avro-backed table with a non-Avro table, the result is written as text and uses LazySimpleSerDe. This change usually won't cause problems until you set hive.exec.compress.output=true, because TextInputFormat uses the file extension to figure out the compression codec: it treats a .avro file as plain text even though the file is actually deflate- or Snappy-compressed.
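To see why the rows come back as binary noise, here is a stdlib-only sketch (independent of Hadoop, so the codec lookup itself is not shown): when no codec is matched for the .avro extension, the raw deflate stream is printed as if it were text. The zlib header byte 0x78 is the ASCII letter "x", which is why the garbled rows in the output below start with "x".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateAsText {
    // Deflate-compress one result row, as hive.exec.compress.output would.
    static byte[] deflate(String row) {
        Deflater def = new Deflater();
        def.setInput(row.getBytes(StandardCharsets.UTF_8));
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        return out.toByteArray();
    }

    // What a codec-aware reader would do: inflate first, then decode as text.
    static String inflate(byte[] data) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!inf.finished()) {
            out.write(buf, 0, inf.inflate(buf));
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] compressed = deflate("john\t34");
        // Decoding the compressed bytes directly (what happens when the
        // .avro extension matches no codec) prints garbage beginning with 'x'.
        System.out.println("as text:  " + new String(compressed, StandardCharsets.ISO_8859_1));
        System.out.println("inflated: " + inflate(compressed));
    }
}
```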

You can reproduce it like this:

hive> CREATE TABLE haivvreo_players
> COMMENT "test haivvreo tables"
> ROW FORMAT SERDE
> 'com.linkedin.haivvreo.AvroSerDe'
> WITH SERDEPROPERTIES (
>         'schema.literal'='{
>         "namespace": "com.linkedin.haivvreo",
>         "name": "players_schema",
>         "type": "record",
>         "fields": [ { "name":"id","type":"int"},
>         { "name":"user_name","type":"string"},
>         { "name":"age","type":"int"} ]
>         }')
> STORED AS INPUTFORMAT
> 'com.linkedin.haivvreo.AvroContainerInputFormat'
> OUTPUTFORMAT
> 'com.linkedin.haivvreo.AvroContainerOutputFormat';

Load some data into the table, then run the following commands:

hive> set hive.exec.compress.output=false;
hive> select user_name, age from haivvreo_players;
....
OK
john    34
ben     15
jean    17
Time taken: 8.886 seconds

hive> set hive.exec.compress.output=true;
hive> select user_name, age from haivvreo_players;
...
OK
x�      NULL
x�����c46�JJ�c44��JM��\L��      NULL
Time taken: 9.247 seconds