jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0
59 stars 27 forks source link

How does schema migration work? #21

Closed IllyaYalovyy closed 12 years ago

IllyaYalovyy commented 12 years ago

Environment: CDH3U3, avro-1.5.4 and avro-mapred-1.5.4, haivvreo-1.0.7

  1. Create table with partitions and Avro schema:
    
    {
    "namespace":"com.expedia.edw.gco.navigator",
    "name":"item",
    "type":"record",
    "fields":[
     {"name":"a","type":["null","string"]},
     {"name":"b","type":["null","int"]}
    ]
    }
    
  2. Populate this table with data
  3. Change Avro schema: add a new field:
    
    {
    "namespace":"com.expedia.edw.gco.navigator",
    "name":"item",
    "type":"record",
    "fields":[
     {"name":"a","type":["null","string"]},
     {"name":"a1","type":["null","string"]},
     {"name":"b","type":["null","int"]}
    ]
    }
    
  4. Execute simple query:
    
    select \* from illya_avro_1 where load_tag=3 limit 10;
    
  5. Exception:
     
    FATAL ExecMapper: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@14d921a
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:520)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
    Caused by: org.apache.avro.AvroTypeException: Found {
    "type" : "record",
    "name" : "item",
    "namespace" : "com.expedia.edw.gco.navigator",
    "fields" : [ {
     "name" : "a",
     "type" : [ "null", "string" ]
    }, {
     "name" : "b",
     "type" : [ "null", "int" ]
    } ]
    }, expecting {
    "type" : "record",
    "name" : "item",
    "namespace" : "com.expedia.edw.gco.navigator",
    "fields" : [ {
     "name" : "a",
     "type" : [ "null", "string" ]
    }, {
     "name" : "a1",
     "type" : [ "null", "string" ]
    }, {
     "name" : "b",
     "type" : [ "null", "int" ]
    } ]
    }

  1. Expected result: query returns result set where new field is populated with NULL
jghoman commented 12 years ago

@IllyaYalovyy You need to update the schema.literal or schema.url with the new schema via "ALTER TABLE table_name SET SERDEPROPERTIES ('schema.literal' = 'BLAH');" or, if you use schema.url either update what the url is pointing to, or update to a new url.

IllyaYalovyy commented 12 years ago
  1. means the schema was updated at the location "schema.url" is pointing to. It cause exception that is shown in 5. Is there any work around for this case?

Thanks, Illya

jghoman commented 12 years ago

Ah, for newly added fields you have to add a default value in the schema that old records return: http://avro.apache.org/docs/1.5.4/spec.html under complex types/records.

jghoman commented 12 years ago

@IllyaYalovyy Did using a default work for you?

IllyaYalovyy commented 12 years ago

Jakob,

I run a lot of tests yesterday and it worked like a charm. Thanks a lot for your support.

IllyaYalovyy commented 12 years ago

Here is my final schema: { "namespace":"com.expedia.edw.gco.navigator", "name":"item", "type":"record", "fields":[ {"name":"a","type":["null","string"]}, {"name":"a1","type":["null","string"], "default":null}, {"name":"b","type":["null","int"]}, {"name":"b1","type":["null","int"], "default":null}, {"name":"c", "type": {"type":"record", "name":"c_t", "fields":[{"name":"col1","type":["null","string"]}]}}, {"name":"c1", "type": {"type":"record", "name":"c1_t", "fields":[{"name":"col1","type":["null","string"], "default":null}]}, "default":null}, {"name":"d", "type":{"type":"array", "items":{"type":"record", "name":"da_t", "fields":[{"name":"col1","type":["null","string"], "default":null}]}}, "default":null} ] }