kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
394 stars 263 forks source link

Kite Avro SDK: Merge Can Create 'type' Lists with Default Not as First Element #446

Open jake-greene opened 8 years ago

jake-greene commented 8 years ago

From the Avro spec

Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.

However, merging two record schemas does not always enforce this rule. For example: "type":["string","null"], "default":null which should be "type":["null", "string"], "default":null. I can reproduce this bug with the following:

  1. A Json String with a null value for key X
  2. A Json String with a non-null value for key X
  3. A Json String with no entry for X (DNE)
  4. Schemas inferred from each of the sample Json Strings
  5. The schemas merged in the following way: merge(non-null, merge(null, dne)) or merge(non-null, merge(dne, null))
scala> val nul = """{"key":null}"""
nul: String = {"key":null}

scala> val dne = """{"other":3}"""
dne: String = {"other":3}

scala> val str = """{"key":"hello"}"""
str: String = {"key":"hello"}

scala> def stream(s: String): InputStream = new ByteArrayInputStream(s.getBytes("UTF-8"))
stream: (s: String)java.io.InputStream

scala> val nulSchema = JsonUtil.inferSchema(stream(nul), "com.example", 1)
nulSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":"null","doc":"Type inferred from 'null'"}]}

scala> val dneSchema = JsonUtil.inferSchema(stream(dne), "com.example", 1)
dneSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"other","type":"int","doc":"Type inferred from '3'"}]}

scala> val nPlusDne = SchemaUtil.merge(dneSchema, nulSchema)
nPlusDne: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"other","type":["null","int"],"doc":"Type inferred from '3'","default":null},{"name":"key","type":"null","doc":"Type inferred from 'null'","default":null}]}

scala> val strSchema = JsonUtil.inferSchema(stream(str), "com.example", 1)
strSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":"string","doc":"Type inferred from '\"hello\"'"}]}

scala> val merged = SchemaUtil.merge(strSchema, nPlusDne)
[WARNING] Avro: Invalid default for field key: null not a ["string","null"]
merged: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":["string","null"],"doc":"Type inferred from '\"hello\"'","default":null},{"name":"other","type":["null","int"],"doc":"Type inferred from '3'","default":null}]}

The final merge produces "type":["string","null"], "default":null, despite the type of the default value needing to be the first element of the type list.

mkwhitacre commented 8 years ago

@jake-greene issues for Kite are usually tracked on the projects JIRA instance.[1]

[1] - https://issues.cloudera.org/projects/KITE

jake-greene commented 8 years ago

Thank you, @mkwhitacre. Should I create an issue there and close this one?

mkwhitacre commented 8 years ago

Probably would be good so this issue doesn't get forgotten.