klarna / erlavro

Avro support for Erlang/Elixir (http://avro.apache.org/)
Apache License 2.0
131 stars 39 forks source link

avro_json_encoder:encode_schema does not output Parsing Canonical Form for records #56

Closed reachfh closed 6 years ago

reachfh commented 6 years ago

I am using erlavro to generate Kafka messages, and am generating a schema fingerprint as described in https://avro.apache.org/docs/1.8.2/spec.html#schema_fingerprints (I have written an Erlang implementation of the Rabin fingerprint which I could contribute to erlavro, if you would like it.)

I want to generate the fingerprint based on the Parsing Canonical Form as described in https://avro.apache.org/docs/1.8.2/spec.html#Transforming+into+Parsing+Canonical+Form I am trying to do this by running

avro_json_encoder:encode_schema(avro_json_decoder:decode_schema(Json)

avro_json_encoder:encode_schema outputs fields in a different order for records, though. It should be:

"[ORDER] Order the appearance of fields of JSON objects as follows: name, type, fields, symbols, items, values, size. For example, if an object has type, name, and size fields, then the name field should appear first, followed by the type and then the size fields."

reachfh commented 6 years ago

I made a simple project with my fingerprint code: https://github.com/cogini/avro_fingerprint It's using mix to build (I mainly write Elixir these days). This runs the test cases from the Avro Java code. erlavro gives an error on test 016, "union should have at least one member type", so it's commented out. Otherwise it shows canonicalization differences.

zmstone commented 6 years ago

we would very much like to receive contribution, thanks in advance. I will add a avro:encode_schema/2 to accept an Opt to produce canonical form.

some notes: To produce canonical form, there are more work to do than fixing the JSON field orders.

  1. [PRIMITIVES] Logical type or custom type properties are to be stripped
  2. [FULLNAMES] Use full name everywhere
  3. [STRIP] Custom type properties and doc, aliases (type name aliases and record field name aliases) etc.
  4. [ORDER] You got it
  5. [STRINGS] I hope this does not mean record-default values for type fixed and bytes will have to be different in canonical form, will be pretty ugly otherwise.
  6. [INTEGERS] jsone default behavior
  7. [WHITESPACE] jsone default behavior
reachfh commented 6 years ago

Great. I will make a separate pull request with my avro_fingerprint code. I will try to make the changes for canonical form and make a pull request. Let me know how you would like it to work. My primary goal is that non-functional differences like whitespace and comments don't affect the hash, but of course it should be done properly for compatibility.

zmstone commented 6 years ago

I would like to have it fully compatible with spec for canonical form. Yes, PR is welcome. When do you think you'll be able send the PR for canonical form? (it doesn't have to be fully implemented before sending the pr, I like early involvement).

reachfh commented 6 years ago

I will probably do it in the next day or two.

zmstone commented 6 years ago

fixed in #64