klarna / erlavro

Avro support for Erlang/Elixir (http://avro.apache.org/)

OCF with reserved namespace name cannot be decoded due to sanity check #101

Open jbruggem opened 4 years ago

jbruggem commented 4 years ago

I'm trying to decode an OCF with invalid values for the namespace:

{"type":"record","name":"null","namespace":"null","fields":[{"name":"partition","type":"int"},  // [...]

Despite the invalid namespace, erlavro would be able to decode it correctly, were it not for a sanity check at avro_util.erl:579:

  ?ERROR_IF(lists:member(CanonicalName, ReservedNames),
            {reserved, Name, CanonicalName}).

Which produces:

     ** (ErlangError) Erlang error: {:reserved, "null", "null"}
     code: :avro_ocf.decode_file("redacted file path")
     stacktrace:
       src/avro_util.erl:579: :avro_util.verify_type_name/1
       src/avro_record.erl:90: :avro_record.type/3
       src/avro_json_decoder.erl:72: :avro_json_decoder.decode_schema/2
       src/avro_ocf.erl:77: :avro_ocf.decode_binary/1
       test/avro_archive_file_test.exs:14: (test)

Should that verification happen when deserializing existing data? Even if "null" is not a value accepted by the spec, erlavro works fine if we bypass that check; it is able to read the OCF file correctly.

I don't know the code base well enough to judge whether it makes sense from that point of view. From a user of the lib's point of view, on the other hand, it seems regrettable to prevent erlavro from decoding a file correctly when it is capable of doing so, if you see what I mean. Maybe there could be a "strict mode" option?
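For illustration, a hypothetical sketch of what such an option could look like; the module name, the "allow_reserved_names" option, and the function shape are all invented here, not erlavro's actual API:

  %% Hypothetical sketch only -- erlavro has no such option today.
  %% The idea: let callers opt out of the reserved-name check when
  %% reading data that has already been produced.
  -module(strict_mode_sketch).
  -export([verify_type_name/2]).

  %% The Avro primitive type names, which double as reserved names.
  -define(RESERVED_NAMES, ["null", "boolean", "int", "long", "float",
                           "double", "bytes", "string"]).

  verify_type_name(Name, Opts) ->
      case proplists:get_bool(allow_reserved_names, Opts) of
          true  -> ok;  %% relaxed mode: accept names like "null"
          false ->
              case lists:member(Name, ?RESERVED_NAMES) of
                  true  -> erlang:error({reserved, Name});
                  false -> ok
              end
      end.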

If relevant, we are ready to work on a PR to improve the situation.

mikpe commented 4 years ago

I'm not an expert on Avro, but it looks like the producer wanted to produce that entity with a null namespace, but did so erroneously, using "null" instead of the empty string the spec requires. So,

  1. What produced this broken data? Is it under your control or not?
  2. I would not drop the check on line 579, but instead map "null" to "" on line 568 (see the sketch below).
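A minimal sketch of that mapping, assuming string-valued namespaces; the module and function names are illustrative, not erlavro's actual code:

  %% Treat a literal "null" namespace as the empty (absent) namespace
  %% before the reserved-name check runs. Both binary and list string
  %% forms are handled; anything else passes through unchanged.
  -module(namespace_fixup_sketch).
  -export([normalize_namespace/1]).

  normalize_namespace(<<"null">>) -> <<>>;
  normalize_namespace("null")     -> "";
  normalize_namespace(Ns)         -> Ns.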
zmstone commented 4 years ago

The issue is the type's "name" field (not "namespace") being "null".

According to the spec

Primitive type names have no namespace and their names may not be defined in any namespace.

The problem with using a primitive type name for a named type is that when the type is referenced in the enclosing context, there is no way to know whether the reference means the primitive type or the named type, e.g.:

  { "type": "record",
    "name": "null",
    "fields": [
      {"name": "field1", "type": ["null", "int"]}
    ]
  }

There is no way to tell whether the "null" in field1's union refers to the primitive null or to the record type named null. Note that Avro allows named types to be referenced by short name, in which case the namespace is the closest enclosing one.

jbruggem commented 4 years ago

What produced this broken data? Is it under your control or not?

https://github.com/confluentinc/kafka-connect-hdfs - not entirely under my control, and in any case already produced :).

@zmstone your analysis makes sense, but what's your conclusion here? Are you saying that it's normal for erlavro to refuse to parse the OCF file?

If I may offer another interesting piece of information: Avro's Java implementation, running in a Flink job, can read that same file without any kind of problem. It seems to me that it would be better to match the de facto reference implementation, wouldn't it?

There is no way to tell whether the "null" in field1's union refers to the primitive null or to the record type named null.

Then wouldn't it be better to fail when that happens ("you used null and you have a type named null, I don't know what to do"), rather than much earlier, when the "null" name is defined?

edit: thank you both for your responsiveness!

zmstone commented 4 years ago

Are you saying that it's normal for erlavro to refuse to parse the OCF file?

Yes, normal. If we allow primitive type names for named types, it may work for you, but it will fail for others.

Then wouldn't it be better to fail when that happens ("you used null and you have a type named null, I don't know what to do"), rather than much earlier, when the "null" name is defined?

The decoder will actually pick the primitive type to decode. If the data is actually for the named type, this may result in a failure (which will be hard to troubleshoot because of shifted bytes in the stream), or in a garbled decoding result, which is worse than giving up early.

It seems to me that it would be better to match the de facto reference implementation, wouldn't it?

It's maybe 'de facto', but it's also a bug, which should be fixed by Confluent.

In the meantime (before Confluent can fix it), there are workarounds:

  1. avro_ocf has the decode_binary API, so in the caller you can read the file, rewrite its header, and call this API to decode the blocks (see the sketch after this list).
  2. You can fork it.
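A sketch of the first workaround, under the assumption that the offending names occur only in the OCF header's embedded JSON schema (the "avro.schema" metadata) and that the JSON is compact, as in the schema shown above. The blunt binary:replace/4 patch and the replacement name "legacy_null" are for illustration only:

  %% Read the file, rename the reserved type name (and drop the "null"
  %% namespace) inside the embedded JSON schema, then hand the patched
  %% bytes to avro_ocf:decode_binary/1.
  -module(ocf_workaround_sketch).
  -export([decode_with_renamed_type/1]).

  decode_with_renamed_type(Path) ->
      {ok, Bin0} = file:read_file(Path),
      Bin1 = binary:replace(Bin0, <<"\"name\":\"null\"">>,
                            <<"\"name\":\"legacy_null\"">>, [global]),
      Bin2 = binary:replace(Bin1, <<"\"namespace\":\"null\"">>,
                            <<"\"namespace\":\"\"">>, [global]),
      avro_ocf:decode_binary(Bin2).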
jbruggem commented 4 years ago

Thanks for taking the time to reply to me.

It's maybe 'de facto', but it's also a bug, which should be fixed by Confluent.

It would be the Apache Avro project's responsibility, rather. Though I highly doubt they'll change something that would make an unknown number of existing Avro files unreadable :). And if they did, they would provide a way to keep existing files such as mine readable. Those files exist, and beyond the spec, the purpose of a lib is to match how things are in reality.

Let's say this was a CSV file reader lib rather than an Avro file lib. Would this lib support only RFC 4180 because that's the standard? Probably not, because in practice many CSV files don't follow that spec -- it would make sense to make the lib usable for as many CSV use cases as possible.

If the data is actually for the named type, this may result in a failure (which will be hard to troubleshoot because of shifted bytes in the stream), or in a garbled decoding result, which is worse than giving up early.

I understand that risk (though the second one is improbable), and it would of course make sense to have a "strict" versus "relaxed" mode for anybody worried about this kind of situation. In any case, since the file doesn't follow the spec, is it really a problem how it fails? It's an invalid file anyway. Might as well do everything you can and let the user take the risk.

I'll implement my own OCF reader to circumvent this problem -- please understand my point of view though; it makes sense to have the library also work in practice, not only in theory.

zmstone commented 4 years ago

Hi @jbruggem

It would be the Apache Avro project's responsibility, rather.

Is this because Confluent uses Apache Avro's Java library? Then this is indeed something the Avro Java library should fix: it should not allow creating such a schema. Nonetheless, the schema can't have originated from the Avro Java library. The source should be fixed to use a different name; even a random one would work. For example, the Confluent schema registry uses a hardcoded default, e.g. https://github.com/confluentinc/schema-registry/blob/6916e84e0173b93fd0a12295a3d01c19c94081cc/avro-data/src/main/java/io/confluent/connect/avro/AvroData.java#L78-L82

Could you help me understand where exactly your schema is generated?

please understand my point of view though; it makes sense to have library also work in practice, not only in theory.

Sure. erlavro has previously accepted PRs to work with invalid schemas, such as #66, where there was no potential harm in accepting. I understand this might be a blocking issue for you, and you may already have a monkey-patch fix in place? Maybe you can send your fix as a PR, but I would really like to know the full story of this schema's origin before accepting a fix that may turn out to be unnecessary if it could easily be fixed somewhere else.