Open jbruggem opened 4 years ago
I'm not an expert on Avro, but it looks like the producer wanted to produce that entity with a null namespace, but did so erroneously using "null" instead of an empty string as the spec requires. So, change "null" to "" on line 568.

The issue is the type's "name" field (not "namespace") being "null".
According to the spec:

Primitive type names have no namespace and their names may not be defined in any namespace.
The problem with using a primitive type name for a named type is that when there is a reference to this type in the enclosing context, there is no way to know whether it refers to the primitive type or the named type. For example:
```json
{
  "type": "record",
  "name": "null",
  "fields": [
    {"name": "field1", "type": ["null", "int"]}
  ]
}
```
There is no way to tell if the null for field1 is the primitive null or the record type named null.
Note that Avro allows named types to be referenced by short name, in which case the namespace is the closest enclosing one.
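For illustration, here is a hypothetical schema (not one from this issue): the reference "Address" in the second field resolves to com.example.Address, because the closest enclosing namespace is com.example:

```json
{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "home", "type": {
      "type": "record",
      "name": "Address",
      "fields": [{"name": "street", "type": "string"}]
    }},
    {"name": "work", "type": "Address"}
  ]
}
```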
What produced this broken data? Is it under your control or not?
https://github.com/confluentinc/kafka-connect-hdfs - not entirely under my control, and in any case already produced :).
@zmstone your analysis makes sense, but what's your conclusion here? Are you saying that it's normal for it to refuse to parse the OCF file?
If I may offer another interesting piece of information: Avro's Java implementation, running in a Flink job, can read that same file without any kind of problem. It seems to me that it would be better to match the de facto reference implementation, wouldn't it?
There is no way to tell if the null for field1 is the primitive null or the record type named null.
Then wouldn't it be better to fail when that happens ("you used null and you have a type named null, I don't know what to do"), rather than much earlier, when the "null" name is used?
edit: thank you both for your responsiveness!
Are you saying that it's normal for it to refuse to parse the OCF file?
Yes, normal. If we allowed primitive type names for named types, it might work for you, but it would fail for others.
Then wouldn't it be better to fail when that happens ("you used null and you have a type named null, I don't know what to do"), rather than much earlier, when the "null" name is used?
The decoder will actually pick the primitive type to decode. If the data is actually for the named type, then it may result in a failure (which will be hard to troubleshoot because of shifted bytes in the stream), or it may produce a garbled decoding result, which is worse than giving up early.
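To make the "shifted bytes" failure mode concrete, here is a minimal self-contained sketch (it does not use erlavro, and the encoding helper only covers small single-byte values): a writer encodes union branch 0 as a record containing an int, while a reader that resolves branch 0 to the primitive null consumes zero value bytes and misreads the record's payload as the next field.

```erlang
-module(shift_demo).
-export([run/0]).

%% Avro's zigzag encoding; the single-byte case is enough for this demo.
zigzag(I) when I >= 0 -> I bsl 1;
zigzag(I) -> (-I bsl 1) - 1.

encode_int(I) -> <<(zigzag(I))>>.  %% valid only for small values

run() ->
    %% Writer's view: union index 0 selects the record named "null",
    %% whose single int field is 7; the next top-level field is int 1.
    Stream = <<(encode_int(0))/binary,
               (encode_int(7))/binary,
               (encode_int(1))/binary>>,
    %% Reader's view: union index 0 is the primitive null, which has a
    %% zero-byte encoding, so the 7 is misread as the next field and
    %% every later value in the stream is off by one.
    <<0, Rest/binary>> = Stream,
    io:format("bytes left for the next field: ~w~n", [Rest]).
```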
It seems to me that it would be better to match the de facto reference implementation, wouldn't it?
It may be 'de facto', but it's also a bug, which should be fixed by Confluent.
In the meantime (before Confluent can fix it), there is a workaround: avro_ocf has the decode_binary API, so in the caller you can read the file, rewrite its header, and call this API to decode the blocks. A sketch follows below.
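A minimal sketch of that workaround, assuming avro_ocf:decode_binary/1 accepts the whole file binary, that the schema JSON in the OCF header is serialized without spaces around the colon, and that a same-length replacement keeps the header's varint length prefixes valid; "Null" is an arbitrary placeholder name (Avro names are case-sensitive, so it does not clash with the primitive null):

```erlang
%% Read the whole OCF file, patch the offending "null" type name in
%% the embedded schema JSON, then decode the patched binary.
%% The replacement "Null" has the same byte length as "null", so the
%% length prefixes in the OCF header stay valid, and binary:replace/3
%% only patches the first occurrence, which is in the header.
decode_with_patched_header(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    Patched = binary:replace(Bin,
                             <<"\"name\":\"null\"">>,
                             <<"\"name\":\"Null\"">>),
    avro_ocf:decode_binary(Patched).
```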
Thanks for taking the time to reply to me.

It may be 'de facto', but it's also a bug, which should be fixed by Confluent.
It would be the Apache Avro project's responsibility, rather. Though I highly doubt they'll change something that would make an unknown number of existing Avro files unreadable :). And if they did, they would provide a way to keep existing files such as mine readable. Those files exist, and beyond the spec, the purpose of a library is to match how things are in reality.
Let's say this was a CSV file reader library rather than an Avro file library. Would this library support only RFC 4180 because that's the standard? Probably not, because in practice many CSV files don't follow that spec; it would make sense to make the library usable for as many CSV use cases as possible.
If the data is actually for the named type, then it may result in a failure (which will be hard to troubleshoot because of shifted bytes in the stream), or it may produce a garbled decoding result, which is worse than giving up early.
I understand that risk (though the second outcome is improbable), and it would of course make sense to have a "strict" versus "relaxed" mode for anybody worried about this kind of situation. In any case, since the file doesn't follow the spec, does it really matter how it fails? It's an invalid file anyway. Might as well do everything you can and let the user take the risk.
I'll implement my own OCF reader to circumvent this problem. Please understand my point of view, though: it makes sense for a library to also work in practice, not only in theory.
Hi @jbruggem
It would be the Apache Avro project's responsibility, rather.
Is this because Confluent uses Apache Avro's Java library? Then this is indeed something the Avro Java library should fix: it should not allow creating such a schema. Nonetheless, the schema can't have originated from the Avro Java library. The source should be fixed to use a different name; even a random one would work. For example, the Confluent schema registry uses a hardcoded default, e.g. https://github.com/confluentinc/schema-registry/blob/6916e84e0173b93fd0a12295a3d01c19c94081cc/avro-data/src/main/java/io/confluent/connect/avro/AvroData.java#L78-L82
Could you help me understand where exactly your schema is generated?
Please understand my point of view, though: it makes sense for a library to also work in practice, not only in theory.
Sure. erlavro has accepted PRs before to work with invalid schemas, such as #66, which had no potential harm in accepting.
I understand this might be a blocking issue for you; you may already have a monkey-patch fix in place?
Maybe you can send your fix as a PR, but I would really like to know the full story of this schema's origin before accepting a fix that may turn out to be unnecessary if it could easily be fixed somewhere else.
I'm trying to decode an OCF with invalid values for the namespace.

Despite the namespace, erlavro would be able to decode it correctly, except for a sanity check in avro_utils:579, which produces an error.

Should that verification happen when deserializing existing data? Even though "null" is not a value accepted by the spec, erlavro works fine if we bypass that check: it is able to read the OCF file correctly.

I don't know the code base well enough to tell whether it makes sense from that point of view. From a library user's point of view, on the other hand, it seems regrettable to prevent erlavro from decoding a file correctly when it's capable of doing so, if you see what I mean. Maybe there could be a "strict mode" option?
If relevant, we are ready to work on a PR to improve the situation.