Closed joscha closed 2 months ago
Closing this for now, I think it needs an additional fix in the bucket selection for getValueBucket
, see https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/types.js#L1261
For my current project's, very specific use-case (avro schemas generated from an openapi definition and then the data queried via a generated client as well, so the class names match) this version works:
function getValueBucket(val) {
if (val === null) {
return 'null';
}
var bucket = typeof val;
if (bucket === 'object') {
// Could be bytes, fixed, array, map, or record.
if (Array.isArray(val)) {
return 'array';
} else if (Buffer.isBuffer(val)) {
return 'buffer';
}
if (val.constructor.name) {
return 'model.' + val.constructor.name;
};
}
return bucket;
}
however
if (val.constructor.name) {
return 'model.' + val.constructor.name;
};
doesn't transfer to raw JSON. The current namespace would need to be passed into the UnwrappedUnion
class as well, in order to qualify the type.
Generally, there is a discriminator (an attribute named type
) on the containing object and the object to serialize, which can be used to discriminate, but that would need a lot more introspection on the types. I wonder if a custom resolver in getValueBucket
would be acceptable, so the serialization implementation can decide? Thoughts?
Reopened this with the proper implementation respecting the namespace and inferred arrays holding objects. Would be great to get some feedback on this and see if this is the right direction. If yes, will add tests of course. All current tests pass with this change.
Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?
Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).
Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?
Of course. I am using this API from Affinity CRM (OpenAPI spec at the top or a current extract here) to generate a Typescript client with openapi-generator/typescript. You can see the implementation of it here.
Our current Affinity CRM instance has multiple 10 thousand items in it of the Company
type.
A Company
type can have an arbitrary number of fields with each field having one of many different value types (see here). The overall data is about 1.4 GB of raw JSON.
I am using openapi-generator/avro-schema to generate an Avro schema for the same data.
I then read the data in batches of 100 from the API and stream them directly through avsc
on disk using a file encoder.
From there I import them into Snowflake.
I want the data in Snowflake to be:
I also want to:
field.value = {
'model.TextValue': field.value
}
in order to adhere to the rule that the value needs to sit in an object with a single key of its type name
Does that make sense?
Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).
Yes, I am currently using wrapUnions: 'never'
, however the unions I have do almost certainly not need wrapping. I believe the auto
value is currently not deciding correctly. With the fix in the value buckets I don't need to enforce using unwrapped unions, the automatic detection does it right.
Thank you for the context.
This change will not help you since you are uploading Avro-encoded files to Snowflake. All unions have the same Avro encoding, independent of their in-memory representation. You will have to check on the Snowflake side to see how they represent the data.
W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.
You will have to check on the Snowflake side to see how they represent the data.
with 'never'
I seem to just get a plain JSON object. I haven't tried wrapped unions.
W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.
Yes, hence the guard.
Yes, hence the guard.
It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:
const type = avro.Type.forSchema([
'null',
{type: 'record', name: 'test.Foo', fields: [{name: 'id', type: 'string'}]},
]);
type.toBuffer({id: 'abc'}); // Not valid anymore
(We should add a test for this.)
Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:
const val = type.fromBuffer(Buffer.from('0206616263', 'hex')); // Generated record instance
type.toBuffer(val); // Throws
For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.
Here is one way we could achieve this:
unwrapUnions
option to also accept a function, which is expected to return a disambiguation function (e.g. (types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined
) or nothing.This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.
WDYT?
Yes, hence the guard.
It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:
Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:
For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.
Okay, I get your point, thank you for elaborating. Reading the constructor name is definitely not a great implementation option anyway.
Here is one way we could achieve this:
- We allow the
unwrapUnions
option to also accept a function, which is expected to return a disambiguation function (e.g.(types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined
) or nothing.- This function is called at schema parsing time on each union with its branches' types. If it returns a non-null (function) value, that function will be called each time a value's branch needs to be inferred and should return the branch's index. In this case the union will use an unwrapped representation. Otherwise, the union will be wrapped.
This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.
WDYT?
I like it. This is what we would use instead of the result from getValueBucket
in those cases when a projection function has been supplied AND returns a defined value, correct?
(We should add a test for this.)
I can do this when I open the pull request for the above?
I believe the above implementation suggestion can also solve https://github.com/mtth/avsc/pull/225#issuecomment-473723338 - you can simply return the discriminator of the union to ensure there is no wrapping needed. (cc @hath995 - might be a bit late for you possibly)
I like it. This is what we would use instead of the result from getValueBucket in those cases when a projection function has been supplied AND returns a defined value, correct?
That's right.
I can do this when I open the pull request for the above?
A separate tiny PR adding the test would be best.
A separate tiny PR adding the test would be best.
@mtth PTAL
Will you also want this backported to
5.x
? (It will probably be at least a few weeks before6.0
is released.)
I wasn't going to - given it's not a fix, I think it could reasonably be expected to upgrade if the feature is needed? Do you think we need to?
Thanks @joscha!
I think it could reasonably be expected to upgrade if the feature is needed? Do you think we need to?
I was checking in case you needed it to be backported, to use before 6.0
is released. Otherwise, yes - no reason to.
@mtth are you able to publish a new 6.x alpha version which includes this change by any chance, please?
@joscha - done: 6.0.0-alpha.14
(Consider pinning, there will be breaking changes in later alpha releases.)
The Avro specification defines:
see https://avro.apache.org/docs/1.11.1/specification/#unions
meaning that if a
record
type has a name, it is valid to mix multiple of them in a union, unwrapped.The same change should possibly be made for enums as well.