mtth / avsc

Avro for JavaScript :zap:
MIT License

feat: allow named types in unions #469

Closed joscha closed 2 months ago

joscha commented 2 months ago

The Avro specification states:

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum

see https://avro.apache.org/docs/1.11.1/specification/#unions

meaning that it is valid to mix multiple named record types in a union, unwrapped.

The same change should possibly be made for enums as well.
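
For example, a union like the following is valid under that rule, since each branch is distinguished by its record name (the names here are hypothetical, for illustration only):

[
  {type: 'record', name: 'Company', fields: [{name: 'name', type: 'string'}]},
  {type: 'record', name: 'Person', fields: [{name: 'firstName', type: 'string'}]},
]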

joscha commented 2 months ago

Closing this for now; I think it needs an additional fix in the bucket selection in getValueBucket, see https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/types.js#L1261

joscha commented 2 months ago

For my current project's very specific use case (Avro schemas generated from an OpenAPI definition, with the data then queried via a generated client as well, so the class names match), this version works:

function getValueBucket(val) {
  if (val === null) {
    return 'null';
  }
  var bucket = typeof val;
  if (bucket === 'object') {
    // Could be bytes, fixed, array, map, or record.
    if (Array.isArray(val)) {
      return 'array';
    } else if (Buffer.isBuffer(val)) {
      return 'buffer';
    }
    // New: named constructors get their own bucket, so multiple record
    // branches in an unwrapped union can be told apart.
    if (val.constructor.name) {
      return 'model.' + val.constructor.name;
    }
  }
  return bucket;
}
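
A quick illustration of the buckets this returns (Company is a hypothetical generated model class):

class Company {
  constructor(name) {
    this.name = name;
  }
}

getValueBucket(new Company('Acme')); // 'model.Company'
getValueBucket({name: 'Acme'});      // 'model.Object' (plain objects only ever yield 'Object')
getValueBucket(null);                // 'null'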

however

    if (val.constructor.name) {
      return 'model.' + val.constructor.name;
    }

doesn't transfer to raw JSON. The current namespace would need to be passed into the UnwrappedUnion class as well, in order to qualify the type.

Generally, there is a discriminator (an attribute named type) on both the containing object and the object to serialize, which could be used to pick the branch, but that would need a lot more introspection on the types. I wonder if a custom resolver in getValueBucket would be acceptable, so that the serialization implementation can decide? Thoughts?
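
To illustrate the shape meant here (a hypothetical field value from the API):

// Both the container and the value to serialize carry a `type` discriminator
// that such a resolver could match against the union's branches.
const fieldValue = {
  type: 'company',
  data: {id: 42, name: 'Acme'},
};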

joscha commented 2 months ago

Reopened this with a proper implementation that respects the namespace and also infers arrays holding objects. It would be great to get some feedback and see whether this is the right direction. If yes, I will add tests, of course. All current tests pass with this change.

mtth commented 2 months ago

Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?

Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).

joscha commented 2 months ago

Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?

Of course. I am using this API from Affinity CRM (OpenAPI spec at the top, or a current extract here) to generate a TypeScript client with openapi-generator/typescript. You can see the implementation of it here.

Our current Affinity CRM instance holds tens of thousands of items of the Company type. A Company can have an arbitrary number of fields, each holding one of many different value types (see here). The overall data is about 1.4 GB of raw JSON.

I am using openapi-generator/avro-schema to generate an Avro schema for the same data.

I then read the data in batches of 100 from the API and stream them through avsc directly to disk using a file encoder. From there I import them into Snowflake.
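
A rough sketch of that pipeline, using avsc's createFileEncoder (fetchPage is a hypothetical API-client helper, and `type` is the Type parsed from the generated schema):

const avro = require('avsc');

async function dump(type, fetchPage) {
  const encoder = avro.createFileEncoder('companies.avro', type);
  for (let page = 0; ; page++) {
    const batch = await fetchPage(page, 100); // 100 items per request
    if (batch.length === 0) break;
    for (const val of batch) {
      encoder.write(val); // values are Avro-encoded as they stream to disk
    }
  }
  encoder.end();
}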

I want the data in Snowflake to be:

I also want to:

Does that make sense?

Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).

Yes, I am currently using wrapUnions: 'never'; however, the unions I have almost certainly do not need wrapping. I believe the auto value currently does not decide correctly. With the fix in the value buckets I don't need to force unwrapped unions; the automatic detection gets it right.
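
For reference, that option is passed at parse time (`schema` here stands for the generated Avro schema):

const type = avro.Type.forSchema(schema, {wrapUnions: 'never'});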

mtth commented 2 months ago

Thank you for the context.

This change will not help you since you are uploading Avro-encoded files to Snowflake. All unions have the same Avro encoding, independent of their in-memory representation. You will have to check on the Snowflake side to see how they represent the data.
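
A quick way to see this, comparing both representations of the same union schema (a minimal check using the wrapUnions option):

const avro = require('avsc');

const wrapped = avro.Type.forSchema(['null', 'string'], {wrapUnions: true});
const unwrapped = avro.Type.forSchema(['null', 'string'], {wrapUnions: false});

// Different in-memory representations, identical bytes on the wire.
wrapped.toBuffer({string: 'abc'}).equals(unwrapped.toBuffer('abc')); // true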

W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.

joscha commented 2 months ago

You will have to check on the Snowflake side to see how they represent the data.

With 'never' I seem to just get a plain JSON object. I haven't tried wrapped unions.

W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.

Yes, hence the guard.

mtth commented 2 months ago

Yes, hence the guard.

It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:

const avro = require('avsc'); // needed for the snippet to run standalone

const type = avro.Type.forSchema([
  'null',
  {type: 'record', name: 'test.Foo', fields: [{name: 'id', type: 'string'}]},
]);

// A plain object's constructor is Object, so the name-based bucket no longer
// matches the record branch.
type.toBuffer({id: 'abc'}); // Not valid anymore

(We should add a test for this.)

Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:

const val = type.fromBuffer(Buffer.from('0206616263', 'hex')); // Generated record instance
type.toBuffer(val); // Throws

For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.

Here is one way we could achieve this:

  1. We allow the unwrapUnions option to also accept a function, which is expected to return a disambiguation function (e.g. (types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined) or nothing.
  2. This function is called at schema parsing time on each union with its branches' types. If it returns a non-null (function) value, that function will be called each time a value's branch needs to be inferred and should return the branch's index. In this case the union will use an unwrapped representation. Otherwise, the union will be wrapped.

This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.
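
For concreteness, a sketch of what using such an option could look like; this mirrors the proposal above and is not a released API, and the disambiguation logic is illustrative only (`schema` is assumed to contain unions of named record types):

const type = avro.Type.forSchema(schema, {
  unwrapUnions: (types) => {
    // Called once per union at parsing time with its branches' types.
    if (!types.every((t) => t.name)) {
      return undefined; // No way to disambiguate: keep this union wrapped.
    }
    // Disambiguation function: maps a value to its branch's index, here by
    // matching constructor names against the branches' qualified names.
    return (val) =>
      types.findIndex((t) => t.name.split('.').pop() === val.constructor.name);
  },
});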

WDYT?

joscha commented 2 months ago

Yes, hence the guard.

It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:

Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:

For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.

Okay, I get your point, thank you for elaborating. Reading the constructor name is definitely not a great implementation option anyway.

Here is one way we could achieve this:

  1. We allow the unwrapUnions option to also accept a function, which is expected to return a disambiguation function (e.g. (types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined) or nothing.
  2. This function is called at schema parsing time on each union with its branches' types. If it returns a non-null (function) value, that function will be called each time a value's branch needs to be inferred and should return the branch's index. In this case the union will use an unwrapped representation. Otherwise, the union will be wrapped.

This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.

WDYT?

I like it. This is what we would use instead of the result from getValueBucket in those cases when a projection function has been supplied AND returns a defined value, correct?

(We should add a test for this.)

I can do this when I open the pull request for the above?

joscha commented 2 months ago

I believe the implementation suggested above can also solve https://github.com/mtth/avsc/pull/225#issuecomment-473723338 - you can simply return the discriminator of the union to ensure no wrapping is needed. (cc @hath995 - possibly a bit late for you.)

mtth commented 2 months ago

I like it. This is what we would use instead of the result from getValueBucket in those cases when a projection function has been supplied AND returns a defined value, correct?

That's right.

I can do this when I open the pull request for the above?

A separate tiny PR adding the test would be best.

joscha commented 2 months ago

A separate tiny PR adding the test would be best.

https://github.com/mtth/avsc/pull/477

joscha commented 2 months ago

@mtth PTAL

joscha commented 2 months ago

Will you also want this backported to 5.x? (It will probably be at least a few weeks before 6.0 is released.)

I wasn't going to - given it's not a fix, I think users could reasonably be expected to upgrade if they need the feature? Do you think we need to?

mtth commented 2 months ago

Thanks @joscha!

mtth commented 2 months ago

I think it could reasonably be expected to upgrade if the feature is needed? Do you think we need to?

I was checking in case you needed it to be backported, to use before 6.0 is released. Otherwise, yes - no reason to.

joscha commented 1 month ago

@mtth are you able to publish a new 6.x alpha version which includes this change by any chance, please?

mtth commented 1 month ago

@joscha - done: 6.0.0-alpha.14

(Consider pinning; there will be breaking changes in later alpha releases.)
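
For example, a pinned install could look like npm install --save-exact avsc@6.0.0-alpha.14, which records the exact version in package.json.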