mtth / avsc

Avro for JavaScript :zap:
MIT License
1.27k stars 147 forks source link

Unwrapping unions when deserialising #451

Closed ftcvlad closed 5 months ago

ftcvlad commented 6 months ago

Hey, this is a bit of general avro question. According to the spec https://avro.apache.org/docs/1.10.2/spec.html#json_encoding when unions are serialised they are wrapped. But there is nothing about unwrapping in the spec, so, i guess avsc also doesnt implement it out of the box? I could do something with logical types like suggested here https://github.com/mtth/avsc/issues/129#issuecomment-329944723 , but it would couple services producing avro to the one deserialising it (yes, they share schema anyway, but now i would also need a logical type for every new union?)

What is the right approach here? To me it seems a little strange that serialisation modifies the data and there is no out of the box way to get the original data on the other end. Is it so, or am i missing something?

mtth commented 6 months ago

Hi @ftcvlad. JSON-encoded unions are wrapped but avsc's (and other libraries') in-memory representation can differ from this. This doesn't mean the data is modified by serialization: the data is simply represented differently, similar to how its buffer encoding is different from the in-memory representation (JSON just happens to be more similar and human-readable).

The best in-memory representation--unwrapped/wrapped unions or a custom logical type--will depend on your requirements. In general I would recommend avoiding the additional complexity of logical types unless the default representations are not sufficient. It's hard to say more without knowing more about your use-case.

ftcvlad commented 6 months ago

Hi @mtth , so my use case is

Service A -> pubsub -> Service B with avsc in it

pubsub validation expects data to be encoded / wrapped. Now in service B i want to get my unencoded, unwrapped, original representation data. Is there a way to achieve it with avsc? I was trying to do smth along

const unwrapped = type.clone(data, {
    fieldHook: (field, value, type) => {
      const isUnion = Type.isType(field.type, 'union');
      if (isUnion) {
        const keys = Object.keys(value);
        return value[keys[0]];
      }
      return value;
    }
});

But this feels a bit unsafe to mangle with data like this -- there might be extra steps to make it work for primitive types, fieldHook is not called for array items (e.g. if i have an array of union type it stays unwrapped), test etc. Is there a simpler / more conventional way? Or am i doing something unexpected?

ftcvlad commented 6 months ago

@mtth i ended up with

  const unwrapped = type.clone(data, {
    fieldHook: (field, value) => {
      const isArray = Type.isType(field.type, 'array');

      if (!isArray) {
        if (Type.isType(field.type, 'union:wrapped')) {
          const key = Object.keys(value)[0];

          return value[key];
        }

        return value;
      }

      const arrayItemType = (field.type as any).getItemsType();
      if (Type.isType(arrayItemType, 'union:wrapped')) {
        const res = [];
        for (let i = 0; i < value.length; i++) {
          const key = Object.keys(value[i])[0];

          res.push(value[i][key]);
        }

        return res;
      }

      return value;
    }
  });

which works for my data currently. But qns from above remain

mtth commented 6 months ago

There is no conventional way to do this as it's in general dangerous: not all unions can be safely unwrapped.

By default avsc will automatically unwrap most unions when it is safe to do so, including all optional fields. You would typically only unwrap further if you are using properties of your data, for example in https://github.com/mtth/avsc/issues/129#issuecomment-329944723.

If you have a need to tightly control your data's representation (maybe you have some internal APIs which do not work well with single value objects?), I would recommend creating an independent internal representation decoupled from the serialization layer's. You could include the snippet above along with any appropriate validation logic in it, giving you more confidence in the data and the ability to easily extend or swap it out later on.