mtth / avsc

Avro for JavaScript :zap:

Extending a schema causes a "truncated buffer" error when using fromBuffer #447

Closed acromarco closed 7 months ago

acromarco commented 7 months ago

I'm using avsc version 5.7.7. When I try to deserialize a buffer using a slightly extended schema, I get a "truncated buffer" error. I assume this is not expected?

const avro = require('avsc'); // avsc 5.7.7, run as a Jest test

it('should allow schema evolution', () => {
  const typeVersion1 = avro.Type.forSchema({
    type: 'record',
    name: 'Pet',
    fields: [{ name: 'name', type: 'string' }],
  });

  const dummyObjectToSerialize = { name: 'Albert' };
  const buf = typeVersion1.toBuffer(dummyObjectToSerialize);

  const deSerialized1 = typeVersion1.fromBuffer(buf);
  expect(deSerialized1).toEqual(dummyObjectToSerialize);

  const typeVersion2 = avro.Type.forSchema({
    type: 'record',
    name: 'Pet',
    fields: [
      { name: 'name', type: 'string' },
      { name: 'newField', type: 'string', default: 'myDefault' }, // Added new field with default
    ],
  });

  const deSerialized2 = typeVersion2.fromBuffer(buf); // throws "truncated buffer" error !!!
  expect(deSerialized2).toEqual({ ...dummyObjectToSerialize, newField: 'myDefault' });
});
mtth commented 7 months ago

Hi @acromarco. Reading data across compatible schemas requires a resolver. See https://github.com/mtth/avsc/issues/383#issuecomment-1081122339 for more information and an example.
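
For instance, with the types from your test above (a minimal sketch; the linked comment has more detail):

// Resolve data written with typeVersion1 into typeVersion2.
const resolver = typeVersion2.createResolver(typeVersion1);
const decoded = typeVersion2.fromBuffer(buf, resolver);
// decoded => { name: 'Albert', newField: 'myDefault' }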

acromarco commented 7 months ago

@mtth

Reading data across compatible schemas requires a resolver.

Thank you for your answer! Somehow I expected that basic schema evolution works out of the box. The documentation at https://github.com/mtth/avsc/wiki/Advanced-usage#schema-evolution gave me the impression that creating a resolver is only needed in special cases, e.g. to improve performance. Would it be technically possible for Avro to allow schema evolution without creating resolvers?

When I create a resolver, reading buffers written with the old schema now works, but buffers written with the new schema no longer do:

const resolver = typeVersion2.createResolver(typeVersion1);

// works fine now, cool :-) !
const deSerialized2 = typeVersion2.fromBuffer(buf, resolver);
expect(deSerialized2).toEqual({ ...dummyObjectToSerialize, newField: 'myDefault' });

const dummyObjectToSerialize2 = { name: 'Albert', newField: 'myValue' };
const buf2 = typeVersion2.toBuffer(dummyObjectToSerialize2);

// works fine
const deSerialized3 = typeVersion2.fromBuffer(buf2);
expect(deSerialized3).toEqual(dummyObjectToSerialize2);

// throws "trailing data" error :-(
const deSerialized4 = typeVersion2.fromBuffer(buf2, resolver);
expect(deSerialized4).toEqual(dummyObjectToSerialize2);

Reading a buffer written with the new schema through the resolver results in a "trailing data" error :-(.

So, what is the recommended way to decode a buffer that may come from any of several schema versions? Is it necessary to add some kind of "schema version" field in order to decide whether to use a resolver or not? This could get messy after a few iterations of schema evolution.

Also, what should I do when the reader doesn't know about the new schema? Imagine an old client that tries to read data written with a new, extended schema:

// throws "trailing data" error :-(
const deSerialized5 = typeVersion1.fromBuffer(buf2); // buf2 was written with the new, extended schema
expect(deSerialized5).toEqual(dummyObjectToSerialize);

This also fails with the "trailing data" error. I would have expected this to work automatically, because all required fields are present in the data. How should an old client know about a new schema version in order to create a resolver?
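
(A sketch of what would work if the old client could somehow obtain the new schema, e.g. from a schema registry, resolving in the opposite direction:)

// Hypothetical: assumes the old client obtained typeVersion2 out of band.
const downgradeResolver = typeVersion1.createResolver(typeVersion2);

// The unknown newField is skipped during resolution.
const deSerialized6 = typeVersion1.fromBuffer(buf2, downgradeResolver);
expect(deSerialized6).toEqual({ name: 'Albert' });

But of course that presupposes exactly the knowledge the old client is missing.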

Sorry for all the "dumb" questions. I'm new to Avro and maybe my expectations are wrong.

acromarco commented 7 months ago

@mtth

Is the following explanation correct?

Decoding Avro-encoded data requires knowing exactly the schema that was used for encoding. This is a consequence of Avro's compact binary format: the encoded data doesn't contain enough structural information or metadata to allow mapping it to a slightly different (compatible) schema, such as one with additional optional fields. Therefore, a decoding client must create a resolver from the encoding (writer) schema and its own compatible (reader) schema. In practice this means that, to support reading data from multiple and possibly unknown compatible schemas, the Avro-encoded data needs to be accompanied either by the encoding schema itself, or by a schema version plus a way to look up the corresponding schema (e.g. a schema registry). Such a schema version must be transmitted outside the actual Avro-encoded payload, because otherwise there would be no way to read it.
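
For example, here is a minimal sketch of that idea: each payload is prefixed with a one-byte schema version that is used to look up the writer's schema. (The framing convention is my own illustration, not part of avsc or the Avro spec.)

// Illustrative framing only: a one-byte version prefix before the Avro payload.
const typesByVersion = { 1: typeVersion1, 2: typeVersion2 };

function encode(version, value) {
  const body = typesByVersion[version].toBuffer(value);
  return Buffer.concat([Buffer.from([version]), body]);
}

function decode(buf, readerType) {
  const writerType = typesByVersion[buf[0]]; // look up the encoding schema
  const resolver = readerType.createResolver(writerType);
  return readerType.fromBuffer(buf.subarray(1), resolver);
}

// A version-2 reader can now decode payloads from either version:
const v1Buf = encode(1, { name: 'Albert' });
const v2Buf = encode(2, { name: 'Albert', newField: 'myValue' });
expect(decode(v1Buf, typeVersion2)).toEqual({ name: 'Albert', newField: 'myDefault' });
expect(decode(v2Buf, typeVersion2)).toEqual({ name: 'Albert', newField: 'myValue' });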

mtth commented 7 months ago

Yes, that's right.

acromarco commented 7 months ago

Thank you! I will close this issue now, as it was never a bug but just my misunderstanding of how Avro works. However, maybe it's possible to make the documentation more foolproof in the future by: