linkedin / avro-util

Collection of utilities to allow writing java code that operates across a wide range of avro versions.
BSD 2-Clause "Simplified" License
76 stars 59 forks source link

Added serde customization support in Fast-Avro #520

Closed gaojieliu closed 1 year ago

gaojieliu commented 1 year ago

We have the following requirements: For serialization, we would like to validate whether all the map fields are using the desired map type. For deserialization, we would like to deserialize the map type into a special map impelementation for later use.

These customized requirements are not supported in the past because of the following reasons:

  1. Fast classes generated are shared, so it is possible different users of the same schema may have different requirement.
  2. For the same process, for different schema, the requirements can be different too.
  3. No way to specify customized logic/data type when generating fast classes.

This PR adds a new functionality to specify customized logic and it is expandable and backward compatible. DatumReaderCustomization : customization for read DatumWriterCustomization : customization for write

Currently, these classes only support the requirements mentioned at the beginning.

How it works internally?

  1. Each Fast DatumReader/DatumWriter constructor will take a new param for customization.
  2. Each Fast DatumReader/DatumWriter will keep a local vanilla-Avro based implementation with customization support since the shared vanilla-Avro based implementation is still the default implementation.
  3. Each generated Fast class will have a new param for customization in serialize/deserialize APIs.
  4. Fast DatumReader/DatumWriter will call this new API with customization param of Fast classes.
  5. The read/write API in Fast DatumReader/DatumWriter doesn't change, so it is backward compatible.
codecov-commenter commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (2fb570c) 45.77% compared to head (854aab3) 45.78%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #520 +/- ## ============================================ + Coverage 45.77% 45.78% +0.01% - Complexity 4440 4444 +4 ============================================ Files 403 403 Lines 28070 28070 Branches 4622 4622 ============================================ + Hits 12850 12853 +3 Misses 13664 13664 + Partials 1556 1553 -3 ``` [see 5 files with indirect coverage changes](https://app.codecov.io/gh/linkedin/avro-util/pull/520/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=linkedin)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

FelixGV commented 1 year ago

Thanks for adding back the generated classes @gaojieliu, but can we add them before the code change, so we can see the diff? i.e. these steps as separate commits:

  1. Tweak the ignore file to add back the generated code (perhaps in just one Avro version, or all version... either way may be fine...).
  2. Do the code change.
  3. Changes to the generated code.
  4. (Possibly optional) undo the tweak to the ignore file to re-ignore generated code...

On a separate note, @radai-rosenblatt, how strongly do you feel about not checking in the generated code? It makes it quite tedious, needing to do these git acrobatics in order to see the effect of the changes to the meta code...

gaojieliu commented 1 year ago

Thanks for adding back the generated classes @gaojieliu, but can we add them before the code change, so we can see the diff? i.e. these steps as separate commits:

  1. Tweak the ignore file to add back the generated code (perhaps in just one Avro version, or all version... either way may be fine...).
  2. Do the code change.
  3. Changes to the generated code.
  4. (Possibly optional) undo the tweak to the ignore file to re-ignore generated code...

On a separate note, @radai-rosenblatt, how strongly do you feel about not checking in the generated code? It makes it quite tedious, needing to do these git acrobatics in order to see the effect of the changes to the meta code...

It will be great to checkin one version of generated classes, otherwise the PR effort will be too high..

krisso-rtb commented 12 months ago

Hi @gaojieliu After upgrading to 0.3.21 we started observing errors:

java.lang.NullPointerException: Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null
    at org.apache.avro.Schema.applyAliases(Schema.java:1890)
    at org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:131)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
    at com.linkedin.avro.fastserde.FastSerdeUtils$FastDeserializerWithAvroSpecificImpl.deserialize(FastSerdeUtils.java:68)
    at com.linkedin.avro.fastserde.FastGenericDatumReader.read(FastGenericDatumReader.java:126)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:263)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:248)

Presence of the new class (FastSerdeUtils) pointed us to this PR... Has something changed?

We use

new FastSpecificDatumReader<>(null,  schema);

but it looks like null is not allowed anymore.

FelixGV commented 12 months ago

Thanks for reporting... that is weird, since I don't see changes to the FastSpecificDatumReader constructors... but... what does it mean to have a null writer schema, though? Are you expecting the writer schema to default the reader schema in that case?

krisso-rtb commented 12 months ago

OK, it's a bit more complex scenario :)

Schema schema = SomeAvroClass.getClassSchema();
FastSpecificDatumReader<SomeAvroClass> datumReader = new FastSpecificDatumReader<>(null, schema);

Path path = Paths.get("/Users/kris/dev/data-file-with-schema-in-header.avro");
try (InputStream inputStream = Files.newInputStream(path);
    DataFileStream<SomeAvroClass> reader = new DataFileStream<>(inputStream, datumReader)) {
    for (SomeAvroClass obj : reader) {
        System.out.println(obj);
    }
}

DataFileStream constructor calls void initialize(InputStream in, byte[] magic) method which extracts writerSchema from inputStream metadata. At the end the helper method invokes

reader.setSchema(header.schema); // replaces writerSchema if it's null --> that's our case

where reader is our datumReader.

Now, in version 0.3.21, new field was added to FastGenericDatumReader:

private final FastDeserializer<T> coldDeserializer;

which ignores invocation of public void setSchema(Schema schema) {

Basically setSchema(Schema schema) should be somehow cascaded to coldDeserializer.


I did an experiment and injecting

((FastSerdeUtils.FastDeserializerWithAvroSpecificImpl) this.coldDeserializer).customizedDatumReader.setSchema(schema);

to com.linkedin.avro.fastserde.FastGenericDatumReader.setSchema() fixed the issue. Of course the real fix should should be done in a better way (i.e. no casting).

FelixGV commented 12 months ago

Basically setSchema(Schema schema) should be somehow cascaded to coldDeserializer.

I did an experiment and injecting

((FastSerdeUtils.FastDeserializerWithAvroSpecificImpl) this.coldDeserializer).customizedDatumReader.setSchema(schema);

to com.linkedin.avro.fastserde.FastGenericDatumReader.setSchema() fixed the issue. Of course the real fix should should be done in a better way (i.e. no casting).

Ah, I see, that's very interesting... we should add a test for that (:

krisso-rtb commented 9 months ago

Hi @FelixGV @gaojieliu I provided a fix for that:

We also provided bugfix for another case:

krisso-rtb commented 6 months ago

Hi, It looks like the feature has a bug which should be fixed by: