volga-project / volga

Feature Engine for real-time AI/ML
Apache License 2.0
36 stars 4 forks source link

[SE] Message data model/serialization (avro, protobuf, etc.) #15

Open anovv opened 5 months ago

anovv commented 5 months ago

Currently we use simple json ser/de in Python using simplejson and pass messages as encoded strings. This is not performant and does not allow to proper pass schema in case of cross lang communication (e.g. between Rust operators). This also blocks proper message serialization for checkpointing.

We need to have a proper data model for a message indicating:

  1. Schema
  2. Keys/Values/Timestamp fields
  3. Message type (Watermark, Checkpoint barrier, Regular)
  4. Metadata (receiver, sender, timestamps, traces, etc.)

See more on how Flink uses different serializers and their impact on perf https://flink.apache.org/2020/04/15/flink-serialization-tuning-vol.-1-choosing-your-serializer-if-you-can/

We should pick a proper ser/de protocol (Avro looks like a top choice for streaming). Specifically, with different serializers we can see a 10x impact on data throughput