vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

protobuf framing: support varint prefix (wire format) #20156

Open lspgn opened 6 months ago

lspgn commented 6 months ago

Use Cases

Thank you for this software!

When sources or sinks use protobuf encoding/decoding, there is no way to decode varint length-prefixed messages (protowire). When serializing a stream of protobuf messages, the official Go library suggests prefixing each message with its length as a varint, treating the message like a nested message (without the tag, though).

Some tools, such as ClickHouse, use varint length-prefixed messages (e.g. when consuming from Kafka):

ClickHouse inputs and outputs protobuf messages in the length-delimited format. It means before every message should be written its length as a varint. See also how to read/write length-delimited protobuf messages in popular languages.
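
For illustration, here is a minimal Python sketch of the length-delimited format described in the quote: each message is preceded by its byte length encoded as a base-128 varint. The function names are hypothetical and the payloads are treated as opaque, already-serialized bytes; this is not code from Vector or ClickHouse.

```python
from typing import List, Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("a length must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position); raise ValueError if the varint is cut off."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_delimited(messages: List[bytes]) -> bytes:
    """Concatenate messages, each prefixed with its length as a varint."""
    return b"".join(encode_varint(len(m)) + m for m in messages)


def read_delimited(stream: bytes) -> List[bytes]:
    """Split a complete buffer of varint-prefixed messages back into payloads."""
    messages = []
    pos = 0
    while pos < len(stream):
        length, pos = decode_varint(stream, pos)
        end = pos + length
        if end > len(stream):
            raise ValueError("truncated message")
        messages.append(stream[pos:end])
        pos = end
    return messages
```

Note that an empty message is still represented on the wire (as the single byte 0x00), so it survives a round trip through `write_delimited` and `read_delimited`.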

I would like to suggest adding such a framing option.

Attempted Solutions

Currently, Vector offers two ways of framing protobuf for decoding: bytes or length_delimited.

When a source uses bytes framing (e.g. reading from a socket buffer or a file source), there is a risk that a protobuf message is "cut" or skipped (for example, when 2 messages arrive batched, only the first one is decoded and the rest is discarded). Furthermore, a default/zero-length protobuf message would be missed.

The length_delimited setting is not necessarily standard for protobuf and is not compatible with a varint prefix.

sources:
  example:
    type: socket
    mode: unix_stream
    path: "mysock.socket"
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
    framing:
      method: length_delimited # needs a uint32 prefix

Unfortunately, it's not possible to work around this with a "wrapper" protobuf message, since the field tag (1 in the example below) must also be encoded before each length varint:

message DEF {
  repeated ABC abc = 1;
}
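
To make the difference concrete, the following Python sketch compares the bytes of a wrapper message with a repeated field against plain varint framing. The two payloads are placeholder bytes standing in for serialized ABC messages (the real schema is not shown in this issue), and the helper names are hypothetical.

```python
from typing import List

LEN_WIRE_TYPE = 2  # protobuf wire type for length-delimited fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def as_wrapper_field(payloads: List[bytes], field_number: int = 1) -> bytes:
    """Bytes of a wrapper message whose repeated field holds the payloads."""
    tag = encode_varint((field_number << 3) | LEN_WIRE_TYPE)
    return b"".join(tag + encode_varint(len(p)) + p for p in payloads)


def as_varint_framed(payloads: List[bytes]) -> bytes:
    """Bytes of the same payloads with only a varint length prefix."""
    return b"".join(encode_varint(len(p)) + p for p in payloads)


# Placeholder payloads (assumption): stand-ins for two serialized ABC messages.
payloads = [b"\x08\x01", b"\x08\x02"]

wrapped = as_wrapper_field(payloads)  # 0a 02 08 01 0a 02 08 02
framed = as_varint_framed(payloads)   # 02 08 01 02 08 02
```

Each element of the wrapper carries an extra tag byte (0x0a for field 1 with the length-delimited wire type), so a varint-framed stream cannot be parsed as a DEF message.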

Proposal

My suggestion would be the following for sources and sinks.

Either have the protobuf decoder assume it will first read a varint and treat it as the message length. That said, I'm not sure whether a decoder can work in a one-to-many way (several messages decoded from one input, plus waiting for the rest of the bytes when a message is incomplete).

sources:
  example:
    ...
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
        protowire: true
    framing:
      method: bytes

Or add a proper varint method to framing:

sources:
  example:
    ...
    decoding:
      codec: protobuf
      protobuf:
        desc_file: "abc.desc"
        message_type: "abc.ABC"
    framing:
      method: varint
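
As a rough sketch of what such a framer would need to do, the Python class below buffers incoming chunks and emits a frame only once both the varint length and the full payload have arrived. It is illustrative only, not Vector's implementation (which would be written in Rust); the class name and the `max_frame_length` parameter are assumptions.

```python
from typing import List, Optional, Tuple


class VarintFramer:
    """Incrementally split a byte stream into varint length-prefixed frames."""

    def __init__(self, max_frame_length: int = 8 * 1024 * 1024) -> None:
        self._buf = bytearray()
        self._max_frame_length = max_frame_length  # hypothetical safety limit

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add a chunk and return every frame that is now complete."""
        self._buf.extend(chunk)
        frames = []
        while True:
            header = self._try_read_length()
            if header is None:
                break  # the length varint itself is not complete yet
            length, header_size = header
            if length > self._max_frame_length:
                raise ValueError("frame exceeds max_frame_length")
            end = header_size + length
            if len(self._buf) < end:
                break  # wait for the rest of the payload
            frames.append(bytes(self._buf[header_size:end]))
            del self._buf[:end]
        return frames

    def _try_read_length(self) -> Optional[Tuple[int, int]]:
        """Return (length, varint_size), or None if more bytes are needed."""
        length = 0
        for i, byte in enumerate(self._buf):
            if i >= 10:
                raise ValueError("varint is longer than 10 bytes")
            length |= (byte & 0x7F) << (7 * i)
            if not byte & 0x80:
                return length, i + 1
        return None
```

Feeding the framer arbitrary slices of a stream yields the same frames as feeding it the whole stream at once, including zero-length frames, which addresses the "cut" and "skipped" cases described under Attempted Solutions.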

Thank you!

References

No response

Version

vector 0.36.1 (2857180 2024-03-11 14:32:52.417737479)

jszwedko commented 6 months ago

Thanks for opening this @lspgn ! I was unaware of the use of varint framing with protobuf messages (I'd only seen length delimited). I think we'd be happy to see a PR introducing this. I'd suggest modeling it as the framing option since I think that would be the most consistent with the existing codec model.

lspgn commented 6 months ago

Thank you @jszwedko, I'm not familiar at all with Rust but I assume this would be in https://github.com/vectordotdev/vector/blob/master/lib/codecs/src/decoding/framing/mod.rs

jszwedko commented 6 months ago

Yeah, that's right. It could be modeled after the length delimited framer: https://github.com/vectordotdev/vector/blob/master/lib/codecs/src/decoding/framing/length_delimited.rs