influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Binary Parser: compound entry with repeat #12450

Open 0xMihir opened 1 year ago

0xMihir commented 1 year ago

Use Case

Using a new assignment type called "compound", the binary parser could parse structs that contain repeated parts.

For example, the following struct carries a variable number of sensor readings in a single message to reduce the total number of packets sent.

typedef struct {
  float temperature;
  float humidity;
  unsigned long timestamp;
} sensor_reading_t;

typedef struct {
  short header;
  int reading_count;         /* number of sensor_reading_t entries that follow */
  char location[];           /* variable-length location string */
  sensor_reading_t data[];   /* reading_count repeated readings */
} multiple_data_packet_t;
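
Purely as a sketch of the wire format above (not part of the proposal): a sender might serialize one packet like this, where pack_packet() and its arguments are made-up names, fields are written one at a time to sidestep struct padding, and a little-endian host with 64-bit unix_ms timestamps is assumed.

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes one multiple_data_packet_t into buf and
 * returns the total packet length. The caller provides a large-enough buf. */
size_t pack_packet(uint8_t *buf, const char *location, uint32_t reading_count,
                   const float *temperature, const float *humidity,
                   const uint64_t *timestamp_ms)
{
    size_t off = 0;
    uint16_t header = 0xCAFE;                       /* the configs below filter on this value */

    memcpy(buf + off, &header, 2);        off += 2;
    memcpy(buf + off, &reading_count, 4); off += 4; /* reading_count */

    size_t loc_len = strlen(location) + 1;          /* null-terminated location string */
    memcpy(buf + off, location, loc_len); off += loc_len;

    for (uint32_t i = 0; i < reading_count; i++) {  /* repeated sensor_reading_t entries */
        memcpy(buf + off, &temperature[i], 4);  off += 4;
        memcpy(buf + off, &humidity[i], 4);     off += 4;
        memcpy(buf + off, &timestamp_ms[i], 8); off += 8;
    }
    return off;
}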

Expected behavior

Using the binary parser, we can create an entry to parse the entire struct like so:

[[inputs.socket_listener]]
  service_address = "udp://:8094"

  endianess = "le"
  data_format = "binary"

  [[inputs.socket_listener.binary]]
    metric_name = "multi-example"
    entries = [
      { bits = 16, omit = true },
      { assignment = "tag", name = "location" },
      { assignment = "compound", repeat = {offset = 16, type = "uint32" },  entries = [
        { name = "temperature", type = "float32" },
        { name = "humidity", type = "float32" },
        { type = "unix_ms", assignment = "time" }
      ]},
    ]

    [inputs.socket_listener.binary.filter]
      selection = [{ offset = 0, bits = 16, match = "0xCAFE" }]

This is merely an example of what the final syntax could be. After reading the repeat count, the binary parser would apply the inner entries that many times to the remaining payload and write out all of the resulting values.
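
Purely for illustration (hypothetical values, and whether each repetition becomes its own metric is one of the open questions below): a packet carrying two readings with location "greenhouse" might come out as two metrics in line protocol, with the outer tag applied to each:

multi-example,location=greenhouse temperature=21.5,humidity=40.2 1678900000000000000
multi-example,location=greenhouse temperature=21.7,humidity=39.8 1678900060000000000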

Actual behavior

Currently, there's no way to parse different lengths of messages using the binary parser. The only possibility is parsing fixed-length structs.

Additional info

Some considerations that I haven't thoroughly thought about:

  1. Handling the timestamp per sub-entry or per entry?
  2. Where to read the repeat count from? Right before the repeated entry, or via an offset/bits combination?
     a. Should the offset use bits or types (uint8/16/32/64)?
  3. How would tags outside the repeated entries apply to the inner ones? Maybe all outer tags are copied to the inner entries?

powersj commented 1 year ago

@srebhan thoughts on this proposal?

srebhan commented 1 year ago

@0xMihir I thought about such a feature, and my idea was to declare a length field and then use it in the "array"/"compound" like

[[inputs.socket_listener]]
  service_address = "udp://:8094"

  endianess = "le"
  data_format = "binary"

  [[inputs.socket_listener.binary]]
    metric_name = "multi-example"
    entries = [
      { bits = 16, omit = true },
      { assignment = "tag", name = "location",  type = "string", terminator: "null" },
      { assignment = "field", name = "_array_len",  type = "uint32},
      { assignment = "field",  name = "arrvalue", type = "uint32", length = "@_array_len"}, 
      { assignment = "field", name = "_compound_len",  type = "uint32},
      { assignment = "compound", name="entry", length = "@_compound_len",  entries = [
        { name = "temperature", type = "float32" },
        { name = "humidity", type = "float32" },
        { type = "unix_ms", assignment = "time" }
      ]},
    ]

    [inputs.socket_listener.binary.filter]
      selection = [{ offset = 0, bits = 16, match = "0xCAFE" }]

and I would expect the code to expand the array field names by appending the indices, like arrvalue_1 ... arrvalue_N, and the compound path like entry_1_temperature ... entry_M_temperature. If the length starts with an @ sign, the length is dynamic (taken from the referenced field), but you could also simply put in a number. An alternative is to decide by the value type in TOML...
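
As a minimal sketch of that expansion (illustrative only, not Telegraf code), using the compound name "entry" and its two sub-fields from the config above, with a hypothetical repeat count M = 3:

#include <stdio.h>

int main(void)
{
    const char *compound = "entry";
    const char *subfields[] = { "temperature", "humidity" };
    unsigned M = 3;   /* hypothetical value read from the _compound_len field */

    /* prints entry_1_temperature, entry_1_humidity, entry_2_temperature, ... */
    for (unsigned i = 1; i <= M; i++)
        for (unsigned j = 0; j < 2; j++)
            printf("%s_%u_%s\n", compound, i, subfields[j]);
    return 0;
}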

My only concern is nesting... So maybe a dedicated parser for this kind of data would be better suited?

0xMihir commented 1 year ago

One issue with using fields defined elsewhere in the message to determine the length of compound arrays is that we would need to implement a variable system for the parser. Maybe we could use a termination character or sequence and read until that sequence is detected; however, I'm not sure how to best handle overflows or underflows in that case.

I think one way we could implement a parser is to recurse through the array of entries until everything has been read.
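
A rough sketch of what that recursion could look like (illustrative only, not Telegraf code; entry_t, the pre-resolved repeat count, and the scalar handling are all assumptions):

#include <stddef.h>
#include <stdint.h>

typedef struct entry {
    const char   *assignment;   /* "field", "tag", "time", "compound", ... */
    size_t        bits;         /* width of a scalar entry */
    struct entry *children;     /* sub-entries of a compound, NULL otherwise */
    size_t        child_count;
    size_t        repeat;       /* repetitions of a compound (already resolved) */
} entry_t;

/* Returns the number of bytes consumed from buf, or 0 on error. */
size_t parse_entries(const uint8_t *buf, size_t len,
                     const entry_t *entries, size_t count)
{
    size_t off = 0;
    for (size_t i = 0; i < count; i++) {
        const entry_t *e = &entries[i];
        if (e->children != NULL) {
            /* compound: recurse into its children `repeat` times */
            for (size_t r = 0; r < e->repeat; r++) {
                size_t used = parse_entries(buf + off, len - off,
                                            e->children, e->child_count);
                if (used == 0)
                    return 0;
                off += used;
            }
        } else {
            size_t bytes = e->bits / 8;
            if (off + bytes > len)
                return 0;       /* underflow: not enough data left */
            /* decode the scalar at buf + off into a field/tag/time here */
            off += bytes;
        }
    }
    return off;
}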

srebhan commented 1 year ago

I'm not sure I get your point @0xMihir. We store the length-field as a field and can thus access it when looping over the entries of a "compound"... You might be right regarding recursion.

Honestly speaking, this increase in complexity worries me a bit. I can foresee that the next request will be "can we make each of these compounds a new metric", with all kinds of intermixing between fields that are global to every metric and fields that are specific to one... Don't get me wrong, I'm not completely against this, but keep those things in mind during your design!

0xMihir commented 1 year ago

Yeah, I agree. We shouldn't scope this as a full-blown parser, as there would be far too many edge cases around where to define metrics. I'm going to keep investigating the best way to implement this.

moracca commented 1 year ago

Adding my vote for this functionality. I have a situation where a wireless controller sends MQTT data in binary format: the message contains some header information, a few payload fields (including the number of device records being sent), and then repeating sets of data, about 5 fields per device.