rabbitmq / osiris

Log based streaming subsystem for RabbitMQ

Efficient filtering at the chunk level #121

Closed kjnilsson closed 1 year ago

kjnilsson commented 1 year ago

Is your feature request related to a problem? Please describe.

Some streams may contain message data that not all consumers are interested in. Such consumers would like an efficient way to have this data filtered out at the broker (in RabbitMQ's use case).

Describe the solution you'd like

Because of how osiris data is formatted on disk, this filtering would need to be done at the chunk level: the osiris_log reader reads the chunk header and then decides whether or not to return the current chunk (similar to how the offset reader filters out non-user chunks).

Currently the chunk header has 4 unused bytes. We can use the first of these bytes to indicate the presence and size of a small bloom filter that immediately follows the chunk header. The size is given in bytes, so the maximum bloom filter size is 255 bytes, i.e. 2040 bits (255 * 8).
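
To make the size arithmetic concrete, here is a minimal Python sketch of packing and unpacking those 4 bytes. It is illustrative only: the function names are hypothetical, and it assumes the remaining three bytes stay zero and that a size of 0 means "no filter present", neither of which is specified above.

```python
import struct

MAX_FILTER_BYTES = 255                  # largest value one unsigned byte can hold
MAX_FILTER_BITS = MAX_FILTER_BYTES * 8  # 2040 bits

def pack_reserved(filter_size: int) -> bytes:
    """Pack the 4 previously unused header bytes.

    Assumption: byte 0 carries the filter size in bytes (0 = no filter),
    bytes 1..3 remain zero.
    """
    if not 0 <= filter_size <= MAX_FILTER_BYTES:
        raise ValueError("filter size must fit in one byte")
    return struct.pack(">B3x", filter_size)

def unpack_filter_size(reserved: bytes) -> int:
    """Return the filter size in bytes encoded in the 4 reserved bytes."""
    (size,) = struct.unpack(">B3x", reserved)
    return size
```

A reader would read the header, call something like `unpack_filter_size`, and then know how many bytes to read (or skip) before the chunk data begins.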

The bloom filter is built up from any messages in the stream that include a "filter value", a smallish binary string that is used to populate the bloom filter.
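
As a rough illustration of the write side, the sketch below builds a per-chunk bloom filter from the filter values of the messages in that chunk. The class name, the default of 16 bytes, the two hash functions and the use of SHA-256 are all assumptions made for the example, not what osiris uses.

```python
import hashlib

class ChunkBloomFilter:
    """Tiny bloom filter sized in bytes (1..255), as a single size byte allows."""

    def __init__(self, size_bytes: int = 16, num_hashes: int = 2):
        if not 1 <= size_bytes <= 255:
            raise ValueError("size must fit in one byte and be non-zero")
        self.bits = bytearray(size_bytes)
        self.num_bits = size_bytes * 8
        self.num_hashes = num_hashes

    def _positions(self, value: bytes):
        # Derive num_hashes bit positions by salting the hash input (assumed scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def build_chunk_filter(filter_values, size_bytes: int = 16) -> ChunkBloomFilter:
    """Populate a filter from the filter values of one chunk, skipping messages without one."""
    bloom = ChunkBloomFilter(size_bytes)
    for value in filter_values:
        if value is not None:
            bloom.add(value)
    return bloom
```

For example, `build_chunk_filter([b"eu", None, b"us"])` yields a filter for which `might_contain(b"eu")` is true, while a filter built from no values reports every value as absent.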

An offset reader (consumer) can then pass a filter value and have any chunks that do not match it skipped.

Naturally there will be some false positives, as is normal with bloom filters. A single chunk can also contain messages not aimed at the reader, so further filtering will still need to be done outside of osiris. Assuming there aren't too many overlapping filter values, however, this should allow readers to reduce (in some cases substantially) the amount of unwanted data they read.
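
The two-stage behaviour described above (skip at the chunk level, then filter individual messages outside) can be sketched as follows. This is a toy in-memory model in Python, not the osiris_log reader: `Chunk`, `make_chunk` and `read_filtered` are hypothetical names, the filter is a 128-bit integer mask with two SHA-256-derived positions per value, and messages without a filter value are simply left out of the filter, which is only one possible answer to the open questions.

```python
import hashlib
from dataclasses import dataclass

FILTER_BITS = 128  # assumed filter size (16 bytes); the proposal allows up to 2040 bits

def bit_positions(value: bytes, num_hashes: int = 2):
    """Bit positions a filter value maps to (assumed hashing scheme)."""
    return [int.from_bytes(hashlib.sha256(bytes([i]) + value).digest()[:8], "big")
            % FILTER_BITS for i in range(num_hashes)]

def make_filter(values) -> int:
    """Bloom filter as an integer bitmask with the bits of every value set."""
    bits = 0
    for value in values:
        for pos in bit_positions(value):
            bits |= 1 << pos
    return bits

@dataclass
class Chunk:
    bloom: int       # filter that would follow the chunk header
    messages: list   # (filter_value or None, payload) pairs

def make_chunk(messages) -> Chunk:
    return Chunk(make_filter(fv for fv, _ in messages if fv is not None), messages)

def read_filtered(chunks, wanted: bytes):
    """Yield payloads whose filter value equals `wanted`."""
    wanted_mask = make_filter([wanted])
    for chunk in chunks:
        # Stage 1 (broker side): skip chunks that definitely lack the value.
        if (chunk.bloom & wanted_mask) != wanted_mask:
            continue
        # Stage 2 (outside): drop false positives and unrelated messages.
        for filter_value, payload in chunk.messages:
            if filter_value == wanted:
                yield payload
```

With chunks containing `eu`-only, `us`-only and mixed messages, a reader asking for `b"eu"` never receives `us` payloads; whether the `us`-only chunk is skipped at stage 1 depends on the (small) chance of a false positive.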

There are some questions to ponder:

  1. What should be done when a chunk without any filter values is written? Should the bloom filter be empty?
  2. What should be done when a chunk contains a mixture of messages, some with a filter value and some without?

In particular, we need to consider the case where a stream already exists when the bloom filter is enabled: readers may want to filter, but there may still be messages that were written without a bloom filter and that they need to read. At what point can they turn filtering on, etc.? (This is more of a RabbitMQ / feature flag concern.)

Describe alternatives you've considered

No response

Additional context

No response

kjnilsson commented 1 year ago

Completed in #122