rabbitmq / osiris

Log based streaming subsystem for RabbitMQ

Bigger bloom filter size #163

Closed AntonSmolkov closed 4 months ago

AntonSmolkov commented 4 months ago

Hi there!

Currently, the stream's Bloom filter size is limited to 255 bytes. The reason for this limitation is explained in the corresponding article:

So, the larger the filter, the better? Not exactly: even though a Bloom filter is very efficient in terms of storage, as it does not store elements, just whether elements are in the set, the filter size is pre-allocated. If you set the filter size to 255 and each chunk contains at least one message with a filter value, there will be 255 bytes allocated in each chunk header.

Additionally, the .NET client library documentation mentions:

Cardinality of filter values can be from a few to a few thousands. Extreme cardinality (a couple or dozens of thousands) can make filtering less efficient.

In my case, the filter cardinality is significant (dozens or even hundreds of thousands), but the data rate is also enormous (a million-plus requests per second), so I expect the data chunks to be substantial as well. Therefore, an extra 10 KiB per chunk should not make any difference for me.

So, why not allow setting any arbitrary integer for the Bloom filter size? Are there any technical restrictions to consider?

Describe the solution you'd like

I'd like to have the ability to set an arbitrary limit on the Bloom filter size. Perhaps this ability would need to be enabled via a feature flag beforehand.

Describe alternatives you've considered

Additional context

acogoluegnes commented 4 months ago

The filter size is stored in 1 byte, that is why the filter size cannot be more than 255, which is a reasonable Bloom filter size for the use cases we want to address with stream filtering.
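As an illustration of the 1-byte limit (a sketch in Python, not actual Osiris code, which is written in Erlang): an unsigned 1-byte field can only encode values 0..255, while widening the size field to 2 bytes would allow up to 65,535.

```python
import struct

# "B" = unsigned 1-byte integer: the largest encodable filter size is 255.
one_byte_max = struct.unpack("B", struct.pack("B", 255))[0]

# ">H" = big-endian unsigned 2-byte integer: the ceiling becomes 65,535.
two_byte_max = struct.unpack(">H", struct.pack(">H", 65_535))[0]

print(one_byte_max, two_byte_max)  # 255 65535
```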

AntonSmolkov commented 4 months ago

```
%% |0              |1              |2              |3              | Bytes
%% |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7| Bits
%% +---------------+---------------+---------------+---------------+
%% |  Bloom Size   |                   Reserved                    |
%% +---------------------------------------------------------------+
```

If this is indeed the only technical limitation, and I understand it correctly, then utilizing an additional byte from the reserved space would enable a filter size of up to 65,535 bytes (65,535 * 8 = 524,280 bits). Such a filter size would allow approximately 100,000 entries with an error rate of 10%, compared to 388 entries with the current 255-byte filter.
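To make the capacity figures concrete, here is a small sketch (the function name is mine, not part of Osiris) using the standard Bloom filter sizing formula n = -m * (ln 2)^2 / ln p, which assumes an optimal number of hash functions. Exact results vary slightly between calculators depending on rounding assumptions, which is why the 388 figure quoted above differs a little from what this formula gives for 255 bytes.

```python
import math

def bloom_capacity(m_bits: int, p: float) -> int:
    """Approximate number of entries an m-bit Bloom filter can hold
    at false-positive rate p, assuming an optimal hash-function count:
    n = -m * (ln 2)^2 / ln p."""
    return int(-m_bits * (math.log(2) ** 2) / math.log(p))

# Current 1-byte size field (255 bytes) vs. a hypothetical 2-byte field:
print(bloom_capacity(255 * 8, 0.1))     # roughly 400 entries
print(bloom_capacity(65_535 * 8, 0.1))  # roughly 109,000 entries
```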

I have a use case like this: there is an IoT system that collects data from millions (approximately 1 or 2 million) of sensors. We are considering publishing this data to a stream (or a super stream) so that users (microservices) can consume data only from the subsets of sensors they wish to access (in the tens of thousands).

I stumbled upon RabbitMQ Streams and its new filtering functionality and thought it could be a matching solution. Please let me know if this use case is totally exotic for RabbitMQ.

michaelklishin commented 4 months ago

The filter size is stored in 1 byte, that is why the filter size cannot be more than 255, which is a reasonable Bloom filter size for the use cases we want to address with stream filtering.

To add to that, the choice of a single byte field was by design.

michaelklishin commented 4 months ago

It is not "totally exotic" but most users with fleets of devices use MQTT with queues.

We don't have sufficient evidence for much interest in larger cardinality for stream filtering.

michaelklishin commented 4 months ago

@AntonSmolkov going from what Osiris has today to millions of "microstreams" within a stream would require 24 or 32 bytes for the Bloom filter. That's a heck of a jump to consider it to be a no-brainer, and there are only 4 bytes available in the chunk header right now.

You probably should stick to MQTT with queues (see above) for now.

AntonSmolkov commented 4 months ago

@AntonSmolkov going from what Osiris has today to millions of "microstreams" within a stream would require 24 or 32 bytes for the Bloom filter. That's a heck of a jump to consider it to be a no-brainer, and there are only 4 bytes available in the chunk header right now.

Well, given that consumers (applications) would subscribe to only tens of thousands of these "microstreams", the 1 million microstreams could be distributed across 10 (or more) streams using consistent hashing. Using one extra byte seems to be sufficient under these circumstances (~100k entries per filter with an error rate of 10%, as calculated above).
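The partitioning idea can be sketched as follows. This is a hypothetical helper (the function name and SHA-256-based mapping are mine, and real super streams use their own routing): a stable hash assigns each sensor to one of 10 streams, so each stream carries roughly 100k of the 1M microstreams and a per-stream filter sized for ~100k entries suffices.

```python
import hashlib

def stream_for_sensor(sensor_id: str, num_streams: int = 10) -> int:
    """Map a sensor ID to one of `num_streams` streams with a stable hash,
    so every publisher and consumer agrees on the placement."""
    digest = hashlib.sha256(sensor_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_streams

# Distribute 100k sensor IDs and count how many land in each stream;
# a uniform hash keeps the buckets close to 10k each.
counts = [0] * 10
for i in range(100_000):
    counts[stream_for_sensor(f"sensor-{i}")] += 1
```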

It is not "totally exotic" but most users with fleets of devices use MQTT with queues.

I'm concerned that regular queues might not be able to handle such a high ingest rate. In my scenario, the data from the sensors is already collected in a massive Kafka topic; the objective is to distribute subsets of the data (sensors) to dozens of consumers (applications) with minimal traffic and storage overhead. But I'll explore this option further, thank you.

We don't have sufficient evidence for much interest in larger cardinality for stream filtering.

Let this enhancement proposal serve as the first piece of evidence :)