Open yanivbh1 opened 1 year ago
@yanivbh1 Are we going to use this to make pushing events to memphis-broker idempotent (exactly-once delivery semantics)?
Hey @devpahuja,
Not exactly. Idempotency already exists and is in use, but it only goes as deep as the message ID: if the broker finds a duplicate message ID, it drops the message. Deduplication, by contrast, is data-aware and actually searches for messages with a duplicate payload, not just a duplicate ID.
We were thinking of implementing it using a Bloom filter. Would you like to work on it?
Hi @yanivbh1, yes, I would like to work on it.
@idanasulinmemphis do you think @devpahuja can go for it?
Sure. @devpahuja, think of it as data-based dedup rather than msg-id-based dedup. Let's say a producer tries to send the "hello world" message more than once; all duplicates should then be blocked based on the message content.
Things to notice:
msg = payload + headers
duplicate = same payload + same headers
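To make "same payload + same headers" concrete, here is a minimal Go sketch of a content-based key. The function name `contentKey`, the use of SHA-256, and the simplified `map[string]string` header type are all assumptions for illustration, not the broker's actual types or method.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"sort"
)

// contentKey (hypothetical helper) returns a SHA-256 digest over the headers
// and payload of a message. Two messages get the same key only if both their
// headers and their payload match.
func contentKey(payload []byte, headers map[string]string) [32]byte {
	// Sort header names so that map iteration order cannot change the key.
	names := make([]string, 0, len(headers))
	for name := range headers {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	// Length-prefix every field so that field boundaries are unambiguous,
	// e.g. ("ab", "c") and ("a", "bc") hash differently.
	writeField := func(b []byte) {
		var n [8]byte
		binary.BigEndian.PutUint64(n[:], uint64(len(b)))
		h.Write(n[:])
		h.Write(b)
	}
	for _, name := range names {
		writeField([]byte(name))
		writeField([]byte(headers[name]))
	}
	writeField(payload)

	var key [32]byte
	copy(key[:], h.Sum(nil))
	return key
}
```

With this kind of key, two "hello world" messages with identical headers map to the same value regardless of their message IDs, while changing any header or any payload byte yields a different key.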
In order to accomplish this mission you will need to complete the following:
@yanivbh1 @idanasulinmemphis Memphis-broker uses an in-memory map for message idempotency. This can increase the broker's memory usage in a large-scale production environment. Is the in-memory map reliable enough for idempotency? Should we switch to a reliable key-value store such as Redis or Cassandra for the idempotency keys? We could also use Redis for the Bloom filter, since Redis provides a Bloom filter implementation. Or do we want an in-memory Bloom filter implementation in Go?
@idanasulinmemphis Got it. Thanks for the clarification. So we will hash (payload + headers) when deduping with the Bloom filter.
Yes, it is in-memory for performance reasons, and it should stay as it is now. We recommend that users not set too large a time window, since it uses the broker's memory. Regarding the Bloom filter, please implement it with a fixed size (taken from configuration) and clear it every x amount of time (also taken from configuration).
Meaning
Data deduplication is a technique for eliminating duplicate copies of repeating data. When implemented right, it can increase performance significantly.
Memphis potential implementation
The idea is to implement a reliable Bloom filter. A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set: it can return false positives but never false negatives. Bloom filters are commonly used to speed up lookups in key-value storage systems.
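For a sense of how space-efficient this is, the standard sizing formulas can be sketched in Go: for n expected items and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = (m/n)·ln 2. The helper name `bloomSize` is hypothetical.

```go
package main

import "math"

// bloomSize (hypothetical helper) returns the bit count m and hash count k
// for n expected items at a target false-positive rate p, using the standard
// formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
// It returns (0, 0) for inputs outside n > 0 and 0 < p < 1.
func bloomSize(n uint64, p float64) (m, k uint64) {
	if n == 0 || p <= 0 || p >= 1 {
		return 0, 0
	}
	bits := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	hashes := math.Round(bits / float64(n) * math.Ln2)
	if hashes < 1 {
		hashes = 1
	}
	return uint64(bits), uint64(hashes)
}
```

For example, one million distinct messages at a 1% false-positive rate need roughly 9.6 million bits (about 1.2 MB) and 7 hash functions, independent of how large each message is.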