superstreamlabs / memphis

Memphis.dev is a highly scalable and effortless data streaming platform
https://docs.memphis.dev

Deduplication #654

Open yanivbh1 opened 1 year ago

yanivbh1 commented 1 year ago

Meaning

Data deduplication is a technique for eliminating duplicate copies of repeating data. When implemented correctly, it can improve performance significantly.

Memphis potential implementation

Memphis could implement deduplication with a highly reliable Bloom filter. A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can return false positives, but never false negatives. Bloom filters are commonly used to speed up lookups in key-value storage systems.
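To make the idea concrete, here is a minimal Bloom filter sketch in Go. It is only an illustration, not Memphis code: the `BloomFilter` type, its methods, and the choice of FNV hashes with double hashing are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a hypothetical fixed-size bit set probed at k positions per item.
type BloomFilter struct {
	bits []uint64 // bit set, 64 bits per word
	m    uint64   // total number of bits (must be > 0)
	k    uint64   // number of probes per item
}

// NewBloomFilter creates a filter with m bits and k probes per item.
func NewBloomFilter(m, k uint64) *BloomFilter {
	return &BloomFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions derives k bit indexes from two FNV hashes (double hashing).
func (b *BloomFilter) positions(data []byte) []uint64 {
	h1 := fnv.New64a()
	h1.Write(data)
	h2 := fnv.New64()
	h2.Write(data)
	start, step := h1.Sum64(), h2.Sum64()|1 // odd step, so it is never zero
	out := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		out[i] = (start + i*step) % b.m
	}
	return out
}

// Add sets every bit that belongs to data.
func (b *BloomFilter) Add(data []byte) {
	for _, p := range b.positions(data) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain returns false if data was definitely never added,
// and true if it was added or collides with earlier items (false positive).
func (b *BloomFilter) MayContain(data []byte) bool {
	for _, p := range b.positions(data) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloomFilter(1<<16, 4)
	f.Add([]byte("hello world"))
	fmt.Println(f.MayContain([]byte("hello world"))) // prints "true"
}
```

Because a false positive is possible, a deduplication layer built on this would occasionally drop a message that is not actually a duplicate; the bit count and probe count control how rare that is.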

devpahuja commented 1 year ago

@yanivbh1 Are we going to use this to make pushing events to memphis-broker idempotent (exactly-once delivery semantics)?

yanivbh1 commented 1 year ago

Hey @devpahuja, not exactly. Idempotency already exists and is in use, but it only goes as deep as the message ID: if the broker finds a duplicate message ID, it drops the message. Deduplication, by contrast, is data-aware and actually searches for messages with a duplicate payload, not just a duplicate ID. We were thinking of implementing it using a Bloom filter. Would you like to work on it?

devpahuja commented 1 year ago

Hi @yanivbh1, yes, I would like to work on it.

yanivbh1 commented 1 year ago

@idanasulinmemphis do you think @devpahuja can go for it?

idanasulin2706 commented 1 year ago

Sure, @devpahuja. Think of it as data-based dedup rather than msg-id-based dedup. Let's say a producer tries to send the "hello world" message more than once; all duplicates should then be blocked based on the message content.

Things to notice:

- msg = payload + headers
- duplicate = same payload + same headers

In order to accomplish this mission you will need to complete the following:

devpahuja commented 1 year ago

@yanivbh1 @idanasulinmemphis memphis-broker uses an in-memory map for message idempotency. This can increase the broker's memory usage in a large-scale production environment. Is an in-memory map reliable enough for idempotency? Should we switch to a dedicated key-value store like Redis/Cassandra for the idempotency message keys? We could also use Redis for Bloom filters, since Redis provides a Bloom filter implementation. Or do we want an in-memory Bloom filter implementation in Go?

@idanasulinmemphis Got it. Thanks for the clarification. So we will hash (payload + headers) when deduplicating with the Bloom filter.
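A rough sketch of what hashing (payload + headers) could look like in Go. The `ContentKey` function, the use of SHA-256, and the `map[string]string` header type are assumptions for illustration, not the broker's actual message representation. Header keys are sorted and every field is length-prefixed so that the same message always yields the same key regardless of header order.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// ContentKey is a hypothetical helper that returns a hex digest over the
// headers and payload of a message, to be used as the dedup key.
func ContentKey(payload []byte, headers map[string]string) string {
	h := sha256.New()

	// writeField length-prefixes each field so that ("ab", "c") and
	// ("a", "bc") cannot produce the same byte stream.
	writeField := func(b []byte) {
		var n [8]byte
		binary.BigEndian.PutUint64(n[:], uint64(len(b)))
		h.Write(n[:])
		h.Write(b)
	}

	// Go map iteration order is random, so sort the keys first.
	keys := make([]string, 0, len(headers))
	for k := range headers {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		writeField([]byte(k))
		writeField([]byte(headers[k]))
	}
	writeField(payload)

	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := ContentKey([]byte("hello world"), map[string]string{"source": "app-1", "type": "greeting"})
	b := ContentKey([]byte("hello world"), map[string]string{"type": "greeting", "source": "app-1"})
	fmt.Println(a == b) // prints "true": same payload and same headers
}
```

The resulting key (or its raw bytes) is what would be fed into the Bloom filter, so two messages count as duplicates only when both payload and headers match.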

idanasulin2706 commented 1 year ago

Yes, it is in-memory for performance reasons and it should stay as it is. We recommend that users not set too large a time window, since it uses the broker's memory. Regarding the Bloom filter, please implement it with a fixed size (taken from configuration) and clear it at a regular interval (also taken from configuration).
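One possible shape for this, as a sketch only: the `Config` fields and the `Deduplicator` type below are hypothetical names, not existing Memphis configuration keys or broker code. The filter has a fixed number of bits and is wiped once the configured interval has elapsed. In this sketch the wipe happens lazily on the next call rather than from a background timer, which is a simplification chosen for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"time"
)

// Config holds hypothetical settings for the dedup filter.
type Config struct {
	Bits          uint64        // fixed filter size in bits (must be > 0)
	Hashes        uint64        // probes per message
	ResetInterval time.Duration // how often the filter is cleared
}

// Deduplicator is a fixed-size Bloom filter that clears itself periodically.
type Deduplicator struct {
	mu        sync.Mutex
	cfg       Config
	bits      []uint64
	lastReset time.Time
}

func NewDeduplicator(cfg Config) *Deduplicator {
	return &Deduplicator{
		cfg:       cfg,
		bits:      make([]uint64, (cfg.Bits+63)/64),
		lastReset: time.Now(),
	}
}

// Seen reports whether key was probably already seen in the current
// window, and records it either way.
func (d *Deduplicator) Seen(key []byte) bool {
	d.mu.Lock()
	defer d.mu.Unlock()

	// Clear the filter once the configured interval has passed.
	if now := time.Now(); now.Sub(d.lastReset) >= d.cfg.ResetInterval {
		for i := range d.bits {
			d.bits[i] = 0
		}
		d.lastReset = now
	}

	// Derive the probe positions from two FNV hashes (double hashing).
	h1 := fnv.New64a()
	h1.Write(key)
	h2 := fnv.New64()
	h2.Write(key)
	start, step := h1.Sum64(), h2.Sum64()|1

	seen := true
	for i := uint64(0); i < d.cfg.Hashes; i++ {
		p := (start + i*step) % d.cfg.Bits
		word, mask := p/64, uint64(1)<<(p%64)
		if d.bits[word]&mask == 0 {
			seen = false // at least one bit was unset, so the key is new
			d.bits[word] |= mask
		}
	}
	return seen
}

func main() {
	d := NewDeduplicator(Config{Bits: 1 << 20, Hashes: 4, ResetInterval: time.Minute})
	fmt.Println(d.Seen([]byte("hello world"))) // prints "false": first copy
	fmt.Println(d.Seen([]byte("hello world"))) // prints "true": duplicate in the same window
}
```

Memory stays bounded at `Bits / 8` bytes no matter how many messages pass through, and the periodic clear keeps the false-positive rate from growing without limit. The trade-off is that a duplicate arriving right after a clear is not detected.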