larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Streaming data deduplication #265

Open sridharpattem opened 6 years ago

sridharpattem commented 6 years ago

Hi, is it possible to check for duplicates within an unbounded streaming data set, checking not against another static data source but against the data that has streamed in so far?

The flow is as follows.

Source Database ---> CDC ---> Kafka ---> Stream Processing (invoke Duke for duplicate check) ---> Target Database

I would like to build the index as data streams in from the CDC, keep adding new data to the index, and at the same time search the index for each incoming message. How can this be done? Or do we always need at least two static data sets to find duplicates?

Thank you.

larsga commented 6 years ago

Yes, this is possible, if you index the records as you process them. The most efficient approach is to take some batch of records (say 10,000 records), index them all, commit the index, then search for duplicates. The API has methods for this.
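
The index, commit, then search order described above can be sketched as follows. This is a minimal illustration and not Duke's API: `MicroBatchDedup`, `Rec`, `findEarlierMatches` and `processBatch` are hypothetical names, and exact matching on a normalized name stands in for Duke's configurable fuzzy comparison. When using Duke itself, the equivalent calls should be checked against its own documentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for a searchable index (hypothetical, NOT Duke's API). */
class MicroBatchDedup {
    /** Hypothetical record: an id plus the single field compared on. */
    static final class Rec {
        final String id;
        final String name;
        Rec(String id, String name) { this.id = id; this.name = name; }
    }

    // Searchable entries: normalized name -> record ids in arrival order.
    private final Map<String, List<String>> committed = new HashMap<>();
    // Records indexed since the last commit; invisible to searches until then.
    private final List<Rec> pending = new ArrayList<>();

    private static String key(Rec r) {
        return r.name.trim().toLowerCase(Locale.ROOT);
    }

    void index(Rec r) { pending.add(r); }

    /** Makes everything indexed so far visible to searches. */
    void commit() {
        for (Rec r : pending) {
            committed.computeIfAbsent(key(r), k -> new ArrayList<>()).add(r.id);
        }
        pending.clear();
    }

    /** Ids of committed records that share r's key and arrived before r. */
    List<String> findEarlierMatches(Rec r) {
        List<String> ids = committed.getOrDefault(key(r), Collections.<String>emptyList());
        int pos = ids.indexOf(r.id);
        return new ArrayList<>(ids.subList(0, pos < 0 ? ids.size() : pos));
    }

    /** Index the whole batch, commit once, then search for each record. */
    List<String> processBatch(List<Rec> batch) {
        for (Rec r : batch) index(r);
        commit();
        List<String> pairs = new ArrayList<>();
        for (Rec r : batch) {
            for (String other : findEarlierMatches(r)) {
                pairs.add(r.id + " duplicates " + other);
            }
        }
        return pairs;
    }
}
```

Calling `processBatch` once per batch read from the stream compares every record both with the rest of its own batch and with everything committed by earlier batches, so no second static data set is needed. Committing once per batch rather than once per record is the point of the batching.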

ashubitm commented 6 years ago

I have a similar scenario where I have to dedupe records coming in as streams against Couchbase data as quickly as possible. Is there a Couchbase data source that can use the index and call findCandidateMatches() against Couchbase for quick deduping?

uderline commented 6 years ago

Hi @sridharpattem

I had the same issue with a data flow DB -> NiFi -> Logstash -> Elastic. I basically made an Elastic plugin. If you want an idea of how to integrate Duke into your code, feel free to take a look: https://github.com/minibigio/miniduke/blob/d0b51619cf2f080348a2f17f6c7932ce3617f89c/src/main/java/io/minibig/miniduke/ingest/MinidukeProcessor.java#L143

Good luck!

@larsga I have a question concerning the batch size. If you have no idea how many records you are going to receive, what value do you assign? How much does it matter if the batch size is too high?

Thanks