sridharpattem opened this issue 6 years ago
Yes, this is possible, if you index the records as you process them. The most efficient approach is to take some batch of records (say 10,000 records), index them all, commit the index, then search for duplicates. The API has methods for this.
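For illustration, here is a minimal sketch of that batch flow. It assumes Duke's `Database` interface with `index()`, `commit()`, and `findCandidateMatches()`, and that the configuration is loaded via `ConfigLoader`; `nextBatch()` and the config file name are hypothetical stand-ins for your own source reader and setup:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Database;
import no.priv.garshol.duke.Record;

public class BatchDedup {
  private static final int BATCH_SIZE = 10000;

  public static void main(String[] args) throws Exception {
    // "dedup.xml" is a placeholder for your Duke configuration file
    Configuration config = ConfigLoader.load("dedup.xml");
    Database db = config.getDatabase(true); // true = overwrite any existing index

    List<Record> batch = nextBatch(BATCH_SIZE);
    while (!batch.isEmpty()) {
      // index the whole batch first...
      for (Record r : batch)
        db.index(r);
      db.commit();

      // ...then look up candidate duplicates for each record in the batch
      for (Record r : batch) {
        Collection<Record> candidates = db.findCandidateMatches(r);
        // compare r against each candidate using your configured
        // properties/comparators and handle the matches
      }
      batch = nextBatch(BATCH_SIZE);
    }
  }

  private static List<Record> nextBatch(int size) {
    // hypothetical: read the next chunk of records from your source
    return Collections.emptyList();
  }
}
```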
I have a similar scenario where I have to dedupe records arriving in streams against Couchbase data as quickly as possible. Is there a Couchbase data source that can use the index and call findCandidateMatches() against Couchbase for quick deduping?
Hi @sridharpattem
I had the same issue with a data flow DB -> NiFi -> Logstash -> Elastic, and I basically made an Elastic plugin. If you want an idea of how to embed Duke in your code, feel free to look: https://github.com/minibigio/miniduke/blob/d0b51619cf2f080348a2f17f6c7932ce3617f89c/src/main/java/io/minibig/miniduke/ingest/MinidukeProcessor.java#L143
Good luck!
@larsga I have a question concerning the batch size. If you have no idea how many records you are going to receive, what value do you assign? How much does it matter if the batch size is too high?
Thanks
Hi, is it possible to check for duplicates within an unbounded streaming data set, not against another static data source but against the data that has streamed in so far?
The flow is as follows.
Source Database -> CDC -> Kafka -> Stream Processing (invoke Duke for the duplicate check) -> Target Database
I would like to build the index as data streams in from the CDC, keep adding new records to the index, and search the index at the same time for each incoming message. What is the way to do this? Or do we always need at least two static data sets to find duplicates?
Thank you.
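For what it's worth, a minimal per-message sketch under the same assumptions as the batch example above (Duke's `Database` interface and `ConfigLoader`); how the Kafka message is mapped to a Duke `Record` is left out, and the comparison/routing logic is only indicated in comments. As larsga notes above, committing on every message is expensive, so in practice you would batch the commits:

```java
import java.util.Collection;

import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Database;
import no.priv.garshol.duke.Record;

public class StreamingDedup {
  private final Configuration config;
  private final Database db;

  public StreamingDedup(String configFile) throws Exception {
    config = ConfigLoader.load(configFile);
    db = config.getDatabase(false); // false = keep the existing index across restarts
  }

  // Called once per message coming off the Kafka topic,
  // with the message already mapped to a Duke Record.
  public void onMessage(Record incoming) {
    // search the index built from everything streamed in so far
    Collection<Record> candidates = db.findCandidateMatches(incoming);
    for (Record candidate : candidates) {
      // compare incoming vs. candidate with your configured comparators
      // and keep duplicates out of the target database
    }

    // then add the new record so later messages can match against it
    db.index(incoming);
    db.commit(); // per-message commits are costly; batch them in practice
  }
}
```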