larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Dedupe on Couchbase for real time streaming json (flink) #262

Open ashubitm opened 5 years ago

ashubitm commented 5 years ago

Hi, I am trying to dedupe real-time streaming JSON with Couchbase as the destination. I am trying to make the dedupe call from Flink but have not been able to get it working. Can you please help with a config file for Couchbase and with how to call larsga/Duke from Flink?

uderline commented 5 years ago

Hi, there is no Couchbase data source, but you could write your own based on the existing data sources. I am not familiar with Couchbase or Flink. Could you query Couchbase and then run the deduplication on the returned JSON? To call the deduplication function:

ashubitm commented 5 years ago

Thanks. What I am trying to do is process a stream of JSON documents (source) against a Couchbase DB (destination). Fetching the JSON from Couchbase may not be a great idea from a performance point of view, and how many records to pull is another question. If MongoDB can be a direct destination, why not Couchbase? Any help with this would be great. My goal is that if a record in the stream duplicates one already present in the destination, it should not be saved.

uderline commented 5 years ago

Sorry, I misread your first post. JSON streaming should not be a problem if you're using Flink or any other tool designed for streaming. MongoDB has a data source connection but not a destination connection. Matching records are always saved in the match listener, whereas records with no matches are not saved. You will need to write the destination connection yourself.
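A minimal sketch of the shape such a destination connection could take, following the goal stated earlier in the thread (save only records that have no duplicate in the destination). Everything here is an assumption for illustration: `DedupListener`, `Destination`, `InMemoryDestination` and `SaveUnmatchedListener` are hypothetical names, not Duke's or Couchbase's API, and the in-memory list only stands in for a real Couchbase write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DestinationSketch {

    /** Hypothetical callback interface; NOT Duke's actual listener API. */
    interface DedupListener {
        // Called when an incoming record matches one already in the destination.
        void matches(Map<String, String> incoming, Map<String, String> existing, double confidence);

        // Called when an incoming record has no match in the destination.
        void noMatchFor(Map<String, String> incoming);
    }

    /** Hypothetical destination abstraction; a real one would write to Couchbase. */
    interface Destination {
        void save(Map<String, String> record);
    }

    /** In-memory stand-in for the destination (assumption, for illustration only). */
    static class InMemoryDestination implements Destination {
        final List<Map<String, String>> saved = new ArrayList<>();

        @Override
        public void save(Map<String, String> record) {
            saved.add(record);
        }
    }

    /** Saves only records without a match; duplicates are counted and skipped. */
    static class SaveUnmatchedListener implements DedupListener {
        private final Destination destination;
        int skipped = 0;

        SaveUnmatchedListener(Destination destination) {
            this.destination = destination;
        }

        @Override
        public void matches(Map<String, String> incoming, Map<String, String> existing, double confidence) {
            skipped++; // duplicate of an existing record: do not save it again
        }

        @Override
        public void noMatchFor(Map<String, String> incoming) {
            destination.save(incoming); // new record: persist it
        }
    }
}
```

In a streaming job, each incoming JSON record would be run through the deduplication step and the listener would decide whether it reaches the destination. How that step is wired into Flink is left open here.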

kyriehan89 commented 4 years ago

Hi @ashubitm,

I have the same use case: I need to dedupe against Couchbase. Have you found a solution?

Thanks