larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

What is the accuracy rate of duke? like out of 100% #261

Closed: Anishx closed this issue 5 years ago

Anishx commented 5 years ago

I didn't know any other way to ask than opening an issue. Can I use Duke for enterprise projects? And how accurate is Duke at patient matching and linking, on a 0-100% scale?

larsga commented 5 years ago

Unfortunately, there is no single answer to this. It depends on your data, and also on how you set Duke up. So anything from 0 to 100% is possible, depending on the case.

In reality, the state of input data tends to mean that absolute accuracy is impossible, unless you have human beings who physically track down the patients and talk to them. The root problem is that the algorithm is working in a situation where a lot of information is missing.

Note that there is a trade-off in record linkage: if you want to find all correct matches you will most likely have to accept some percentage of false matches. By raising the threshold and tuning the configuration you can reduce the percentage of false matches, probably at the cost of losing some correct matches.
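To make the trade-off concrete, here is a minimal sketch in plain Java. It does not use Duke's API; the class name, the scores, and the ground-truth labels are all made up for illustration, and it assumes a pair is accepted when its score is at or above the threshold.

```java
public class ThresholdTradeoff {
    // Fraction of accepted pairs (score >= threshold) that are true matches.
    public static double precision(double[] scores, boolean[] isTrueMatch, double threshold) {
        int accepted = 0;
        int truePositives = 0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= threshold) {
                accepted++;
                if (isTrueMatch[i]) {
                    truePositives++;
                }
            }
        }
        // With nothing accepted there are no false matches; report 1.0 by convention.
        return accepted == 0 ? 1.0 : (double) truePositives / accepted;
    }

    // Fraction of true matches that were accepted (score >= threshold).
    public static double recall(double[] scores, boolean[] isTrueMatch, double threshold) {
        int actualMatches = 0;
        int truePositives = 0;
        for (int i = 0; i < scores.length; i++) {
            if (isTrueMatch[i]) {
                actualMatches++;
                if (scores[i] >= threshold) {
                    truePositives++;
                }
            }
        }
        return actualMatches == 0 ? 1.0 : (double) truePositives / actualMatches;
    }

    public static void main(String[] args) {
        // Hypothetical match scores and hand-checked labels, not real Duke output.
        double[] scores = {0.95, 0.90, 0.85, 0.75, 0.70, 0.60};
        boolean[] truth = {true, true, false, true, false, false};
        for (double t : new double[] {0.7, 0.9}) {
            System.out.printf("threshold=%.1f precision=%.2f recall=%.2f%n",
                    t, precision(scores, truth, t), recall(scores, truth, t));
        }
    }
}
```

On this toy data, a threshold of 0.7 finds all three true matches but also accepts two false ones (precision 0.6, recall 1.0), while 0.9 accepts no false matches but misses one true match (precision 1.0, recall 2/3).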

uderline commented 5 years ago

Hi! I personally use Duke for an enterprise project that is going to be used by a city hall in France, so I hope it works ;) As @larsga said, you can tune it to get fewer false matches. For our linkage use case we decided to lower the probabilities on the properties and to use the maybe-threshold (we call those matches "suspicious"), and then ask a human to check the suspicious matches. In any case, we think that before any use of the information - and especially if it's medical/sensitive or personal info - everything should be checked by a human before export.
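The review workflow described above can be sketched as a three-way decision on the match confidence. This is plain Java, not Duke's API: the class, the enum, and the threshold values are hypothetical, and the sketch assumes two cut-offs, an upper threshold for automatic matches and a lower "maybe" threshold for pairs that go to a human.

```java
public class MatchTriage {
    public enum Decision { MATCH, SUSPICIOUS, NO_MATCH }

    // Route a scored pair: accept it, send it to human review, or reject it.
    public static Decision classify(double confidence, double threshold, double maybeThreshold) {
        if (maybeThreshold > threshold) {
            throw new IllegalArgumentException("maybeThreshold must not exceed threshold");
        }
        if (confidence >= threshold) {
            return Decision.MATCH;
        }
        if (confidence >= maybeThreshold) {
            return Decision.SUSPICIOUS; // queue for a human reviewer
        }
        return Decision.NO_MATCH;
    }

    public static void main(String[] args) {
        // Hypothetical cut-offs; real values depend on the data and configuration.
        double threshold = 0.85;
        double maybeThreshold = 0.65;
        for (double confidence : new double[] {0.92, 0.70, 0.40}) {
            System.out.println(confidence + " -> " + classify(confidence, threshold, maybeThreshold));
        }
    }
}
```

Widening the gap between the two cut-offs sends more pairs to review, which costs reviewer time but leaves fewer decisions to the algorithm alone.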

Anishx commented 5 years ago

@larsga @uderline Thanks for replying ... @uderline, if I may ask, do you have a code example you can share for the use case you mentioned?

uderline commented 5 years ago

@Anishx What do you mean? A configuration file example? It's not going to help you more than the Duke documentation, because it really depends on your data. Also, I added my own comparators, so it's not really going to help. Still, if you really want an example with values, I made an IT file for my project where you can see the properties. The constructor has the most important part of the configuration; reset() will give you the rest of it. The data to link is in each method, and the test file is called miniduke_test.json

Anishx commented 5 years ago

@uderline Thank you! That's all I needed! So I'll close the issue.