kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Measure distance to nearest group #57

Open shamahutoto opened 3 years ago

shamahutoto commented 3 years ago

Hi there,

I want to find items that aren't matched but were just under the threshold for matching with a group. Is there a way to do this?

aalexandersson commented 3 years ago

Disclaimer: I am a regular fastLink user, not a developer.

Please give an example to make the issue easier to understand.

For example, this copy-pasted code keeps only the matches at a threshold of 0.85 and above:

matched_dfs <- getMatches(
  dfA = dfA, dfB = dfB, 
  fl.out = matches.out, threshold.match = 0.85
)

I guess that you need to subset with blocking, which is doable but more complicated. The developers are working on improving the blocking functionality.

tedenamorado commented 3 years ago

Hi @shamahutoto,

As @aalexandersson mentions, one idea here would be to lower the matching threshold. By default, fastLink only returns pairs of records with a matching probability larger than 0.85. You can lower that value (e.g., to 0.001) to recover every pair with a matching probability above the new value, which gives you a larger set of pairs than the default does. However, I would not recommend going too low: you will get pairs of records whose matching probability is basically 0, and if the datasets you are matching are large, the fastLink objects will be incredibly large.
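
A minimal sketch of this idea, assuming the fastLink output exposes `matches$inds.a`, `matches$inds.b`, and a `posterior` vector aligned row by row with `matches`. The helper `near_misses()` and the 0.50 lower bound are hypothetical choices for illustration, not part of the package, and the toy values are made up. It keeps the pairs whose matching probability falls just below the default 0.85 cutoff:

```r
# Hypothetical helper (not part of fastLink): keep pairs whose posterior
# matching probability lies in [lower, upper), highest probability first.
near_misses <- function(inds_a, inds_b, posterior, lower = 0.50, upper = 0.85) {
  stopifnot(length(inds_a) == length(posterior),
            length(inds_b) == length(posterior))
  keep <- posterior >= lower & posterior < upper
  out <- data.frame(row_a = inds_a[keep],
                    row_b = inds_b[keep],
                    posterior = posterior[keep])
  out[order(-out$posterior), , drop = FALSE]
}

# Toy values, made up for illustration only.
toy <- near_misses(inds_a = c(1, 2, 3, 4),
                   inds_b = c(7, 5, 9, 2),
                   posterior = c(0.99, 0.80, 0.02, 0.60))
print(toy)

# With fastLink installed, the same filter can be applied to a run that
# uses a lowered threshold. Assumption: fl_low$posterior is aligned with
# the rows of fl_low$matches.
if (requireNamespace("fastLink", quietly = TRUE)) {
  library(fastLink)
  data(samplematch)  # sample dfA and dfB shipped with the package
  fl_low <- fastLink(
    dfA = dfA, dfB = dfB,
    varnames = c("firstname", "middlename", "lastname", "housenum",
                 "streetname", "city", "birthyear"),
    stringdist.match = c("firstname", "middlename", "lastname",
                         "streetname", "city"),
    partial.match = c("firstname", "lastname", "streetname"),
    threshold.match = 0.50  # lowered from the default 0.85
  )
  print(near_misses(fl_low$matches$inds.a,
                    fl_low$matches$inds.b,
                    fl_low$posterior))
}
```

A moderate lower bound such as 0.50, rather than 0.001, keeps the returned object small while still surfacing the pairs that narrowly missed the cutoff.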

If anything, let us know.

All my best,

Ted