kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

question / documentation #25

Open mikejacktzen opened 6 years ago

mikejacktzen commented 6 years ago

hi, this package seems great and much needed.

i had 2 questions, whose answers could strengthen the documentation / articles / vignettes.

  1. It seems like the workflow is designed to separate the 'blocking' step from the later 'estimation' step. Is this correct?

This would be a point to highlight as a strength.

One pain point in using the other package, RecordLinkage (https://cran.r-project.org/web/packages/RecordLinkage/index.html), is that the blocking step was tied to the estimation step in a single overarching function. When I was using RecordLinkage, I wished it were broken out into two functions:

form_blocks()
estimate_params()

  2. Are the performance metrics output by the summary() method 'conditional' or 'unconditional'?

That is, do the metrics apply only to the cases that remain after blocking, or to all cases, unconditional on the blocking?

"Sensitivity (%)",
 "Specificity (%)",
 "Positive Predicted Value (%)",
 "False Positive Rate (%)",
 "False Negative Rate (%)",
 "Correctly Clasified 

This was another pain point of RecordLinkage: the performance metrics applied only to the conditional cases, i.e. those remaining after blocking.
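To make the conditional/unconditional distinction concrete, here is a minimal base R sketch. All counts are made up for illustration and are not output from RecordLinkage or fastLink; the variable names are hypothetical.

```r
# Hypothetical counts, invented for illustration (not package output).
true_matches_total    <- 100  # true matches among all record pairs
true_matches_in_block <- 80   # true matches that survive blocking
true_matches_found    <- 72   # of those, pairs the model declares a match

# Conditional sensitivity: only pairs that survived blocking are considered.
sens_conditional <- 100 * true_matches_found / true_matches_in_block

# Unconditional sensitivity: true matches removed by blocking count as misses.
sens_unconditional <- 100 * true_matches_found / true_matches_total

print(c(conditional = sens_conditional, unconditional = sens_unconditional))
```

The conditional figure is never lower than the unconditional one, because any true match that blocking discards drops out of the conditional denominator.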

tedenamorado commented 6 years ago

Hi,

Thanks a lot for your great feedback! We are working on a vignette for fastLink with many examples and practical advice. We will definitely incorporate the points you raise.

Let me try to answer the questions you raise:

  1. It seems like the workflow is designed to separate the 'blocking' step from the later 'estimation' step. Is this correct?

Exactly! We designed fastLink as a solution to what you do after blocking. In our workflow, those are two separate problems. We have some functions that you can use for blocking, and we are in the process of incorporating new ones. We will make an announcement soon, as we plan a major release with new and improved functions for fastLink.

  2. Are the performance metrics output by the summary() method 'conditional' or 'unconditional'?

The summary function is conditional if you decide to perform your merge on blocked data. The summary function will be deprecated soon. We have produced a new function called confusion() that presents the same information but has the advantage that you can aggregate the results from many blocks to make your results unconditional. Please take a look at this discussion where a similar point was raised.
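As a rough illustration of what aggregating across blocks means arithmetically, the base R sketch below pools per-block confusion counts before computing the metrics. It does not call confusion() or any other fastLink function; the counts, the block names, and the metrics() helper are all hypothetical.

```r
# Hypothetical per-block confusion counts (invented; not fastLink output).
blocks <- list(
  male   = c(TP = 40, FP = 2, FN = 5, TN = 953),
  female = c(TP = 32, FP = 4, FN = 3, TN = 961)
)

# Hypothetical helper: headline metrics (in percent) from one vector of counts.
metrics <- function(n) {
  c(sensitivity = 100 * n[["TP"]] / (n[["TP"]] + n[["FN"]]),
    specificity = 100 * n[["TN"]] / (n[["TN"]] + n[["FP"]]),
    ppv         = 100 * n[["TP"]] / (n[["TP"]] + n[["FP"]]))
}

# Per-block metrics: one column per block.
per_block <- sapply(blocks, metrics)

# Pooled metrics: sum the counts across blocks first, then compute.
pooled_counts <- Reduce(`+`, blocks)
pooled <- metrics(pooled_counts)

print(round(per_block, 1))
print(round(pooled, 1))
```

Summing counts before dividing weights each block by its size, which in general differs from averaging the per-block percentages.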

If anything, please do not hesitate to contact us!

Ted

aalexandersson commented 6 years ago

Thanks, this is great news!

While RecordLinkage and fastLink can produce different outputs, some important differences are due to the workflow. I think some of the distinctions made above were overstated. For example, both packages separate the 'blocking' step from the later 'estimation' step. Let me explain the workflow differences I have in mind by referring to AutoMatch, which is the traditional reference software in probabilistic record linkage. Soon to be replaced with fastLink as the new standard :-)

I guess that the typical RecordLinkage workflow is conditional matching (aka matching with replacement) -- similar to AutoMatch. In contrast, it seems to me that the typical fastLink usage is unconditional matching (aka matching without replacement) -- unlike AutoMatch. For example, in RecordLinkage gender could be used both as a blocking variable (e.g., in pass 1) and as a matching variable (e.g., in pass 2). In contrast, in fastLink gender would be used either for blocking (e.g., on males in pass 1 and on females in pass 2) or for matching, not for both.

Because of the importance of gender for blocking in fastLink, I suggest it would be very helpful if at least one example in the vignette includes blocking on gender.

Anders

mikejacktzen commented 6 years ago

I agree with @aalexandersson that the ideal, flexible situation is to allow a field to be used optionally in both:

  1. form_blocks(gender)
  2. estimate_params(gender)

I would suggest a design in which 1 and 2 can be used modularly, as in estimate_params(field = gender, blocks = form_blocks(gender))

Although RecordLinkage allows the option of both, you have to programmatically request it in one step: umbrella_block_and_estimate(param=gender,block=gender)

See the compare.dedup() example in the R Journal article linked here, which acts as my umbrella_block_and_estimate() above: https://journal.r-project.org/archive/2010/RJ-2010-017/RJ-2010-017.pdf

Having one single function do both was a bad design choice.

Specifically, compare.dedup() did not easily allow a user to interrogate different blocking strategies.
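The modular design suggested above could look roughly like the base R sketch below. form_blocks() and estimate_params() are the hypothetical names used in this thread, not functions from RecordLinkage or fastLink. The data are toy records, and the "estimation" step is only a stand-in (a simple agreement rate), not a real probabilistic linkage model.

```r
# Toy records, invented for illustration.
dfA <- data.frame(gender    = c("F", "M", "F", "M"),
                  city      = c("Oslo", "Oslo", "Bergen", "Bergen"),
                  birthyear = c(1980, 1975, 1980, 1990),
                  stringsAsFactors = FALSE)
dfB <- data.frame(gender    = c("F", "F", "M", "M"),
                  city      = c("Oslo", "Bergen", "Oslo", "Bergen"),
                  birthyear = c(1980, 1981, 1975, 1990),
                  stringsAsFactors = FALSE)

# Step 1 (hypothetical): keep only the candidate pairs that agree on `field`.
form_blocks <- function(dfA, dfB, field) {
  pairs <- expand.grid(a = seq_len(nrow(dfA)), b = seq_len(nrow(dfB)))
  pairs[dfA[[field]][pairs$a] == dfB[[field]][pairs$b], ]
}

# Step 2 (hypothetical stand-in for estimation): share of candidate pairs
# that agree on `field`. A real model would estimate match probabilities.
estimate_params <- function(dfA, dfB, field, blocks) {
  mean(dfA[[field]][blocks$a] == dfB[[field]][blocks$b])
}

# Because the steps are separate, blocking strategies are easy to swap.
by_gender <- form_blocks(dfA, dfB, "gender")
by_city   <- form_blocks(dfA, dfB, "city")
rate_gender_blocks <- estimate_params(dfA, dfB, "birthyear", by_gender)
rate_city_blocks   <- estimate_params(dfA, dfB, "birthyear", by_city)

# The same field can also be passed to both steps, as suggested above.
rate_same_field <- estimate_params(dfA, dfB, field = "gender",
                                   blocks = form_blocks(dfA, dfB, "gender"))
print(c(rate_gender_blocks, rate_city_blocks, rate_same_field))
```

In this toy version, using gender for both steps gives an agreement rate of 1, since every candidate pair already agrees on gender inside its own block.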