Open mikejacktzen opened 6 years ago
Hi,
Thanks a lot for your great feedback! We are working on a vignette for fastLink with many examples and practical advice. We will definitely incorporate the points you raise.
Let me try to answer the questions you raise:
Exactly! We designed fastLink as a solution to the question of what to do after blocking. In our workflow, those are two separate problems. We have some functions that you can use for blocking, and we are working on incorporating new ones. We will make an announcement soon, as we plan a major release with new and improved functions for fastLink.
The summary function is conditional if you decide to perform your merge on blocked data. The summary function will be deprecated soon. We have produced a new function called confusion() that presents the same information but has the advantage that you can aggregate the results from many blocks to make your results unconditional. Please take a look at this discussion where a similar point was raised.
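To make the idea of aggregating results from many blocks concrete, here is a minimal sketch. It is written in Python rather than R, it is not fastLink's API, and the names BlockCounts and aggregate as well as all counts are hypothetical, chosen only for illustration. It assumes each block yields (estimated) counts of true positives, false positives, and false negatives.

```python
from dataclasses import dataclass


@dataclass
class BlockCounts:
    """Confusion counts for one block (hypothetical container, made-up numbers)."""
    tp: float  # declared matches that are true matches
    fp: float  # declared matches that are not true matches
    fn: float  # true matches that were not declared


def aggregate(blocks):
    """Pool per-block counts, then compute rates on the pooled totals."""
    tp = sum(b.tp for b in blocks)
    fp = sum(b.fp for b in blocks)
    fn = sum(b.fn for b in blocks)
    return {
        "fdr": fp / (tp + fp),  # false discovery rate among declared matches
        "fnr": fn / (tp + fn),  # false negative rate among true matches
    }


males = BlockCounts(tp=90, fp=10, fn=10)
females = BlockCounts(tp=45, fp=5, fn=30)
overall = aggregate([males, females])
```

Pooling the counts weights each block by its size, so the overall rates describe the whole merge rather than any single block, and they generally differ from a simple average of the per-block rates.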
If anything, please do not hesitate to contact us!
Ted
Thanks, this is great news!
While RecordLinkage and fastLink can produce different outputs, some important differences are due to the workflow. I think some of the distinctions that were made were overstated. For example, both packages separate the 'blocking' step from the later 'estimation' step. Let me explain the workflow differences I have in mind by referring to AutoMatch, which is the traditional reference software in probabilistic record linkage. Soon to be replaced with fastLink as the new standard :-)
I guess that the typical RecordLinkage workflow is conditional matching (aka matching with replacement) -- similar to AutoMatch. In contrast, it seems to me that the typical fastLink usage is unconditional matching (aka matching without replacement) -- unlike AutoMatch. For example, in RecordLinkage gender could be used both as a blocking variable (e.g., in pass 1) and as a matching variable (e.g., in pass 2). In contrast, in fastLink gender would be used either as blocking (e.g., on males in pass 1 and on females in pass 2) or as matching, not as both.
Because of the importance of gender for blocking in fastLink, I suggest it would be very helpful if at least one example in the vignette includes blocking on gender.
Anders
I agree with @aalexandersson that the ideal, flexible situation is to allow a field to be used optionally in both:
1. form_blocks(gender)
2. estimate_params(gender)
I want to suggest a design in which 1 and 2 can be used modularly, as in
estimate_params(field=gender,blocks=form_blocks(gender))
Although RecordLinkage allows the option of both, you have to programmatically request it in one step:
umbrella_block_and_estimate(param=gender,block=gender)
See the compare.dedup() example in the article linked below, which acts as my umbrella_block_and_estimate() above:
https://journal.r-project.org/archive/2010/RJ-2010-017/RJ-2010-017.pdf
Having one single function do both was a bad design choice. Specifically, compare.dedup() did not easily allow a user to interrogate different blocking strategies.
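The modular design suggested above, with the output of form_blocks() passed into estimate_params(), could look roughly like the following sketch. It is in Python rather than R, both functions are hypothetical stand-ins and not part of fastLink or RecordLinkage, the records are invented, and the "estimation" is only a per-block agreement rate used as a placeholder for a real model fit.

```python
from collections import defaultdict


def form_blocks(records_a, records_b, field):
    """Group record indices by the value of `field`; return candidate pairs per block."""
    by_value_a, by_value_b = defaultdict(list), defaultdict(list)
    for i, rec in enumerate(records_a):
        by_value_a[rec[field]].append(i)
    for j, rec in enumerate(records_b):
        by_value_b[rec[field]].append(j)
    # Only values present in both files produce candidate pairs.
    return {
        value: [(i, j) for i in by_value_a[value] for j in by_value_b[value]]
        for value in by_value_a.keys() & by_value_b.keys()
    }


def estimate_params(records_a, records_b, field, blocks):
    """Placeholder for estimation: agreement rate on `field` within each block."""
    rates = {}
    for value, pairs in blocks.items():
        agree = sum(records_a[i][field] == records_b[j][field] for i, j in pairs)
        rates[value] = agree / len(pairs)
    return rates


file_a = [
    {"gender": "F", "surname": "Lee"},
    {"gender": "M", "surname": "Ng"},
    {"gender": "F", "surname": "Diaz"},
]
file_b = [
    {"gender": "F", "surname": "Lee"},
    {"gender": "M", "surname": "Wu"},
]

# Blocks are built once and reused with different comparison fields.
blocks = form_blocks(file_a, file_b, "gender")
by_surname = estimate_params(file_a, file_b, "surname", blocks)
by_gender = estimate_params(file_a, file_b, "gender", blocks)
```

Because the blocks are an ordinary value, different blocking strategies can be swapped in without touching the estimation step. In this toy setup, reusing the blocking field as the comparison field gives an agreement rate of 1 in every block, since exact blocking has already forced agreement on it.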
hi, this package seems great and much needed.
i had 2 questions, whose answers could strengthen the documentation / articles / vignettes.
This would be a point to highlight as a strength.
One pain point in using the other package, RecordLinkage (https://cran.r-project.org/web/packages/RecordLinkage/index.html), is that the blocking step was tied to the estimation step in a single overarching function. When I was using the RecordLinkage package, I wished it had been broken out into two functions, form_blocks() and estimate_params().
Is the summary() method 'conditional' or 'unconditional'? In the sense that, do the metrics apply to cases after blocking, or do the metrics apply to cases unconditional on the blocking?
This was another pain point of RecordLinkage, in that the performance metrics only applied to conditional cases after blocking.
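A small sketch of why the distinction matters, in Python with invented pairs and a hypothetical recall() helper (nothing here comes from fastLink or RecordLinkage). It assumes the true matches are known, which in practice they usually are not.

```python
def recall(declared, true_matches):
    """Share of `true_matches` that appear in `declared`."""
    return len(declared & true_matches) / len(true_matches)


# Pairs are (index in file A, index in file B).
true_matches = {(0, 0), (1, 1), (2, 2), (3, 3)}

# Blocking keeps only these candidate pairs; the true match (3, 3) is lost,
# for example because the blocking field was miscoded in one file.
candidates = {(0, 0), (0, 1), (1, 1), (2, 2)}

# Pairs the linkage step declares as matches among the candidates.
declared = {(0, 0), (1, 1)}

# Conditional: only true matches that survived blocking count.
conditional = recall(declared, true_matches & candidates)
# Unconditional: every true match counts, including those blocking dropped.
unconditional = recall(declared, true_matches)
```

Here the conditional recall is 2 out of 3 while the unconditional recall is 2 out of 4, so metrics reported only on the blocked pairs can look better than the linkage as a whole.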