BERENZ opened this issue 6 months ago
Disclaimer: I am a regular fastLink user, not a fastLink developer.
Great question. One possibility is "probabilistic blocking"; see https://doi.org/10.1007/978-3-030-57521-2_16. Also, I am optimistic that the developers will release a faster and more accurate version of fastLink real soon now.
How does your blocking package perform in terms of false positives (FP)? I think it would be helpful to address this in your (currently mostly empty) vignette at articles/v3-evaluation.html.
Thank you! The package is still under development, but some initial results can be found in this notebook. Note that these metrics are at the block level, not the unit level.
My approach is similar to the one in the paper on Syrian civil war casualties, where the authors used the LSH algorithm. I am using different algorithms, but the package also allows for LSH via the mlpack library.
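For readers unfamiliar with LSH-based blocking, here is a minimal, self-contained sketch of the general idea in Python. This is illustrative only, not the blocking package's or mlpack's actual implementation; the character shingling, the crc32-based hash family, and the band sizes are all arbitrary choices for the example:

```python
# LSH-style blocking sketch: records are shingled into character bigrams,
# minhashed, and the signature is split into bands; records that share any
# band signature fall into the same bucket (block), so near-duplicates are
# grouped without comparing all pairs.
import random
import zlib
from collections import defaultdict

def shingles(text, n=2):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(sh, seeds):
    # one minimum per seed; crc32 keyed by the seed is a cheap hash family
    return tuple(min(zlib.crc32((s + str(seed)).encode()) for s in sh)
                 for seed in seeds)

def lsh_blocks(records, n_hashes=24, bands=8):
    rows = n_hashes // bands
    rng = random.Random(0)
    seeds = [rng.randrange(1 << 30) for _ in range(n_hashes)]
    buckets = defaultdict(set)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), seeds)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    # keep only buckets that actually group records together
    return [sorted(members) for members in buckets.values() if len(members) > 1]

records = ["monika kowalska", "monika kowalsla", "jan nowak",
           "anna nowak", "jan nowak"]
blocks = lsh_blocks(records)
```

Tuning the number of bands and rows trades recall against block size: more bands with fewer rows each makes collisions (and therefore larger blocks) more likely.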
What do you mean by new version? Is there a plan to include probabilistic blocking?
Do I understand correctly that the current version of the fastLink package does not allow for what I am asking in this issue?
My understanding is that the current version of fastLink can narrow down the number of possible pairs, as you ask, by using fastLink in a separate blocking step as "approximate probabilistic blocking". But this is only a proof of concept rather than established best practice.
Regarding the planned new version, there are posted tidbits here and there. For example, this is known since December: "We plan to release a new version of fastLink that will keep the same structure but will be faster and, more importantly, do so w/o sacrificing accuracy." Source: https://github.com/kosukeimai/fastLink/issues/73#issuecomment-1856192946
Hi @BERENZ,
You are totally correct! Using noisy fields for blocking can lead to errors, making traditional deterministic blocking techniques less effective. Thanks for your work on this topic! I am eager to learn more about your R-package.
LSH has shown promise for improving blocking techniques. I am currently developing a new algorithm based on LSH to scale Record Linkage tasks. Unlike traditional blocking, this algorithm focuses on how comparisons are made within and also across fields -- all within the FS framework.
As @aalexandersson mentioned, I plan to release updates to fastLink soon, which will aim to reduce the memory burden associated with handling a large number of comparisons.
Finally, if I understood correctly, you have many small-sized blocks and want to run fastLink on those records. If so, one practical solution might be to combine all the small-sized blocks and then run fastLink. The wrapper will perform all comparisons in the cross-product of the resulting data from your approach.
If I misunderstood anything, please let me know.
Ted
@tedenamorado thanks for the kind words. Let me first start with the topic of this issue.
> Finally, if I understood correctly, you have many small-sized blocks and want to run fastLink on those records. If so, one practical solution might be to combine all the small-sized blocks and then run fastLink. The wrapper will perform all comparisons in the cross-product of the resulting data from your approach.
I understand that grouping small-sized blocks into larger ones is an option. Could you please provide any suggestions regarding this approach? The result of my blocking procedure is a large number of 2-3 sized blocks. Should I group them to achieve a size of, say, 20-50 units? Is there a theoretical suggestion on how to do so? I would like to go beyond the rule of thumb.
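To make the question concrete, one naive option would be to greedily pack the tiny blocks into super-blocks of a chosen size before running the linkage model. This is a heuristic sketch, not a theoretically grounded rule; the target size and the largest-first packing order are arbitrary assumptions:

```python
# Greedily merge many tiny blocks into super-blocks of roughly `target`
# records. Block contents here are just record ids.

def group_blocks(blocks, target=30):
    grouped, current, size = [], [], 0
    # packing largest blocks first gives a slightly more even result
    for block in sorted(blocks, key=len, reverse=True):
        if size + len(block) > target and current:
            grouped.append(current)
            current, size = [], 0
        current.extend(block)
        size += len(block)
    if current:
        grouped.append(current)
    return grouped

tiny_blocks = [[i, i + 1] for i in range(0, 60, 2)]  # thirty blocks of size 2
supers = group_blocks(tiny_blocks, target=20)
```

The trade-off to keep in mind: larger super-blocks give the EM step more pairs to estimate the m- and u-probabilities from, but every added cross-block comparison is, by construction of the blocking step, a likely non-match.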
Now, concerning the blocking package:
> You are totally correct! Using noisy fields for blocking can lead to errors, making traditional deterministic blocking techniques less effective. Thanks for your work on this topic! I am eager to learn more about your R-package.
My package works as follows:

1. It finds approximate nearest neighbours for the records (by default with the rnndescent package, which supports sparse matrices and is really fast). LSH is also available through the mlpack::lsh() function.
2. It uses the igraph library to create a graph from the neighbour pairs and returns clusters of nodes as the blocks.

This is the complete recipe. :)
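The two steps (nearest-neighbour search, then graph clustering) can be sketched in Python, with brute-force bigram-Jaccard search and union-find standing in for rnndescent and igraph. This is a toy analogue of the pipeline, not the package's actual code:

```python
# ANN-then-graph blocking sketch: link each record to its nearest neighbour
# (here by exact bigram Jaccard similarity, a stand-in for rnndescent),
# then return connected components of the resulting graph as blocks
# (a stand-in for igraph clustering).

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def blocks_via_graph(records):
    grams = [bigrams(r) for r in records]
    parent = list(range(len(records)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # add an edge from each record to its single nearest neighbour
    for i in range(len(records)):
        nn = max((j for j in range(len(records)) if j != i),
                 key=lambda j: jaccard(grams[i], grams[j]))
        parent[find(i)] = find(nn)

    comps = {}
    for i in range(len(records)):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

recs = ["ann smith", "anne smith", "bob jones", "bob joness"]
result = blocks_via_graph(recs)
```

Because every record is linked to at least one neighbour, the components are never singletons, which is exactly why the resulting blocks tend to be very small (pairs and triples) when the data contain few duplicates.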
If you would like to test the package, please let me know. If you find any errors or have suggestions for improvement, please open an issue on the GitHub repository. I am available to assist with any questions you may have about the package.
Imagine that the blocking variables are not free of errors (they include typos, missing data, etc.), so I cannot use the blockData function to narrow down the number of possible pairs. As a solution, I am using approximate nearest neighbour algorithms and graphs to create blocks of records (see my blocking package). This somewhat mimics the microclustering approach (see also Johndrow et al. (2018)).

However, the resulting blocks are small, containing only 2, 3, or 4 pairs. This makes them unsuitable for use with the blockData function, as the blocks are too small. That is why I would like to ask whether it is possible to use the fastLink package in such cases. The main function fastLink takes all possible combinations of dfA and dfB, but I would like to limit the number of pairs to the candidates and then apply the EM algorithm (the FS model). Below you may find an example of the problem (a detailed description can be found here).
You can see that the majority of blocks are of size 2.
Here is the list of candidate pairs.
I would like to use this information about pairs (x, y) for the linkage using the fastLink package.

EDIT: to fully understand my problem, I provide an example using the reclin2 package.
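For illustration, the core idea of the request -- fit the Fellegi-Sunter mixture by EM over the candidate pairs only, instead of the full cross-product of dfA and dfB -- can be sketched in Python. This is a toy sketch, not fastLink's or reclin2's API; the binary agreement vectors, starting values, and conditional-independence updates are simplified assumptions:

```python
# EM for a two-class (match / non-match) mixture over binary agreement
# vectors, computed ONLY for pre-selected candidate pairs.
import math

def em_fs(gamma, n_iter=50):
    """gamma: list of binary agreement vectors, one per candidate pair."""
    k = len(gamma[0])
    lam = 0.5          # P(pair is a match)
    m = [0.9] * k      # P(field j agrees | match)
    u = [0.1] * k      # P(field j agrees | non-match)
    for _ in range(n_iter):
        # E-step: posterior match probability for each candidate pair
        w = []
        for g in gamma:
            pm = lam * math.prod(m[j] if g[j] else 1 - m[j] for j in range(k))
            pu = (1 - lam) * math.prod(u[j] if g[j] else 1 - u[j] for j in range(k))
            w.append(pm / (pm + pu))
        # M-step: update mixing weight and per-field agreement rates
        sw = sum(w)
        lam = sw / len(gamma)
        for j in range(k):
            m[j] = sum(wi * g[j] for wi, g in zip(w, gamma)) / sw
            u[j] = sum((1 - wi) * g[j] for wi, g in zip(w, gamma)) / (len(gamma) - sw)
        # clamp away from 0/1 for numerical stability
        m = [min(max(p, 1e-6), 1 - 1e-6) for p in m]
        u = [min(max(p, 1e-6), 1 - 1e-6) for p in u]
    return w

# candidate pairs only: three mostly-agreeing pairs, three clear non-matches
gamma = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 1, 0), (0, 0, 0)]
post = em_fs(gamma)
```

The point of the sketch is that nothing in the EM step requires the full cross-product: the likelihood is evaluated only over the supplied candidate pairs, which is precisely the restriction asked for in this issue. Whether fastLink exposes such an entry point is the open question here.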