kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Enhancement - Exact Matching Option #30

Closed: Weekend-Warrior closed this issue 6 years ago

Weekend-Warrior commented 6 years ago

Hi! Thanks for this wonderful contribution. As soon as I am able, I will attempt a push, but I wanted to get your feedback on an option to assign exact matches even when the probability of matching is too low. I am running into cases where my data is too small to make a probabilistic match and, though exact matches exist, the EM step does not output their indices despite noting them in the summary output.

Thanks in advance for your time and consideration.

tedenamorado commented 6 years ago

Hi,

If possible, could you share with us the results you obtain from the EM step? That would help a lot in diagnosing the problem. If I understand you correctly, your problem is that even when you have exact matches in every linkage field (I imagine you only have a few fields to merge), the probability of being a match is low. Is that a correct assessment of your problem?

If anything, let us know.

All the best,

Ted

Weekend-Warrior commented 6 years ago

Sure!

$`zeta.j`
      [,1]
 [1,]    0
 [2,]    0
 [3,]    0
 [4,]    0
 [5,]    0
 [6,]    0
 [7,]    0
 [8,]    0
 [9,]    0
[10,]    0
[11,]    0
[12,]    0
[13,]    0
[14,]    0
[15,]    0
[16,]    0
[17,]    0
[18,]    0
[19,]    0
[20,]    0
[21,]    0
[22,]    0
[23,]    0

$p.m
[1] 0

$p.u
[1] 1

$p.gamma.k.m
$p.gamma.k.m[[1]]
numeric(0)

$p.gamma.k.m[[2]]
numeric(0)

$p.gamma.k.m[[3]]
numeric(0)

$p.gamma.k.m[[4]]
numeric(0)

$p.gamma.k.m[[5]]
[1] 1.000028964999313e-15 9.999999999999990e-01

$p.gamma.k.u
$p.gamma.k.u[[1]]
[1] 0.9927874764363576743 0.0062289976231456436 0.0009835259404966806

$p.gamma.k.u[[2]]
[1] 0.9970494221785100031 0.0027046963363658717 0.0002458814851241702

$p.gamma.k.u[[3]]
[1] 0.9978690271289238911 0.0019670518809933612 0.0001639209900827801

$p.gamma.k.u[[4]]
[1] 0.94148020654044751 0.05851979345955249

$p.gamma.k.u[[5]]
[1] 0.6483375959079284 0.3516624040920716

$p.gamma.j.m
                       [,1]
 [1,] 6.666859766662080e-17
 [2,] 6.666859766662080e-17
 [3,] 6.666859766662080e-17
 [4,] 6.666859766662080e-17
 [5,] 6.666859766662080e-17
 [6,] 6.666859766662080e-17
 [7,] 6.666859766662080e-17
 [8,] 6.666859766662080e-17
 [9,] 6.666666666666661e-02
[10,] 6.666666666666661e-02
[11,] 6.666666666666661e-02
[12,] 6.666666666666661e-02
[13,] 6.666666666666661e-02
[14,] 6.666666666666661e-02
[15,] 6.666666666666661e-02
[16,] 6.666666666666661e-02
[17,] 6.666666666666661e-02
[18,] 6.666666666666661e-02
[19,] 6.666666666666668e-02
[20,] 6.666666666666668e-02
[21,] 6.666666666666668e-02
[22,] 6.666666666666668e-02
[23,] 6.666666666666668e-02

$p.gamma.j.u
                       [,1]
 [1,] 3.035002645472656e-01
 [2,] 1.904236779129216e-03
 [3,] 8.233052799062697e-04
 [4,] 7.484593453693354e-05
 [5,] 5.982756754935828e-04
 [6,] 1.475402405656184e-07
 [7,] 1.886473307972035e-02
 [8,] 3.098929458680959e-06
 [9,] 1.646204590739606e-01
[10,] 1.032870047851153e-03
[11,] 1.630847443975504e-04
[12,] 4.465659802252942e-04
[13,] 4.059690729320853e-05
[14,] 2.704237520722148e-05
[15,] 3.245085024866574e-04
[16,] 1.023235028978915e-02
[17,] 1.013689453293732e-05
[18,] 4.927837081818937e-12
[19,] 4.681207236212263e-01
[20,] 1.154428418301419e-04
[21,] 2.275669982688629e-07
[22,] 2.909708336950949e-02
[23,] 1.401297671993604e-11

$patterns.w
      gamma.1 gamma.2 gamma.3 gamma.4 gamma.5 counts              weights           p.gamma.j.m           p.gamma.j.u
 [1,]       0       0       0       0       0   3631 -36.0544248346050082 6.666859766662080e-17 3.035002645472656e-01
 [2,]       1       0       0       0       0      2 -30.9831236398745382 6.666859766662080e-17 1.904236779129216e-03
 [3,]       0       1       0       0       0      2 -30.1446141408538679 6.666859766662080e-17 8.233052799062697e-04
 [4,]       0       2       0       0       0      4 -27.7467188680554955 6.666859766662080e-17 7.484593453693354e-05
 [5,]       0       0       2       0       0      1 -29.8253387170021611 6.666859766662080e-17 5.982756754935828e-04
 [6,]       0       2       2       0       0      1 -21.5176327504526483 6.666859766662080e-17 1.475402405656184e-07
 [7,]       0       0       0       2       0    208 -33.2763365567991869 6.666859766662080e-17 1.886473307972035e-02
 [8,]       0       0       1       2       0      1 -24.5623437894083452 6.666859766662080e-17 3.098929458680959e-06
 [9,]       0       0       0       0       2   6574  -0.9039374983495465 6.666666666666661e-02 1.646204590739606e-01
[10,]       1       0       0       0       2     10   4.1673636963809244 6.666666666666661e-02 1.032870047851153e-03
[11,]       2       0       0       0       2     48   6.0131903868792556 6.666666666666661e-02 1.630847443975504e-04
[12,]       0       1       0       0       2      1   5.0058731954015965 6.666666666666661e-02 4.465659802252942e-04
[13,]       0       2       0       0       2      4   7.4037684681999671 6.666666666666661e-02 4.059690729320853e-05
[14,]       0       0       1       0       2      1   7.8100552690412979 6.666666666666661e-02 2.704237520722148e-05
[15,]       0       0       2       0       2      1   5.3251486192532997 6.666666666666661e-02 3.245085024866574e-04
[16,]       0       0       0       2       2    433   1.8741507794562726 6.666666666666661e-02 1.023235028978915e-02
[17,]       2       0       0       2       2      8   8.7912786646850751 6.666666666666661e-02 1.013689453293732e-05
[18,]       2       2       2       2       2     18  23.3280707488374368 6.666666666666661e-02 4.927837081818937e-12
[19,]       0       0       0       0      NA   1203  -1.9490211412282854 6.666666666666668e-02 4.681207236212263e-01
[20,]       0       2       0       0      NA      3   6.3586848253212285 6.666666666666668e-02 1.154428418301419e-04
[21,]       0       2       2       0      NA      1  12.5877709429240738 6.666666666666668e-02 2.275669982688629e-07
[22,]       0       0       0       2      NA     44   0.8290671365775344 6.666666666666668e-02 2.909708336950949e-02
[23,]       2       2       2       2      NA      2  22.2829871059586964 6.666666666666668e-02 1.401297671993604e-11

$iter.converge
[1] 7

$nobs.a
[1] 49

$nobs.b
[1] 249

$varnames
[1] "FirstName"   "LastName"    "Address"     "DateOfBirth" "gender"     

attr(,"class")
[1] "fastLink"    "fastLink.EM" 
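
One way to read the output above, as a rough sketch rather than fastLink's own code: the weights column in patterns.w is reproduced by log(p.gamma.j.m / p.gamma.j.u), and a Bayes-rule posterior built from p.m = 0 is zero for every pattern. That would explain why even the all-fields-agree pattern (row 18, weight about 23.3) yields no matched indices. The variable names below are illustrative, and the posterior formula is the standard Fellegi-Sunter one, assumed here rather than taken from the package source.

```r
# Values copied from the EM output above (rows 1, 9 and 18 of $patterns.w).
p_gamma_m <- c(6.666859766662080e-17, 6.666666666666661e-02, 6.666666666666661e-02)
p_gamma_u <- c(3.035002645472656e-01, 1.646204590739606e-01, 4.927837081818937e-12)
p_m <- 0  # $p.m: estimated share of matches
p_u <- 1  # $p.u: estimated share of non-matches

# Log-likelihood-ratio weight per agreement pattern.
# Compare with the weights column: -36.0544, -0.9039, 23.3281.
weights <- log(p_gamma_m / p_gamma_u)
print(round(weights, 4))

# Posterior match probability by Bayes' rule (assumed form of zeta.j).
# With p_m = 0 the numerator vanishes, whatever the weight is.
zeta <- p_m * p_gamma_m / (p_m * p_gamma_m + p_u * p_gamma_u)
print(zeta)
```

A large positive weight alone is therefore not enough: once the EM drives p.m to zero, no pattern can clear a posterior threshold.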

tedenamorado commented 6 years ago

Thanks a lot! As you suspect, the problem is that we have only two patterns with some traction in terms of being matches. I have two alternatives for you:

  1. In a Fellegi-Sunter model under conditional independence, numerical underflow is sometimes an issue. To avoid it, change the convergence criterion tol for the EM. For example, tol = 1e-03 should work in your case.

  2. Move to the model that relaxes the conditional independence assumption. This would require you to add the option cond.indep = FALSE in fastLink() or to use emlinklog() instead of emlinkMARmov().

See ?fastLink, ?emlinklog, and ?emlinkMARmov for more information.
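
The two alternatives above might look roughly like this in practice. This is a sketch under assumptions, not a tested recipe for the data in this issue: it uses the example data frames dfA and dfB that ship with the package (data(samplematch)) and a hypothetical choice of linkage fields, and it assumes that fastLink() exposes the EM tolerance as tol.em (the tol argument mentioned above belongs to the EM functions themselves). Check ?fastLink for your installed version.

```r
# Sketch only: runs if the fastLink package is installed, otherwise does nothing.
if (requireNamespace("fastLink", quietly = TRUE)) {
  library(fastLink)
  data(samplematch)  # example data frames dfA and dfB shipped with fastLink

  # Hypothetical choice of linkage fields from the example data.
  vars <- c("firstname", "lastname", "housenum", "birthyear")

  # Alternative 1: keep conditional independence, loosen the EM tolerance.
  out_tol <- fastLink(
    dfA = dfA, dfB = dfB, varnames = vars,
    stringdist.match = c("firstname", "lastname"),
    tol.em = 1e-03
  )

  # Alternative 2: relax the conditional independence assumption.
  out_loglin <- fastLink(
    dfA = dfA, dfB = dfB, varnames = vars,
    stringdist.match = c("firstname", "lastname"),
    cond.indep = FALSE
  )
}
```

Replace dfA, dfB and vars with your own data frames and field names; the two options are shown in separate calls so their results can be compared side by side.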

I hope that helps! If anything, please let us know.

All the best,

Ted

Weekend-Warrior commented 6 years ago

Hey Ted,

Thanks for this. Relaxing conditional independence did the trick.

Best regards,

Stewart