kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

not all patterns with NA counted? #41

Closed timbp closed 5 years ago

timbp commented 5 years ago

It seems that if a variable has missing values, not all patterns are counted. Is this intended?

```
> g1 = gammaCKpar(dfA$firstname, dfB$firstname)
> g2 = gammaCKpar(dfA$lastname, dfB$lastname)
> tc = tableCounts(list(g1, g2), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
> tc
     gamma.1 gamma.2 counts
[1,]       0       0 172338
[2,]       1       0    271
[3,]       2       0   2170
[4,]       0       1     50
[5,]       0       2    120
[6,]       1       2      1
[7,]       2       2     50
attr(,"class")
[1] "fastLink"    "tableCounts"
```

There are no missing values in these two variables. The counts sum to 175000 (== 500 * 350, i.e. every possible record pair), and pattern (2, 2) has a count of 50.

Add middlename, which has missing values:

```
> g3 = gammaCKpar(dfA$middlename, dfB$middlename)
> t = tableCounts(list(g1, g2, g3), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
> t
      gamma.1 gamma.2 gamma.3 counts
 [1,]       0       0       0 115305
 [2,]       1       0       0    193
 [3,]       2       0       0   1477
 [4,]       0       1       0     39
 [5,]       0       2       0     79
 [6,]       1       2       0      1
 [7,]       0       0       1     24
 [8,]       0       0       2    816
 [9,]       1       0       2      2
[10,]       2       0       2     10
[11,]       0       2       2      1
[12,]       2       2       2     43
[13,]       0       0      NA  50690
[14,]       1       0      NA     68
[15,]       2       0      NA    615
[16,]       0       1      NA     10
[17,]       0       2      NA     37
attr(,"class")
[1] "fastLink"    "tableCounts"
```

The counts now sum to 169410, so it appears that 5590 pairs (175000 - 169410) have not been counted. Pattern (2, 2, 2) has a count of 43, but there are no other patterns starting with (2, 2, ...), so 7 of the 50 pairs that match on both firstname and lastname do not appear in this table.
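The arithmetic above can be reproduced without fastLink by collapsing the three-field table over gamma.3 and comparing it with the two-field table. The sketch below retypes the counts from the two printed tables; the object names `t2`, `t3`, `collapsed`, and `cmp` are made up for illustration and are not part of fastLink.

```r
# Two-field table (firstname, lastname), copied from the first output.
t2 <- data.frame(
  gamma.1 = c(0, 1, 2, 0, 0, 1, 2),
  gamma.2 = c(0, 0, 0, 1, 2, 2, 2),
  counts  = c(172338, 271, 2170, 50, 120, 1, 50)
)

# Three-field table (adds middlename), copied from the second output.
t3 <- data.frame(
  gamma.1 = c(0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 0, 2, 0, 1, 2, 0, 0),
  gamma.2 = c(0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 2),
  gamma.3 = c(0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, NA, NA, NA, NA, NA),
  counts  = c(115305, 193, 1477, 39, 79, 1, 24, 816, 2, 10, 1, 43,
              50690, 68, 615, 10, 37)
)

# Sum the three-field counts over gamma.3. If every pair were counted,
# this would reproduce the two-field counts exactly.
collapsed <- aggregate(t3["counts"],
                       by = t3[c("gamma.1", "gamma.2")],
                       FUN = sum)

# Line the two tables up by pattern and compute the shortfall per pattern.
cmp <- merge(t2, collapsed, by = c("gamma.1", "gamma.2"),
             all = TRUE, suffixes = c(".two", ".three"))
cmp$uncounted <- cmp$counts.two - cmp$counts.three

print(cmp)
sum(cmp$uncounted)  # 5590, the same gap as 175000 - 169410
```

Every pattern except (1, 2) shows a shortfall, which suggests the uncounted pairs are spread across the table rather than confined to the (2, 2, ...) patterns.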

When I made my own code (in Julia) to count patterns, I got the following result:

```
0 0 0        115305
1 0 0           193
2 0 0          1477
0 1 0            39
0 2 0            79
1 2 0             1
0 0 1            24
0 0 2           816
1 0 2             2
2 0 2            10
0 2 2             1
2 2 2            43
0 0 missing   56193
1 0 missing      76
2 0 missing     683
0 1 missing      11
0 2 missing      40
2 2 missing       7
```

Differences from the fastLink results are all in the patterns containing missing values.
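A brute-force counter of this kind can be sketched in base R. This is a minimal sketch, not the Julia code mentioned above and not fastLink's implementation. The toy data frames `A` and `B` and the helper `agree()` are hypothetical, and the comparison uses exact equality only (levels 0 and 2), so it has no partial-agreement level 1.

```r
# Hypothetical toy data; these are not the dfA/dfB used in the post.
A <- data.frame(first  = c("ann", "bob", "cy"),
                middle = c("k", NA, "j"),
                stringsAsFactors = FALSE)
B <- data.frame(first  = c("ann", "bob"),
                middle = c("k", "m"),
                stringsAsFactors = FALSE)

# Compare every value in a with every value in b:
# 2 = equal, 0 = different, NA = at least one side is missing.
agree <- function(a, b) {
  eq <- outer(a, b, "==")  # NA propagates when either value is NA
  ifelse(eq, 2L, 0L)
}

# One row per record pair, one column per compared field.
g <- data.frame(gamma.1 = as.vector(agree(A$first, B$first)),
                gamma.2 = as.vector(agree(A$middle, B$middle)))

# Tabulate the patterns, keeping NA as its own category so that pairs
# with a missing comparison are not silently dropped.
patterns <- as.data.frame(
  table(gamma.1 = g$gamma.1, gamma.2 = g$gamma.2, useNA = "ifany"),
  responseName = "counts"
)
patterns <- patterns[patterns$counts > 0, ]

print(patterns)

# Sanity check: every pair lands in exactly one pattern.
stopifnot(sum(patterns$counts) == nrow(A) * nrow(B))
```

The final check is the same invariant used in the post: whatever the number of missing values, the pattern counts should sum to the number of record pairs.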

tedenamorado commented 5 years ago

Hi,

Thanks a lot for raising this issue. Rest assured that we will take a close look. The counts for patterns that include a missing value should not miss pairs.

Ted

tedenamorado commented 5 years ago

Hi,

Thanks again for raising this issue!

There was a problem with how missing values were handled in gammaCKpar(). The issue has been resolved, and if you install using devtools, your R code should produce the desired output.
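For reference, installing from GitHub with devtools usually looks like the sketch below. It assumes that devtools is already installed and that the fix is available on the repository's default branch; the `interactive()` guard is only there so that sourcing the snippet in a batch job does not trigger a network install.

```r
# Repository slug, taken from the header of this issue page.
repo <- "kosukeimai/fastLink"

if (interactive()) {
  # Assumption: devtools is installed (install.packages("devtools")).
  devtools::install_github(repo)
}
```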

If anything else comes up, please do not hesitate to reach out.

Ted

timbp commented 5 years ago

all looks good now