kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
253 stars 46 forks source link

question - view posterior for all pairs of rows #36

Open mbcann01 opened 5 years ago

mbcann01 commented 5 years ago

Hey guys,

Thank you for developing this package. I think it will be really useful to me. It seems like the default behavior for fastLink::fastLink is to return the indices and posterior probabilities for matching rows only.

Is there a way to view the posterior probabilities for ALL pairs of rows?

I'm interested in knowing if there were any pairs that were NOT matched, but should be.

Thank you, Brad

bfifield commented 5 years ago

Hi Brad, glad to hear it's been helpful to you. There are a few options.

First, if you set return.all = TRUE (the default), then fastLink() will return all matches with a posterior match probability above 0.0001, and will it automatically dedupe all matches. In many settings, if we don't dedupe, the resulting object is very large (nearly N_a x N_b rows).

However, if you have enough memory to return the indices for all pairwise comparisons, you can simply set dedupe.matches = FALSE and return.all = TRUE. fastLink() will spit out a warning that says "You have specified that all matches be returned but are not deduping the matches. The resulting object may be very large." This will return the indices and posteriors for all pairwise comparisons where the posterior match probability is greater than 0.0001, without deduping.

If you want to set the threshold even lower, you can override this in a slightly convoluted way, which we should probably make easier to do. Set return.all = FALSE, dedupe.matches = TRUE, and threshold.match = [some even smaller value].

Let me know if you have any more questions - I hope this is helpful!

Ben

mbcann01 commented 5 years ago

Hi Ben,

Thank you very much for the response!

It's possible that I'm still doing something wrong, but I don't think your solution works in my particular situation because I'm trying to find matches within a single data set rather than two data sets. I should have been more specific from the begininning. Below I go into greater detail:

Background

In my data, each row represents a unique treatment interaction with emergency medical services (“incident”). However, some people were treated at more than once. Therefore, incident number doesn't necessarily uniquely identify people. For some measures (e.g., demographic characteristics of the people we treated) we really want to be able to do analysis at the person level.

Initially, we attempted to uniquely identify people based on a combination of name and DOB (i.e., just using the paste function). However, creating a unique identifier this way is complicated by data entry errors (i.e., misspelled names and mistyped dates of birth). For example, john_smith_1936_10_03 and jon_smith_1936_10_03 may be the same person, but would generate two different unique id numbers.

So, we began using the RecordLinkage package to find matches and assign unique id numbers. This worked really well for the analysis of some of our pilot data; however, we are now analyzing data from a much longer time period, and I’d really like to switch over to using fastLink for some of the speed and scalability reasons that you mention in your publications.

Here's a small reproducible example of what I'm trying to do and the issue I'm encountering:

Simulate data

Below, we have the following people:

  1. John Smith (unique person 1)

  2. Jane Smith (unique person 2)

  3. Joy Smith - John's wife who lives at the same address (unique person 3)

  4. Amy Jones - no matches (unique person 4)

df <- tibble(
  incident   = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008),
  nm_first   = c("john", "john", "jane", "jon", "jane", "joy", "michael", "amy"),
  nm_last    = c(rep("smith", 7), "jones"),
  sex        = c("m", "m", "f", "m", "f", "f", "m", "f"),
  birth_mnth = c(9, 9, 2, 9, 3, 8, 9, 1),
  birth_year = c(1936, 1936, 1937, 1936, 1937, 1941, 1936, 1947),
  add_num    = c(101, 101, 14, 101, 14, 101, 101, 1405),
  add_street = c("main", "main", "elm", "main", "elm", "main", "main", "texas")
) %>% 
  mutate(row = row_number()) %>% 
  select(row, everything()) %>% 
  print()

I've also included the following data entry errors/matching challenges to resolve:

  1. In row 4, the "h" is missing from "john"

  2. In row 5, there is a "3" for birth_mnth instead of a "2"

  3. In row 7, the medic entered John Smith's middle name, "michael", in the first name column.

  4. Joy and John have a lot of information in common, but we want to be able to classify them as a non-match.

#> # A tibble: 8 x 9
#>     row incident nm_first nm_last sex   birth_mnth birth_year add_num add_street
#>   <int>    <dbl> <chr>    <chr>   <chr>      <dbl>      <dbl>   <dbl> <chr>     
#> 1     1     1001 john     smith   m              9       1936     101 main      
#> 2     2     1002 john     smith   m              9       1936     101 main      
#> 3     3     1003 jane     smith   f              2       1937      14 elm       
#> 4     4     1004 jon      smith   m              9       1936     101 main      
#> 5     5     1005 jane     smith   f              3       1937      14 elm       
#> 6     6     1006 joy      smith   f              8       1941     101 main      
#> 7     7     1007 michael  smith   m              9       1936     101 main      
#> 8     8     1008 amy      jones   f              1       1947    1405 texas

Reduce search space

In the real data, if we attempt to determine if every individual row in the entire data set is a match with every other individual row, it is a costly process. It becomes much more tractable if we temporarily reduce the data to one row for each combination of values that we want to investigate. In this case, we want to investigate all unique combinations of name, DOB, and address.

df_unique_combo <- df %>% 
  select(-row) %>% 
  mutate(group = paste(nm_first, nm_last, birth_year, birth_mnth, add_num, add_street, sep = "_")) %>%
  group_by(group) %>% 
  filter(row_number() == 1) %>% 
  ungroup()

Incident 1002 was dropped because it didn't have any unique information.

Use fastLink

Find unique people with fastLink

fastlink_out <- fastLink::fastLink(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  dedupe.matches = FALSE,
  return.all = TRUE
)
screen shot 2019-01-06 at 9 21 09 am

As you can see in the screenshot above, even with return.all set to TRUE, the default behavior of fastLink is dfA and dfB are identical, assuming deduplication of a single data set. Setting return.all to FALSE.

I believe it's due to this part of the fastLink code:

if (identical(dfA, dfB)) {
    cat("dfA and dfB are identical, assuming deduplication of a single data set.\nSetting return.all to FALSE.\n\n")
    dedupe.matches <- FALSE
    return.all <- FALSE
    dedupe.df <- TRUE
  }

Potential solution

So, one potential solution I can think of is to copy the fastLink code into my own function and just comment out that section of code. Would this break something later that I'm not thinking of (i.e., I'm assuming there was a good reason to include that check in the first place)? Can you think of a better solution?

Also, for the sake of completeness I'll go ahead and outline the rest of the process and challenges for my particular use case – just in case it’s useful to you or others.

View matches

Next, I want to manually review matches/non-matches for false positives and false negatives. In a large data set, I would probably want to do this for some small subset in the middle range of posterior probabilities (i.e., I trust that extremely high posteriors are true matches and extremely low posteriors are true non-matches). Then, we can find the posterior cut-off to use that results in the need for the least amount of manual data cleaning.

df_dedupe <- fastLink::getMatches(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  fl.out = fastlink_out
)
screen shot 2019-01-06 at 9 33 16 am

I can't get getMatches to work, so I made my own function:

fmr_fastlink_stack_matches <- function(.x, df) {

  # ===========================================================================
  # Error checks
  # ===========================================================================
  if (!("fastLink" %in% class(.x))) {
    stop('.x must be of class "fastLink"')
  }

  # ===========================================================================
  # Create data frame of potential matches to compare
  # ===========================================================================
  potential_matches <- tibble::tibble(
    row = .x$matches$inds.b,
    matching_row = .x$matches$inds.a,
    posterior_probability = .x$posterior
  )

  # Keep only unique combinations of rows (e.g., not 1-3 and 3-1)
  potential_matches <- potential_matches %>%
    dplyr::mutate(
      combo = purrr::map2_chr(
        .x = row,
        .y = matching_row,
        .f = function(x, y) {
          min <- min(x, y)
          max <- max(x, y)
          out <- paste(min, max, sep = "_")
          out
        }
      ),
      dup = duplicated(combo)
    ) %>%
    dplyr::filter(!dup) %>%
    dplyr::select(-combo, -dup)

  # ===========================================================================
  # Manipulate the potential matches data frame
  # Stack row and matching row on top of each other
  # Add a pair number to each row and matching row
  # ===========================================================================
  stacked_potential_matches <- tibble::tibble(
    row = c(rbind(potential_matches[["row"]], potential_matches[["matching_row"]])),
    pair = rep(seq(1, length(row) / 2), each = 2),
    posterior_probability = rep(potential_matches[["posterior_probability"]], each = 2)
  )

  # ===========================================================================
  # Add substantive variables of interest to matched pairs for review
  # ===========================================================================
  out <- stacked_potential_matches %>% 
    dplyr::left_join(
      df %>% 
        dplyr::mutate(row = dplyr::row_number()),
      by = "row"
    )

  # ===========================================================================
  # Return data frame of potential matches to compare
  # ===========================================================================
  out
}
matches <- fmr_fastlink_stack_matches(fastlink_out, df_unique_combo)
matches
#> # A tibble: 28 x 12
#>      row  pair posterior_probability incident nm_first nm_last sex   birth_mnth birth_year add_num add_street group                        
#>    <dbl> <int>                 <dbl>    <dbl> <chr>    <chr>   <chr>      <dbl>      <dbl>   <dbl> <chr>      <chr>                        
#>  1     1     1                     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#>  2     1     1                     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#>  3     1     2                     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#>  4     3     2                     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#>  5     1     3                     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#>  6     5     3                     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#>  7     1     4                     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#>  8     6     4                     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#>  9     2     5                     1     1003 jane     smith   f              2       1937      14 elm        jane_smith_1937_2_14_elm     
#> 10     2     5                     1     1003 jane     smith   f              2       1937      14 elm        jane_smith_1937_2_14_elm     
#> 11     2     6                     1     1003 jane     smith   f              2       1937      14 elm        jane_smith_1937_2_14_elm     
#> 12     4     6                     1     1005 jane     smith   f              3       1937      14 elm        jane_smith_1937_3_14_elm     
#> 13     3     7                     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#> 14     3     7                     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#> 15     3     8                     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#> 16     5     8                     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#> 17     3     9                     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#> 18     6     9                     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#> 19     4    10                     1     1005 jane     smith   f              3       1937      14 elm        jane_smith_1937_3_14_elm     
#> 20     4    10                     1     1005 jane     smith   f              3       1937      14 elm        jane_smith_1937_3_14_elm     
#> 21     5    11                     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#> 22     5    11                     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#> 23     5    12                     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#> 24     6    12                     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#> 25     6    13                     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#> 26     6    13                     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#> 27     7    14                     1     1008 amy      jones   f              1       1947    1405 texas      amy_jones_1947_1_1405_texas  
#> 28     7    14                     1     1008 amy      jones   f              1       1947    1405 texas      amy_jones_1947_1_1405_texas  

Matches we wanted:

✅ fastLink matched john and jon

✅ fastLink matched jane and jane (with birth_mnth error)

✅ fastLink matched john, jon, and michael

Matches we don't want:

🚫 fastLink matched john and joy

🚫 fastLink matched jon and joy

🚫 fastLink matched michael and joy

Add person id

fmr_add_unique_id <- function(df, fastLink_obj) {

  # ===========================================================================
  # Create tibble of matching rows
  # ===========================================================================

  # Turn fastLink results into tibble of rows and their matches
  matches <- tibble::tibble(
    row = fastLink_obj$matches$inds.b,
    matching_row = fastLink_obj$matches$inds.a
  )

  # Nest all matches for each row
  matches <- matches %>% 
    dplyr::group_by(row) %>% 
    dplyr::mutate(matches = list(matching_row)) %>%
    dplyr::ungroup()

  # Reduce to unique sets of matching rows
  # i.e. 1,2,3 is the same as 3,2,1
  matches <- matches %>% 
    dplyr::mutate(matches = purrr::map_chr(
      matches,
      function(x) {
        x = sort(x) # 1,2,3 is the same as 3,2,1
        x = paste(x, collapse = ",") # Convert list to character string
        x
      })
    ) %>% 
    dplyr::select(matches) %>%
    distinct() # Reduce to 1 row per group of matches

  # Sequentially number each group of matches
  # This will become the unique id
  matches <- matches %>%
    dplyr::mutate(
      id = row_number(),
      row = purrr::map( # Turn back into list
        matches,
        ~ scan(text = ., what = 0L, sep = ",", quiet = TRUE)
      )
    ) 

  # Covert to data frame with the appropriate id number for each row in the 
  # original data set
  matches <- matches %>%
    tidyr::unnest() %>% 
    dplyr::select(id, row)

  # Join id number back to original data set
  out <- matches %>% 
    dplyr::right_join(
      df %>% 
        dplyr::mutate(row = dplyr::row_number()),
      by = "row"
    ) %>% 
    select(-row)

  # ===========================================================================
  # Return original data frame with unique id added
  # ===========================================================================
  out
}
df_unique_combo_w_id <- fmr_add_unique_id(df_unique_combo, fastlink_out)
df_unique_combo_w_id
#> # A tibble: 7 x 10
#>      id incident nm_first nm_last sex   birth_mnth birth_year add_num add_street group                        
#>   <int>    <dbl> <chr>    <chr>   <chr>      <dbl>      <dbl>   <dbl> <chr>      <chr>                        
#> 1     1     1001 john     smith   m              9       1936     101 main       john_smith_1936_9_101_main   
#> 2     2     1003 jane     smith   f              2       1937      14 elm        jane_smith_1937_2_14_elm     
#> 3     1     1004 jon      smith   m              9       1936     101 main       jon_smith_1936_9_101_main    
#> 4     2     1005 jane     smith   f              3       1937      14 elm        jane_smith_1937_3_14_elm     
#> 5     1     1006 joy      smith   f              8       1941     101 main       joy_smith_1941_8_101_main    
#> 6     1     1007 michael  smith   m              9       1936     101 main       michael_smith_1936_9_101_main
#> 7     3     1008 amy      jones   f              1       1947    1405 texas      amy_jones_1947_1_1405_texas  

✅ John, Jon, and Micheal have id == 1

✅ Jane has id == 2

✅ Amy has id == 3

🚫 Joy has id == 1

df_unique_combo_w_id <- df_unique_combo_w_id %>% 
  mutate(id = if_else(nm_first == "joy", 4L, id))

Add id to full data

Finally, join the unique id numbers back to full data set

df_w_id <- df %>% 
  dplyr::left_join(
    df_unique_combo_w_id %>% 
      dplyr::select(id, group), 
    by = "group") %>% 
  dplyr::select(id, dplyr::everything(), -group)

df_w_id
#> # A tibble: 8 x 9
#>      id incident nm_first nm_last sex   birth_mnth birth_year add_num add_street
#>   <int>    <dbl> <chr>    <chr>   <chr>      <dbl>      <dbl>   <dbl> <chr>     
#> 1     1     1001 john     smith   m              9       1936     101 main      
#> 2     1     1002 john     smith   m              9       1936     101 main      
#> 3     2     1003 jane     smith   f              2       1937      14 elm       
#> 4     1     1004 jon      smith   m              9       1936     101 main      
#> 5     2     1005 jane     smith   f              3       1937      14 elm       
#> 6     4     1006 joy      smith   f              8       1941     101 main      
#> 7     1     1007 michael  smith   m              9       1936     101 main      
#> 8     3     1008 amy      jones   f              1       1947    1405 texas

And, now I have the desired results.

Sorry for the extremely long response. I just thought that you (or others) may be interested in seeing other ways that fastLink is being used in the wild. Also, I'd still appreciate your feedback on the Potential Solution I proposed above.

Thank you! Brad

tedenamorado commented 5 years ago

Hi Brad,

Thanks a lot for posting such a detailed example of how to use fastLink to find duplicates within a dataset and create a unique ID for those observations that have been duplicated. We can only hope that fastLink becomes of the applications you are working on.

Regarding your example, I think once you block (subset) the data by gender group, then most of the problems get solved in terms of matching the right units. Subsetting the data for males would take away the discriminatory power of the variables used to merge as first name would carry all the weight (it is the only one that varies).

Previously, we have addressed a similar question, please see issue #29. I think the data used in #29 would be a better test of your code -- for which I have little to add as it fits the problem at hand quite well.

Looking forward to hearing how fastLink fares in your applications and other suggestions you may have.

If anything, do not hesitate to let us know.

All my best,

Ted

mbcann01 commented 5 years ago

Thank you very much for the response, Ted. I'll take a look at #29 and give it another shot.