nhsbsa-data-analytics / personMatchR

Helper package for matching individuals across two datasets
Apache License 2.0
0 stars 0 forks source link

personMatchR

Helper package for matching individuals across two datasets. This R package has been developed by the NHS Business Services Authority Data Science team.

The package is a series of functions that process and then match person information fields across two datasets, either in data frames or database tables.

The matching functions will identify whether a record containing person information can be matched to any records in a second dataset, either based on identical person information or a close match based on similar information.

The matching process focuses on comparing individuals based on four key pieces of information:

The quality of matching accuracy will be heavily influenced by the formatting of the input data, and there are processing functions available within the package to support this (see data preparation).

The personMatchR package has different functions available to handle matching whether the input data is held within data frames or via a connection to database tables:

Installation

You can install the development version of personMatchR from the NHSBSA Data Analytics GitHub with:

# install.packages("devtools")
devtools::install_github("nhsbsa-data-analytics/personMatchR")

Documentation

In addition to function help files, some additional documentation has been included to provide detailed information about the package functions and some examples of the package in use:

Requirements

The datasets being matched each require a forename, surname, DOB and postcode field to be present. In addition, each dataset also requires a unique identification field to be present. Users will have to generate such a field prior to using the matching function if one is not already present.

Example

The following basic example shows how to match between two data-frames that are available within the documentation folder for this package.

df_A <- personMatchR::TEST_DF_A
head(df_A)
##   ID SURNAME FORENAME POSTCODE        DOB
## 1  1 O'Brien  Richard AA9A 9AA 1942-03-25
## 2  2  Bloggs      Joe  A9A 9AA 1999-01-01
## 3  3  Watson    David   A9 9AA 2001-11-01
## 4  4 Neville    Dylan  A99 9AA 1941-05-24
df_B <- personMatchR::TEST_DF_B
head(df_B)
##   ID SURNAME FORENAME POSTCODE        DOB
## 1  1 O Brien  Richard AA9A 9AA 1942-03-25
## 2  2  Bloggs   Joseph  A9A 9AA 1999-01-01
## 3  3   Brown     John   Z1 1ZZ 2001-11-01
## 4  4   Dylan  Neville  A99 9AA 1941-05-24

With such a small set of data it is easy to manually compare these two datasets, where it is clear that records are similar between both datasets:

When matching these datasets we would hope that records 1, 2 and 4 are matched.

We can pass the datasets to the calc_match_person function and review the output. For this example we will set parameters to only return the key fields from the matching, format the data prior to matching and include records without a match in the output:

library(personMatchR)
library(dplyr)
df_output <- personMatchR::calc_match_person(
  df_one = df_A, # first dataset
  id_one = ID, # unique id field from first dataset
  forename_one = FORENAME, # forename field from first dataset
  surname_one = SURNAME, # surname field from first dataset
  dob_one = DOB, # date of birth field from first dataset
  postcode_one = POSTCODE, # postcode field from first dataset
  df_two = df_B, # second dataset
  id_two = ID, # unique id field from second dataset
  forename_two = FORENAME, # forename field from second dataset
  surname_two = SURNAME, # surname field from second dataset
  dob_two = DOB, # date of birth field from second dataset
  postcode_two = POSTCODE, # postcode field from second dataset
  output_type = "key", # only return the key match results
  format_data = TRUE, # format input datasets prior to matching
  inc_no_match = TRUE # return records from first dataset without matches
)
## # A tibble: 4 × 5
##   DF1_INPUT_ID DF2_INPUT_ID MATCH_TYPE MATCH_COUNT MATCH_SCORE
##   <chr>        <chr>        <chr>            <dbl>       <dbl>
## 1 2            2            Confident            1        0.96
## 2 1            1            Exact                1        1   
## 3 4            4            Exact                1        1   
## 4 3            <NA>         No Match             0        0

The match results show exact matches for records 1 and 4 as the only differences were special characters and transposition of names. For record 2 the results show a confident match, as although not identical the names were similar enough to pass the confidence thresholds. As expected, record 3 does not produce any matches.

Understanding match output: MATCH_COUNT & MATCH_SCORE

These fields in the output provide context for the match results:

Data preparation

In the example above the input data was passed through some formatting functions as part of the main matching package function call (format_data = TRUE). This option is only available when matching across data frames using the calc_match_person() function. However, the formatting functions could be called individually prior to calling the matching function.

There are three matching functions available, with different versions available for the database matching function:

The following code shows how these functions could be used to format the data prior to matching:

df_A = df_A %>% 
  format_date(date = DOB) %>% 
  format_name(name = FORENAME) %>% 
  format_name(name = SURNAME) %>% 
  format_postcode(id = ID, postcode = POSTCODE)
head(df_A)
## # A tibble: 4 × 5
##   ID    SURNAME FORENAME DOB      POSTCODE
##   <chr> <chr>   <chr>    <chr>    <chr>   
## 1 1     OBRIEN  RICHARD  19420325 AA9A9AA 
## 2 2     BLOGGS  JOE      19990101 A9A9AA  
## 3 3     WATSON  DAVID    20011101 A99AA   
## 4 4     NEVILLE DYLAN    19410524 A999AA

The calc_match_person_db() function does not offer the option to format the data prior to matching, with users required to carry out the processing as above beforehand.

It is strongly encouraged that users create new database tables after processing these fields. They can then use these freshly created tables as the matching function input, which will typically significantly improve runtime performance.

Match function parameter: output_type

This parameter will determine the number of fields from each dataset returned in the output:

Match function parameter: format_data (not used for database match functions)

This parameter will determine whether or not the data is formatted as it is passed to the matching function. Formatting the data can help ensure both datasets are consistently formatted, accounting for things like case, removal of special characters, date of birth and postcode patterns. As formatted data is likely to have better matching outcomes it is strongly advised to apply this option.

Match function parameter: inc_no_match

This parameter will determine whether or not the output results will include details of records where no match could be found:

Match function parameter: unique_combinations_only (self join only)

This parameter will determine whether or not the output results will include unique combinations or both versions of a potential match (e.g A=B & B=A).

The MATCH_COUNT value will represent the number of matches remaining after the data has been limited to unique combinations of potential matches.