Helper package for matching individuals across two datasets. This R package has been developed by the NHS Business Services Authority Data Science team.
The package is a series of functions that process and then match person information fields across two datasets, either in data frames or database tables.
The matching functions will identify whether a record containing person information can be matched to any records in a second dataset, either based on identical person information or a close match based on similar information.
The matching process focuses on comparing individuals based on four key pieces of information:
The quality of matching accuracy will be heavily influenced by the formatting of the input data, and there are processing functions available within the package to support this (see data preparation).
The personMatchR package has different functions available to handle matching whether the input data is held within data frames or via a connection to database tables:
You can install the development version of personMatchR from the NHSBSA Data Analytics GitHub with:
# install.packages("devtools")
devtools::install_github("nhsbsa-data-analytics/personMatchR")
In addition to function help files, some additional documentation has been included to provide detailed information about the package functions and some examples of the package in use:
The datasets being matched each require a forename, surname, DOB and postcode field to be present. In addition, each dataset also requires a unique identification field to be present. Users will have to generate such a field prior to using the matching function if one is not already present.
The following basic example shows how to match between two data-frames that are available within the documentation folder for this package.
df_A <- personMatchR::TEST_DF_A
head(df_A)
## ID SURNAME FORENAME POSTCODE DOB
## 1 1 O'Brien Richard AA9A 9AA 1942-03-25
## 2 2 Bloggs Joe A9A 9AA 1999-01-01
## 3 3 Watson David A9 9AA 2001-11-01
## 4 4 Neville Dylan A99 9AA 1941-05-24
df_B <- personMatchR::TEST_DF_B
head(df_B)
## ID SURNAME FORENAME POSTCODE DOB
## 1 1 O Brien Richard AA9A 9AA 1942-03-25
## 2 2 Bloggs Joseph A9A 9AA 1999-01-01
## 3 3 Brown John Z1 1ZZ 2001-11-01
## 4 4 Dylan Neville A99 9AA 1941-05-24
With such a small set of data it is easy to manually compare these two datasets, where it is clear that records are similar between both datasets:
When matching these datasets we would hope that records 1, 2 and 4 are matched.
We can pass the datasets to the calc_match_person function and review the output. For this example we will set parameters to only return the key fields from the matching, format the data prior to matching and include records without a match in the output:
library(personMatchR)
library(dplyr)
df_output <- personMatchR::calc_match_person(
df_one = df_A, # first dataset
id_one = ID, # unique id field from first dataset
forename_one = FORENAME, # forename field from first dataset
surname_one = SURNAME, # surname field from first dataset
dob_one = DOB, # date of birth field from first dataset
postcode_one = POSTCODE, # postcode field from first dataset
df_two = df_B, # second dataset
id_two = ID, # unique id field from second dataset
forename_two = FORENAME, # forename field from second dataset
surname_two = SURNAME, # surname field from second dataset
dob_two = DOB, # date of birth field from second dataset
postcode_two = POSTCODE, # postcode field from second dataset
output_type = "key", # only return the key match results
format_data = TRUE, # format input datasets prior to matching
inc_no_match = TRUE # return records from first dataset without matches
)
## # A tibble: 4 × 5
## DF1_INPUT_ID DF2_INPUT_ID MATCH_TYPE MATCH_COUNT MATCH_SCORE
## <chr> <chr> <chr> <dbl> <dbl>
## 1 2 2 Confident 1 0.96
## 2 1 1 Exact 1 1
## 3 4 4 Exact 1 1
## 4 3 <NA> No Match 0 0
The match results show exact matches for records 1 and 4 as the only differences were special characters and transposition of names. For record 2 the results show a confident match, as although not identical the names were similar enough to pass the confidence thresholds. As expected, record 3 does not produce any matches.
These fields in the output provide context for the match results:
In the example above the input data was passed through some formatting functions as part of the main matching package function call (format_data = TRUE). This option is only available when matching across data frames using the calc_match_person() function. However, the formatting functions could be called individually prior to calling the matching function.
There are three matching functions available, with different versions available for the database matching function:
The following code shows how these functions could be used to format the data prior to matching:
df_A = df_A %>%
format_date(date = DOB) %>%
format_name(name = FORENAME) %>%
format_name(name = SURNAME) %>%
format_postcode(id = ID, postcode = POSTCODE)
head(df_A)
## # A tibble: 4 × 5
## ID SURNAME FORENAME DOB POSTCODE
## <chr> <chr> <chr> <chr> <chr>
## 1 1 OBRIEN RICHARD 19420325 AA9A9AA
## 2 2 BLOGGS JOE 19990101 A9A9AA
## 3 3 WATSON DAVID 20011101 A99AA
## 4 4 NEVILLE DYLAN 19410524 A999AA
The calc_match_person_db() function does not offer the option to format the data prior to matching, with users required to carry out the processing as above beforehand.
It is strongly encouraged that users create new database tables after processing these fields. They can then use these freshly created tables as the matching function input, which will typically significantly improve runtime performance.
This parameter will determine the number of fields from each dataset returned in the output:
This parameter will determine whether or not the data is formatted as it is passed to the matching function. Formatting the data can help ensure both datasets are consistently formatted, accounting for things like case, removal of special characters, date of birth and postcode patterns. As formatted data is likely to have better matching outcomes it is strongly advised to apply this option.
This parameter will determine whether or not the output results will include details of records where no match could be found:
This parameter will determine whether or not the output results will include unique combinations or both versions of a potential match (e.g A=B & B=A).
The MATCH_COUNT value will represent the number of matches remaining after the data has been limited to unique combinations of potential matches.