nhsbsa-data-analytics / personMatchR

Helper package for matching individuals across two datasets
Apache License 2.0

Include warning when choosing formatting parameter in db matching function #27

Closed steven-buckley closed 2 years ago

steven-buckley commented 2 years ago

We have found that running the formatting as part of the calc_match_patients_db() function can cause performance problems.

The dbplyr code stacks as part of a lazy query and tries to run all parts when the data is collected. This seems to create some unusual queries when doing both formatting and matching as part of the same run.
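To illustrate the stacking behaviour, here is a minimal sketch using a simulated dbplyr table (no database connection needed). The table and column names are made up for the example and are not taken from personMatchR; the formatting and matching steps are simple stand-ins for the real ones.

```r
library(dplyr)
library(dbplyr)

# Hypothetical simulated remote table; nothing is executed against a database.
patients <- lazy_frame(forename = "a", surname = "b", con = simulate_dbi())

# Stand-in "formatting" step: only recorded in the lazy query, not run yet.
formatted <- patients %>%
  mutate(
    forename = toupper(forename),
    surname = toupper(surname)
  )

# Stand-in "matching" step stacked on top of the formatting step.
matched <- formatted %>%
  filter(forename == surname)

# Print the single SQL statement that would be sent when the data is collected.
show_query(matched)
```

Every step added before collect() ends up inside that one generated statement, which is why formatting and matching in the same run can produce queries that are awkward for the database to execute.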

Run a few test cases of different sizes to measure the performance, then incorporate some form of warning if the user chooses to format when we think it may pose an issue: "Warning: Formatting as part of the matching may have a big impact on performance on large datasets and therefore we would recommend running the formatting functions on the dataset prior to the matching function. Abort function call (Y/N)?"
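A rough sketch of what such a check could look like, using base R only. The helper name `confirm_inline_formatting()` and its arguments `format_data` and `ask` are hypothetical and not part of the package; the message text is shortened from the wording above.

```r
# Hypothetical helper: names and arguments are illustrative only.
# Returns TRUE if the matching should proceed, FALSE if the user aborts.
confirm_inline_formatting <- function(format_data, ask = interactive()) {
  # Nothing to warn about if formatting was not requested.
  if (!isTRUE(format_data)) {
    return(TRUE)
  }

  warning(
    "Formatting as part of the matching may have a big impact on ",
    "performance on large datasets. We recommend running the formatting ",
    "functions on the dataset prior to the matching function.",
    call. = FALSE
  )

  # In non-interactive sessions (scripts, jobs) there is nobody to prompt.
  if (!ask) {
    return(TRUE)
  }

  answer <- readline("Abort function call (Y/N)? ")
  !(toupper(trimws(answer)) %in% c("Y", "YES"))
}
```

The matching function could call this first and return early when it gives FALSE. Prompting only in interactive sessions keeps scheduled runs from hanging on the question.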

steven-buckley commented 2 years ago

Test cases suggest that calling the formatting functions as part of the main matching function has a huge impact on performance.

Even with a tiny basic dataset of just 10 records, running the formatting function pushes run time from ~8s to ~200s.

Rather than just recommending that users format outside of the matching function, the better course of action would be to remove the option entirely.

steven-buckley commented 2 years ago

Closed as part of pull request 29