openjusticeok / ojoregex

A separate package for maintaining the regex patterns that we use in our data normalization pipeline.
https://openjusticeok.github.io/ojoregex/
GNU General Public License v3.0

Replace str_ functions with stri_ from {stringi} #14

Closed by brancengregory 3 months ago

brancengregory commented 5 months ago

Benchmarks show that run time for string detection drops by half. We also get a small performance boost by specifying case insensitivity in the regex pattern itself with (?i).

andrewjbe commented 3 months ago

Replaced the main string detection code as follows:

    apply_regex_pattern <- function(data, flag, regex_pattern) {
      # Note: col_to_clean is not an argument here; it has to be available from the enclosing environment.
      data |>
        # Old {stringr} version:
        # dplyr::mutate(
        #   !!flag := stringr::str_detect(!!dplyr::sym(col_to_clean),
        #                                 stringr::regex(regex_pattern, ignore_case = TRUE))
        # )
        # New {stringi} version:
        dplyr::mutate(
          !!flag := stringi::stri_detect(str = !!dplyr::sym(col_to_clean),
                                         regex = paste0("(?i)", regex_pattern)) # Case insensitive
        )
    }

It doesn't really seem to save a ton of time but the results are exactly the same (checked with setdiff()).
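
For anyone who wants to repeat the comparison, here is a minimal sketch of the equivalence check on a plain character vector. The sample strings and the pattern are made up for illustration and are not taken from the package, and it uses identical() on the logical vectors rather than the setdiff() check mentioned above. The timing is only a rough indication and will vary by machine and input.

```r
# Hypothetical sample data and pattern, for illustration only.
charges <- c("Possession of CDS", "possession of cds", "BURGLARY II", NA, "Larceny")
regex_pattern <- "poss.* of cds"

# Old {stringr} approach: case insensitivity passed as a regex() option.
old_result <- stringr::str_detect(
  charges,
  stringr::regex(regex_pattern, ignore_case = TRUE)
)

# New {stringi} approach: case insensitivity set inline with (?i).
new_result <- stringi::stri_detect(
  str = charges,
  regex = paste0("(?i)", regex_pattern)
)

# Both should agree element by element, including the NA.
results_match <- identical(old_result, new_result)
print(results_match)

# Rough timing on a larger, repeated vector (numbers depend on the machine).
big <- rep(charges, 20000)
print(system.time(stringr::str_detect(big, stringr::regex(regex_pattern, ignore_case = TRUE))))
print(system.time(stringi::stri_detect(str = big, regex = paste0("(?i)", regex_pattern))))
```

If results_match is FALSE for a real pattern, comparing the two vectors with which(old_result != new_result) is one way to find the rows that differ.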