runapp-aus / strayr

A catalogue of ready-to-use ABS coding structures. Package documentation can be found here: https://runapp-aus.github.io/strayr/
45 stars 14 forks source link

`parse_income_range` is too slow #68

Open wfmackey opened 2 years ago

wfmackey commented 2 years ago

The parse_income_range function is very inefficient/slow, and becomes unusable for larger (>1m) rows.

bennywee commented 2 years ago

@wfmackey do you have a minimum reproducible example using the datasets in strayr (or any reproducible example)?

Happy to look into this.

wfmackey commented 2 years ago

Sure thing @bennywee , thanks! The way I implemented it (and as it stands now) is slow because it doesn't use vectorised versions of str_ functions but rather purrr::map_ functions. It's so bad that I wrote an alternative function for a project I am doing at the moment which is ~200 times faster. A reprex is below for the current parse_income_range function (11.2 seconds for just 10e4 values(!!)) and the alternative (0.05 seconds).

The alternative version as it is now won't work for limit = "mid", and hasn't been tested on non-Census 2016 values.

library(microbenchmark)
library(tidyverse)
library(strayr)

# Make data --------------------------------------------------------------------

census_income_levels <- c(
  "Negative income",
  "Nil income",
  "$1-$149 ($1-$7,799)",
  "$150-$299 ($7,800-$15,599)",
  "$300-$399 ($15,600-$20,799)",
  "$400-$499 ($20,800-$25,999)",
  "$500-$649 ($26,000-$33,799)",
  "$650-$799 ($33,800-$41,599)",
  "$800-$999 ($41,600-$51,999)",
  "$1,000-$1,249 ($52,000-$64,999)",
  "$1,250-$1,499 ($65,000-$77,999)",
  "$1,500-$1,749 ($78,000-$90,999)",
  "$1,750-$1,999 ($91,000-$103,999)",
  "$2,000-$2,999 ($104,000-$155,999)",
  "$3,000 or more ($156,000 or more)",
  "Not stated",
  "Not applicable"
)

N <- 1e5

dummy_data <- tibble(raw_income = sample(census_income_levels, 
                                         size = N,
                                         replace = TRUE),
                     count = runif(N, 0, 1000))

# Alternative function (this wont work for limit = "mid") ----------------------
parse_income_range2 <- function(x, limit, 
                                max_income = 5000, 
                                annualise = TRUE) {

  if (limit == "lower")  reg <- "\\$[0-9]*"
  if (limit == "upper") reg <- "-\\$[0-9]*"

  orig <- x

  x <- x %>% 
    str_remove_all(",") %>% 
    str_extract(reg) %>% 
    str_remove("-?\\$") %>% 
    as.numeric() %>% 
    if_else(str_detect(orig, "(Negative)|(Nil)"), 0, .)

  if (limit == "upper") x <- if_else(str_detect(orig, "or more"), 5000, x)

  if (isTRUE(annualise)) x <- x * 52
  if (isTRUE(annualise) & limit == "upper") x <- ceiling(x/100) * 100

  return(x)
}

# Test -------------------------------------------------------------------------

microbenchmark(
  strayr = dummy_data %>% 
  mutate(income_lower = parse_income_range(raw_income, "lower")),
  alternative = dummy_data %>% 
    mutate(income_lower = parse_income_range2(raw_income, "lower")),
times = 1)
#> Unit: milliseconds
#>         expr         min          lq        mean      median          uq
#>       strayr 105148.0665 105148.0665 105148.0665 105148.0665 105148.0665
#>  alternative    195.3662    195.3662    195.3662    195.3662    195.3662
#>          max neval
#>  105148.0665     1
#>     195.3662     1

Created on 2021-11-12 by the reprex package (v2.0.1)

bennywee commented 2 years ago

Thanks @wfmackey this is great. I've had a look at both your function and the parsed_income_range function.

I noticed that both functions perform the string manipulation on each element of the raw_income column. The raw_income column has many repeated values in the dummy_data (and I assume also in real data where this transformation is being used).

I think we can speed up the parsed_income_range function without changing the underlying transformations/logic by reducing the number of times the function has to make a calculation. That is, perform the string transformation only once for each income level as opposed to doing it repeatedly.

A rough idea is to take the unique income levels, perform the transformation, then join it back to the original dataframe. I've used your alternative function as an example below.

I tried going this with the original function but I'm having a bit more trouble with getting it to work with purr. I'm hoping to preserve the overall logic of the function without re-writing too much - but happy to hear your feedback on this approach!

parse_income_range_bw <- function(x, limit, 
                                 max_income = 5000, 
                                 annualise = TRUE) {

   orig <- as.data.frame(x)
   colnames(orig) = c('raw_income')
   x <- x %>% unique()

  if (limit == "lower")  reg <- "\\$[0-9]*"
  if (limit == "upper") reg <- "-\\$[0-9]*"

  parsed_income <- x %>% 
    str_remove_all(",") %>% 
    str_extract(reg) %>% 
    str_remove("-?\\$") %>% 
    as.numeric() %>% 
    if_else(str_detect(., "(Negative)|(Nil)"), 0, .)

  if (limit == "upper") parsed_income <- if_else(str_detect(x, "or more"), 5000, parsed_income)
  if (isTRUE(annualise)) parsed_income <- parsed_income * 52
  if (isTRUE(annualise) & limit == "upper") parsed_income <- ceiling(parsed_income/100) * 100

  # Create lookup table 
  lookup <- as.data.frame(cbind(x, parsed_income))

  # Join lookup table to original column and select parsed column
  x2 <- orig %>% 
    left_join(lookup, by = c('raw_income' = 'x')) %>% 
    select(parsed_income)

  return(x2)
}

# > microbenchmark(
# +   strayr = dummy_data %>% 
# +     mutate(income_lower = strayr::parse_income_range(raw_income, "lower")),
# +   alternative = dummy_data %>% 
# +     mutate(income_lower = parse_income_range2(raw_income, "lower")),
# +   bw = dummy_data %>% 
# +     mutate(income_lower = parse_income_range_bw(raw_income, "lower")),
# +   times = 1)
# Unit: milliseconds
#         expr        min         lq       mean     median         uq        max neval
#       strayr 62926.9909 62926.9909 62926.9909 62926.9909 62926.9909 62926.9909     1
#  alternative   234.6231   234.6231   234.6231   234.6231   234.6231   234.6231     1
#           bw    15.4436    15.4436    15.4436    15.4436    15.4436    15.4436     1
bennywee commented 2 years ago

So instead of re-writing parse_income_range, I created a wrapper function that follows the same logic as above. I ran the tests found in test-parse_income.R using the parse_income_wrapper function which all passed. The results below show a speed improvement too!

parse_income_wrapper <- function(income_range,
                     limit = "lower",
                     max_income = Inf,
                     is_zero = c("Nil income"),
                     negative_as_zero = TRUE,
                     dollar_prefix = "$",
                     .silent = TRUE){

  orig <- as.data.frame(income_range)
  colnames(orig) = c('raw_income')

  lookup <- orig %>% 
    unique() %>%
    mutate(transformed = strayr::parse_income_range(income_range = raw_income,
                                                    limit,
                                                    max_income,
                                                    is_zero,
                                                    negative_as_zero,
                                                    dollar_prefix,
                                                    .silent)) 

  x1 <- orig %>% 
    left_join(lookup, by = 'raw_income') %>% 
    select(transformed) %>% 
    unlist() %>% 
    as.vector()

  return(x1)
}

#> microbenchmark(
# +   strayr = dummy_data %>% 
# +     mutate(income_lower = strayr::parse_income_range(raw_income, "lower")),
# +   alternative = dummy_data %>% 
# +     mutate(income_lower = parse_income_range2(raw_income, "lower")),
# +   wrapper_bw = dummy_data %>% 
# +     mutate(income_lower = parse_income_wrapper(raw_income)),
# +   times = 1)
# Unit: milliseconds
#         expr        min         lq       mean     median         uq        max neval
#       strayr 61122.2708 61122.2708 61122.2708 61122.2708 61122.2708 61122.2708     1
#  alternative   245.8460   245.8460   245.8460   245.8460   245.8460   245.8460     1
#   wrapper_bw   106.9508   106.9508   106.9508   106.9508   106.9508   106.9508     1
bennywee commented 2 years ago

@wfmackey - curious about your thoughts on this. @cynthiahqy and I were having a discussion about implementing this solution.

The question we discussed was how should this wrapper function be implemented in the package (or at all). parse_income_range works as intended, so there is nothing technically wrong here. On the other hand, there is an opportunity to improve the speed of the function, but we feel unsure how much of the work should be automated by the package since it performs operations unknown to the user. We discussed some options below:

1) Write a vignette on how best to use the current parse_income_range function to get the best performance. That is, get the user to create the lookup table themselves so they are fully aware of the data manipulations occurring.

2) Write a separate wrapper function as in the above example (parse_income_wrapper)

3) Have the current parse_income_range default to creating the lookup table and performing the distinct() and the join automatically.

wfmackey commented 2 years ago

This is great @bennywee and @cynthiahqy, and sorry for the slow reply to your initial suggestion! (we should really look into a workflow that isn't held up by my absence -- sorry).

The lookup table + _join approach makes much more sense! Good idea. We should implement that.

The original parse_income_range function is terribly implemented as it is (I can say that because I wrote it lol). I think it gets tripped up by trying to allow for mid-range calculations for each element, which slows everything else down.

I agree that we should err on the side of caution about performing tasks that are (and may remain) unknown to the user. Being conservative here is the preferred approach for this package -- eg better to have max_income = Inf and forcing users to provide amax_income rather than defaulting to some real numeric value.

Note that I don't see how your new implementation adds to the behind-the-scenes changes/heuristics compared to the original parse_income_range? (I may well be missing something!) So not sure what your point (2) would add here.

Either way, your ideas for transparency are useful. Re (1) a vignette is always good -- and we should definitely include one in the strayr pkgdown site! -- but we shouldn't assume anyone will read it. Maybe we should follow the approach of (eg) read_csv by providing a clear message of the conversion table after the function is performed. For example, adding a message to the parse_income_range_bw function:

...

  print_lookup <- paste0("  ",
    lookup$x, "\t\twith\t", lookup$parsed_income, "\n"
  )

  message("Replacing:\n", 
          "  character",
          "\t\twith\t", 
          "numeric, ", limit, " limit", 
          {if (isTRUE(annualise)) ", annualised (*52)"},
          "\n", 
          print_lookup)
...

would give:

> parse_income_range_bw(incomes, limit = "upper")
Replacing:
  character     with    numeric, upper limit, annualised (*52)
  $1-$149       with    7800
  $150-$300     with    15600
  $3000 or more     with    260000

[1]   7800  15600 260000

Maybe this message (warning?) could be turned off with .silent, or with a global option. But it would force the user to understand that conversion that was happening.

What do you @bennywee and @cynthiahqy think? If you're happy with this rough approach and you'd be happy to throw it into a PR (as a replacement for the current parse_income_range), that would be great! Otherwise I'm happy to do it at some point later.

Excellent ideas -- I am very much looking forward to using the new function soon!

HughParsonage commented 2 years ago

Consider the following pure R function for the lower limit:

do_parse_income <- function(x, ux = NULL) {
  if (is.null(ux)) {
    ux <- unique(x)
  }
  unx <- gsub("[$,]", "", ux)
  u_out <- suppressWarnings(as.integer(sub("^[^0-9]*([0-9]+)[^0-9]*.*$", "\\1", unx)))
  u_out[match(x, ux)]
}

On my machine it is about 5x faster than the fastest solution above and 15,000 times faster than the current function. To enable the features provided by the other arguments would be a matter of switching the sub pattern and perhaps adding a few lines like u_out[<cond>] <- <value> at the end.

Crucially: a regex should suffice for extracting the numbers and may in fact make it easier to explain in the documentation; and using match instead of _join almost always makes the code faster and more concise.