Large speedup in `countrycode()`

etiennebacher commented 11 months ago

~Refactor of the way we search for regex matches on each input. Instead of running one grepl() per regex, I collapse all regexes into a single one by putting them into capture groups. Then, a single run of gregexpr() is needed and we can extract the matches from the information on the capture groups. I put a bit more explanations in the code, but you can also refer to this StackOverflow post.~

Instead of using nested sapply(), we can pass the whole vector of inputs (country names, iso, etc.) to grepl() directly.

There was also one ifelse() that took a lot of time but isn't needed when we only have one destination.

This leads to a very large speedup:

library(bench)

out <- cross::run(
  pkgs = c("vincentarelbundock/countrycode",
  "etiennebacher/countrycode@large-speedup"),
  ~{
    library(countrycode)

    test <- data.frame(
      origin = sample(codelist$country.name.en, 1e7, TRUE),
      destination = sample(codelist$country.name.en, 1e7, TRUE)
    )

    bench::mark(
      countrycode(test$origin, "country.name", "iso3c"),
      iterations = 10
    )
  }
)

tidyr::unnest(out, result) |>
  dplyr::select(pkg, median, mem_alloc) |>
  dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))
#> # A tibble: 2 × 3
#> pkg median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 main 12.1s 1.17GB
#> 2 fork 2.23s 1.32GB

vincentarelbundock commented 11 months ago

Oooh, this looks awesome, thanks!

Frankly, it's a bit scary too, so I'll want to play with it a bit before merging. Unfortunately, it's the start of semester, so I may not get to this for a little while.

If

vincentarelbundock commented 11 months ago

If @cjyetman has time and interest (no pressure), he could also merge this after some testing.

etiennebacher commented 11 months ago

Frankly, it's a bit scary too, so I'll want to play with it a bit before merging

Yup, completely agree, the last thing you want is realize later that codes and names don't match. I'll see later if I can test more extensively

etiennebacher commented 11 months ago

FYI an improvement that makes the code much simpler was suggested to me on SO, I'll modify the PR

etiennebacher commented 11 months ago

New timing after the simplification:

#> # A tibble: 1 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 fork     1.18s     742MB

etiennebacher commented 11 months ago

@vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

cjyetman commented 11 months ago

@vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass?

I cannot (enable CI on commits from forks), but I have approved the CI to run on your most recent push.

etiennebacher commented 11 months ago

For information, if you're ok with adding a dependency on data.table (to have access to chmatch()) then it's possible to go down to 750ms and 550MB for the example above

vincentarelbundock commented 11 months ago

Got it.

I think it's best to stay dependency-free, though

vincentarelbundock commented 11 months ago

Looks like all the tests pass. I guess that's better than whatever impressionistic sense I could get from playing with different things. What do you think @cjyetman , should I just merge this?

cjyetman commented 11 months ago

I'll try to take a look at it today.

etiennebacher commented 11 months ago

I ran revdepcheck on the 31 reverse dependencies (both Imports and Suggests) and I found no new errors

cjyetman commented 11 months ago

@etiennebacher can you explain what cross::run() is so that I can replicate your example?

etiennebacher commented 11 months ago

It allows us to run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning and then I give the code to run. cross::run() will automatically install both versions in a temp location, run the code for each version separately, and output the result in a nested table that I unnest at the end.

https://github.com/DavisVaughan/cross

cjyetman commented 11 months ago

This seems to work as expected, tests pass, build passes, so that seems ok. I can't exactly replicate the example because I don't know where cross::run() comes from, but loading different versions of countrycode in different sessions does seem to show a significant speed improvement with this example.

A few suggestions:

I would strongly prefer to see minor whitespace changes on unrelated lines removed from this PR to make it easier to review.
The ifelse() change starting on L207 I would prefer to see done in a separate PR because it's conceptually a separate idea/improvement.

vincentarelbundock commented 11 months ago

Thanks a ton for the review @cjyetman, I appreciate your time.

I also agree with the comments about white space and separate PRs for different concepts.

That said, it's probably fine to merge this now in order to avoid more busywork for Etienne. So @etiennebacher, if you want to add your name as a contributor (your choice), I can merge the PR.

A 5x speedup is nothing to scoff at. Very nice! Thanks!

cjyetman commented 11 months ago

It allows us to run the same code automatically with two (or more) versions of the same package. I specified the main branch and my fork at the beginning and then I give the code to run. cross::run() will automatically install both versions in a temp location, run the code for each version separately, and output the result in a nested table that I unnest at the end.

https://github.com/DavisVaughan/cross

Thanks for the link, I was not finding {cross} on CRAN.

cjyetman commented 11 months ago

Note that if multiple destinations are used, the speed-up ~disappears~ is reduced and the memory usage is the same, but they still produce the same result:

  library(bench)

  out <- cross::run(
    pkgs = c("vincentarelbundock/countrycode",
             "etiennebacher/countrycode@large-speedup"),
    ~{
      library(countrycode)

      sourcevar <- sample(codelist$country.name.en, 1e7, TRUE)

      bench::mark(
        countrycode(sourcevar, "country.name", c("iso3c", "iso3n")),
        iterations = 10
      )
    }
  )

  tidyr::unnest(out, result) |>
    dplyr::select(pkg, median, mem_alloc) |>
    dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))

#> # A tibble: 2 × 3
#>   pkg     median mem_alloc
#>   <chr> <bch:tm> <bch:byt>
#> 1 main     5.95s    2.45GB
#> 2 fork     4.56s    2.45GB

vincentarelbundock commented 11 months ago

Thanks to both of you (and especially Etienne!). Merged!

vincentarelbundock / countrycode

Large speedup in `countrycode()` #341