pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

Include a performance vignette? #188

Closed by etiennebacher 8 months ago

etiennebacher commented 1 year ago

@sorhawell @eitsupi @vincentarelbundock @grantmcdermott I started a vignette on polars performance, not to compare it to other packages but rather to present a few "good practices" for using its full capabilities.

I'm still a beginner in polars and in data wrangling with larger-than-RAM data, so there might be some things to correct or complete here. Also, I mostly wrote this so that I have some explanations written down somewhere, and because it might be useful if I end up teaching this, but it doesn't have to be included as a vignette.

What do you think about this?

Close #176

eitsupi commented 1 year ago

Since it may be difficult to run benchmarks on CI, I think we need to investigate how other repositories include benchmarks in their articles. I don't know of many examples, but tidyverse/vroom (https://github.com/tidyverse/vroom) is one.

vincentarelbundock commented 1 year ago

May be relevant: https://github.com/pola-rs/tpch

sorhawell commented 1 year ago

May be relevant: https://github.com/pola-rs/tpch

I was playing around a bit with tpch around August last year, but a lot of features were missing in r-polars back then. Some of the test datasets did not build out of the box and required some manual fixing on my machine. Maybe tpch has become more ergonomic now.

I think a tpch benchmark would be the ultimate confirmation that r-polars is on par with py-polars.

sorhawell commented 1 year ago

@etiennebacher I'm very positive about this draft :)

vincentarelbundock commented 1 year ago

Maybe you can add something ultra simple but fun like this, and then point readers to the DuckDB benchmarks for more serious stuff. The idea would be to give readers an early "Wow!".

library(bench)
library(dplyr)
library(polars)
library(data.table)

N = 1e7
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)

df_dt = data.table(df)
df_pl = pl$DataFrame(df)

# comparison
bench::mark(
    "base" = by(df, df$letters, \(x) colMeans(x[, -26])),
    "dplyr" = df %>% group_by(letters) %>% summarise_all(mean),
    "data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
    "polars" = df_pl$groupby("letters")$mean(),
    check = FALSE,
    relative = TRUE
)
#   expression   min median `itr/sec` mem_alloc `gc/sec`
#   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
# 1 base       16.1   15.6       1        8241.      Inf
# 2 dplyr       9.66   9.35      1.67     3826.      Inf
# 3 data.table  2.29   2.26      6.92      634.      NaN
# 4 polars      1      1        15.5         1       NaN

grantmcdermott commented 1 year ago

I didn't want to spam everyone's inbox—so let me know if others would like to join too—but I invited @vincentarelbundock and @etiennebacher to a private repo that houses (an adapted subset of) benchmarks that I keep for myself on some common data tasks across a variety of languages and libraries. Feel free to poke around etc. I also have timings for larger datasets, but this ends up being a bottleneck for some languages (cough Stata cough).

grantmcdermott commented 1 year ago

I'm still a beginner in polars and in data wrangling with larger-than-RAM data so there might be some things to correct/complete here.

AFAIK r-polars does not support streaming yet. See the py-polars handbook for some simple examples. tl;dr: just end your query with collect(streaming=True).
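
To make the py-polars pattern concrete, here is a rough R sketch of what the equivalent could look like once r-polars supports streaming. The `scan_csv` method and the `streaming` argument are assumptions based on the py-polars API, not part of r-polars at the time of writing:

```r
library(polars)

# py-polars reference (from the handbook):
#   pl.scan_csv("big.csv").groupby("letters").agg(pl.col("x").mean()) \
#     .collect(streaming=True)
#
# Hypothetical r-polars equivalent:
result <- pl$scan_csv("big.csv")$   # lazy scan: nothing is read yet
  groupby("letters")$
  agg(pl$col("x")$mean())$
  collect()                         # would become collect(streaming = TRUE)
```

The key point is that streaming only applies to lazy queries: the engine needs to see the whole plan before it can decide how to process the data in batches.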

etiennebacher commented 1 year ago

Thanks all, FYI if you're interested feel free to push changes directly to this PR

eitsupi commented 1 year ago

I am reluctant to make comparisons with dplyr or data.table here. (Is there any reason to include dplyr and data.table but not Acero (arrow) and duckdb?)

vincentarelbundock commented 1 year ago

I am reluctant to make comparisons with dplyr or data.table here.

Well, one goal here is obviously to convince R users that it's worth it for them to try out polars. One way to do that is to show that it'll be faster than what they currently use, and 98% of R users currently rely on base, dplyr, or data.table. So in pure "marketing" terms, it seems pretty important to have this there. And if we clearly note that these are not extensive rigorous benchmarks, and point to the DuckDB page, then it is "honest" marketing that we can feel good about.

Is there any reason to include dplyr and data.table but not Acero (arrow) and duckdb?

No, I think those should be added too if it's easy. I would also be curious to know what people think the benefits of polars are over duckdb (assuming the performance is similar.)

eitsupi commented 1 year ago

For example, when I tried Acero and duckdb in my environment, I got the following results. However, duckdb converts its results to an R data.frame, so the comparison may not be entirely fair: Acero and polars would incur an additional cost when converting their results to a data.frame.

library(bench)
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
library(duckdb)
#> Loading required package: DBI
#>
#> Attaching package: 'duckdb'
#> The following object is masked from 'package:dplyr':
#>
#>     sql
library(polars)

N = 1e7
set.seed(1)
df = data.frame(matrix(runif(25 * N), nrow = N))
df$letters = sample(letters, N, replace = TRUE)

df_dt = data.table(df)
at = as_arrow_table(df)
df_pl = pl$DataFrame(df)

con = DBI::dbConnect(duckdb::duckdb(), ":memory:")
duckdb_register(con, "df", df)

# comparison
bench::mark(
    "data.table" = df_dt[, lapply(.SD, mean), by = "letters"],
    "Acero" = at |> group_by(letters) |> summarise(across(!letters, ~ mean(.x, na.rm = TRUE))) |> compute(),
    "duckdb" = duckdb::sql("FROM df SELECT letters, avg(COLUMNS(x -> NOT suffix(x, 'letters'))) GROUP BY letters", con),
    "polars" = df_pl$groupby("letters")$mean(),
    check = FALSE,
    relative = TRUE
)
#> # A tibble: 4 × 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 data.table  6.68   6.29      1      35566.       NaN
#> 2 Acero       1.39   1.31      4.81     821.       Inf
#> 3 duckdb      1.71   1.61      3.89       1        NaN
#> 4 polars      1      1         6.29      56.2      NaN

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21)
#>  os       Ubuntu 22.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2023-05-05
#>  pandoc   3.1.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 11.0.0.3 2023-03-08 [1] RSPM
#>  assertthat    0.2.1    2019-03-21 [1] RSPM
#>  bench       * 1.1.2    2021-11-30 [1] RSPM
#>  bit           4.0.5    2022-11-15 [1] RSPM (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.3.0)
#>  cli           3.6.1    2023-03-23 [1] RSPM
#>  data.table  * 1.14.8   2023-02-17 [1] RSPM
#>  DBI         * 1.1.3    2022-06-18 [1] RSPM (R 4.3.0)
#>  digest        0.6.31   2022-12-11 [1] RSPM
#>  dplyr       * 1.1.2    2023-04-20 [1] RSPM (R 4.3.0)
#>  duckdb      * 0.8.0    2023-05-05 [1] https://duckdb.r-universe.dev (R 4.3.0)
#>  evaluate      0.20     2023-01-17 [1] RSPM
#>  fansi         1.0.4    2023-01-22 [1] RSPM
#>  fastmap       1.1.1    2023-02-24 [1] RSPM
#>  fs            1.6.2    2023-04-25 [1] RSPM (R 4.3.0)
#>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] RSPM
#>  htmltools     0.5.5    2023-03-23 [1] RSPM
#>  knitr         1.42     2023-01-25 [1] RSPM
#>  lifecycle     1.0.3    2022-10-07 [1] RSPM
#>  magrittr      2.0.3    2022-03-30 [1] RSPM
#>  pillar        1.9.0    2023-03-22 [1] RSPM
#>  pkgconfig     2.0.3    2019-09-22 [1] RSPM
#>  polars      * 0.6.0    2023-05-04 [1] local
#>  profmem       0.6.0    2020-12-13 [1] RSPM
#>  purrr         1.0.1    2023-01-10 [1] RSPM
#>  R.cache       0.16.0   2022-07-21 [1] RSPM
#>  R.methodsS3   1.8.2    2022-06-13 [1] RSPM
#>  R.oo          1.25.0   2022-06-12 [1] RSPM
#>  R.utils       2.12.2   2022-11-11 [1] RSPM
#>  R6            2.5.1    2021-08-19 [1] RSPM
#>  reprex        2.0.2    2022-08-17 [1] RSPM
#>  rlang         1.1.0    2023-03-14 [1] RSPM
#>  rmarkdown     2.21     2023-03-26 [1] RSPM
#>  sessioninfo   1.2.2    2021-12-06 [1] RSPM
#>  styler        1.9.1    2023-03-04 [1] RSPM
#>  tibble        3.2.1    2023-03-20 [1] RSPM
#>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.3.0)
#>  utf8          1.2.3    2023-01-31 [1] RSPM
#>  vctrs         0.6.2    2023-04-19 [1] RSPM
#>  withr         2.5.0    2022-03-03 [1] RSPM
#>  xfun          0.39     2023-04-20 [1] RSPM
#>  yaml          2.3.7    2023-01-23 [1] RSPM
#>
#>  [1] /usr/local/lib/R/site-library
#>  [2] /usr/local/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2023-05-05 with reprex v2.0.2

eitsupi commented 1 year ago

Here is a rough overview of what I consider the pros and cons:

Acero

Pros

Cons

DuckDB

Pros

Cons

Polars

Pros

Cons

eitsupi commented 1 year ago

I really don't think it is a good idea to include a bare dplyr comparison here, because in my opinion dplyr users can achieve higher speeds by simply switching the backend to data.table (via dtplyr), Acero (via arrow), or duckdb (via dbplyr). (Note that these different backends are described in dplyr's README.)

vincentarelbundock commented 1 year ago

Why don't we just add a dtplyr example, then?
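
To make that concrete, a dtplyr entry for the grouped-mean benchmark above could look roughly like this (a sketch only; `lazy_dt()` routes the familiar dplyr verbs through data.table):

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(data.table)

N <- 1e7
df <- data.frame(matrix(runif(25 * N), nrow = N))
df$letters <- sample(letters, N, replace = TRUE)

df |>
  lazy_dt() |>                             # build a data.table translation lazily
  group_by(letters) |>
  summarise(across(everything(), mean)) |>
  as_tibble()                              # force execution and materialise
```

That would let the benchmark show dplyr users the speedup available without leaving the dplyr syntax.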

One other benefit of polars, I think, is the parallelism in syntax across R, python, and Rust, which facilitates multilingual projects and teams

tdhock commented 1 year ago

I would recommend using asymptotic benchmarks, which means measuring time and memory for data size N values increasing on a log scale. I have a package, https://github.com/tdhock/atime, that makes this easy. These benchmarks are much more convincing than a single N, which can be misleading: the N that is relevant to test may depend on the particular problem and hardware, so it is much more informative and convincing to see results for several values of N.
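
As an illustration of the idea, atime runs each expression over a grid of sizes N and records time and memory until a time limit is hit. This is a sketch; argument names follow the atime README, and the compared expressions are arbitrary base-R examples:

```r
library(atime)

res <- atime::atime(
  N = as.integer(10^seq(2, 6, by = 0.5)),  # data sizes on a log scale
  setup = {
    df <- data.frame(
      x = runif(N),
      g = sample(letters, N, replace = TRUE)
    )
  },
  # expressions to compare, evaluated for each N:
  "tapply"    = tapply(df$x, df$g, mean),
  "aggregate" = aggregate(x ~ g, data = df, FUN = mean),
  seconds.limit = 0.1                      # stop a method once it exceeds this
)
plot(res)  # time/memory vs N on log-log axes
```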

etiennebacher commented 1 year ago

I don't think we should include benchmarks with data.table/dplyr/arrow, etc., or at least not in this vignette (so we could change its name).

To me, the objective of this vignette is not to compare polars to other packages or tools because our benchmarks will never be as comprehensive as those run by duckdb. Also, if we start doing this, then we must make a lot of choices (including those discussed above): should we count data reading in the timing? should we use keyed data.tables? should we compare to arrow, duckdb? how many observations should we keep? etc.

I think it's more important here to focus on how one can use polars' full capabilities, because that's not something the average R user knows (e.g. I guess few R users know the difference between eager and lazy execution). If we want to advertise the speed of polars, couldn't we just take a graph from duckdb's benchmarks?
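
For instance, the eager/lazy distinction such a vignette could illustrate looks roughly like this (a sketch, using the `groupby` spelling from the polars version used elsewhere in this thread):

```r
library(polars)

df <- pl$DataFrame(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5)
)

# Eager: each method call executes immediately.
eager_res <- df$groupby("g")$mean()

# Lazy: $lazy() builds a query plan instead of executing; polars can then
# optimise the whole pipeline before $collect() runs it once.
lazy_res <- df$lazy()$
  groupby("g")$
  agg(pl$col("x")$mean())$
  collect()
```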


@tdhock thanks for the link, I think bench::press() does something similar?

tdhock commented 1 year ago

Yes, bench::press does something similar; that is discussed in the Related work section of the atime README, https://github.com/tdhock/atime#related-work. bench::press is more flexible because it can do a multi-dimensional grid search (not only over a single size argument N, as atime does). However, it cannot store results if check=FALSE, results must be equal if check=TRUE, and there is no easy way to specify a time limit that stops measurement at larger sizes (like the seconds.limit argument in atime).
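
For comparison, the bench::press() pattern crosses a grid of parameters (here just N) with a bench::mark() call. A minimal sketch, with an arbitrary base-R expression as the benchmarked code:

```r
library(bench)

results <- bench::press(
  N = 10^(3:6),                     # grid of data sizes
  {
    df <- data.frame(
      x = runif(N),
      g = sample(letters, N, replace = TRUE)
    )
    bench::mark(
      base = tapply(df$x, df$g, mean),
      check = FALSE
    )
  }
)
```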

grantmcdermott commented 1 year ago

One other (tbc?) pro for Polars is that multithreading automatically works on macOS.

I might be missing something about the R-universe build process, but enabling multithreading in other high-performance R libraries can be a bit of a pain. That's because these are C/C++ based, and the OpenMP toolchain has to be installed and then linked manually. (Basically, you have to specify a bunch of, e.g., C++ flags in your Makevars and then build from source instead of installing binaries.)
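
For concreteness, enabling OpenMP for source builds on macOS typically means adding compiler and linker flags along these lines to ~/.R/Makevars. The exact flags and paths vary with the compiler and where libomp is installed; treat this as an illustrative sketch, not a recipe:

```make
# ~/.R/Makevars (macOS, Apple clang + Homebrew libomp -- illustrative only)
CPPFLAGS += -Xclang -fopenmp -I/opt/homebrew/opt/libomp/include
LDFLAGS  += -L/opt/homebrew/opt/libomp/lib -lomp
```

None of this is needed for polars, since the Rust toolchain handles threading without OpenMP.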

eitsupi commented 1 year ago

I might be missing something about the R-universe build process

Do you mention that SIMD is disabled in the R-universe builds? https://github.com/pola-rs/r-polars/pull/78#issuecomment-1479702330

sorhawell commented 1 year ago

Do you mention that SIMD is disabled in the R-universe builds?

We do not :/ but should. I know of no benchmarks yet describing the performance difference.

etiennebacher commented 9 months ago

Can someone review the content of this vignette? I included the .md file so that it's easier to review from Github but we'll need to remove it before merging since it's not expected by R CMD check.

etiennebacher commented 8 months ago

Thanks for the review @eitsupi