pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

Add example for memory and speed gain when using lazy? #176

Closed. etiennebacher closed this issue 10 months ago.

etiennebacher commented 1 year ago

So far, the "Get started" vignette explains that using lazy frames can lead to large gains in speed and memory, but the example doesn't really show the extent of these gains (which is acknowledged in the vignette).

Here's a small example of how polars can reorganize a query in order to optimize it (it is much faster and more memory-efficient to filter then sort than to sort then filter). Should something like this be included in the vignette or elsewhere?

library(polars)
library(ukbabynames)

# make the dataset
test <- data.frame(
  x = sample(ukbabynames::ukbabynames$name, 5*1e7, TRUE),
  y = sample(1:1000, 5*1e7, TRUE),
  z = sample(1:100, 5*1e7, TRUE)
)
readr::write_csv(test, "foo.csv")

eager <- pl$read_csv("foo.csv")
lazy <- pl$read_csv("foo.csv", lazy = TRUE)

lazy_query <- lazy$
  sort(pl$col("x"))$
  filter(
    pl$col("y") > 50 & 
      pl$col("x")$is_in(pl$lit(c("John", "Mary")))
  )

lazy_query$describe_plan()
#>   FILTER [([(col("y")) > (50f64)]) & (col("x").is_in([Series]))] FROM
#>     SORT BY [col("x")]
#>       CSV SCAN foo.csv
#>       PROJECT */3 COLUMNS
lazy_query$describe_optimized_plan()
#>   SORT BY [col("x")]
#>     CSV SCAN foo.csv
#>     PROJECT */3 COLUMNS
#>     SELECTION: [([(col("y").cast(Float64)) > (50f64)]) & (col("x").is_in([Series]))]

bench::mark(
  eager = eager$
    sort(pl$col("x"))$
    filter(
      pl$col("y") > 50 & 
      pl$col("x")$is_in(pl$lit(c("John","Mary")))
    ),
  lazy = lazy_query$collect()
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 eager        11.32s   11.32s    0.0883    13.4KB        0
#> 2 lazy          1.17s    1.17s    0.858       624B        0

Created on 2023-04-26 with reprex v2.0.2

vincentarelbundock commented 1 year ago

This is cool. Maybe we should have a separate article with a few timing examples. It could start with a disclaimer paragraph saying that what follows are not formal benchmarks, plus a link to the DuckDB benchmarks.

I think @grantmcdermott also had a couple of impressive examples in stock.

grantmcdermott commented 1 year ago

I like the idea of a separate article with some striking examples. My own examples mostly use NYC taxi data, which is obviously well known for big-data tasks/benchmarks.

For me, perhaps the most impressive thing about Polars (and other new data wrangling frameworks like DuckDB) is the total read+wrangle time, especially for parquet-based datasets. This is obviously all lazy-optimized.

Like, if you do a straight shootout between polars and data.table on some data operation---say, a grouped aggregation---the results are honestly pretty similar. But if you start to account for time spent reading from disk or setting keys, then there can be a stark divergence in time and memory overhead.

I know that including these "pre" operations has its own problems, particularly if you think they will only be one-time costs at the top of a script. But I don't think it's quite right to ignore them either. (It's a little bit like being told to ignore precompilation times in Julia pre-v1.9, despite this often taking longer than an entire benchmark script in other languages.)
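Here is a rough sketch of the kind of read+wrangle shootout described above, reusing the foo.csv file from the example at the top. It is not a formal benchmark; the data.table and polars code paths are only illustrative, and both read the file, sort by x, and filter.

library(polars)
library(data.table)

bench::mark(
  data.table = {
    # read from disk, set a key (which sorts by x), then filter:
    # the "pre" work is part of the timing
    dt <- fread("foo.csv")
    setkey(dt, x)
    dt[y > 50 & x %in% c("John", "Mary")]
  },
  polars_lazy = {
    # the lazy scan defers reading, so the optimizer can push the filter
    # down before the sort and only materialize the rows it needs
    pl$read_csv("foo.csv", lazy = TRUE)$
      sort(pl$col("x"))$
      filter(
        pl$col("y") > 50 &
          pl$col("x")$is_in(pl$lit(c("John", "Mary")))
      )$
      collect()
  },
  check = FALSE # the two results have different classes, so skip equality checks
)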

eitsupi commented 1 year ago

Given that the arrow package has a significant number of articles, it may be worth linking to the arrow package's documentation. (In short, I consider arrow::open_dataset and polars::scan_parquet to be almost identical.)
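For what it's worth, a minimal sketch of that analogy, assuming a hypothetical foo.parquet file: both build a lazy query on the file and only do the work at collect().

library(arrow)
library(dplyr)
library(polars)

# arrow: lazy dataset plus dplyr verbs, evaluated at collect()
arrow_res <- open_dataset("foo.parquet") |>
  filter(y > 50) |>
  collect()

# polars: lazy scan plus expressions, evaluated at collect()
polars_res <- pl$scan_parquet("foo.parquet")$
  filter(pl$col("y") > 50)$
  collect()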

etiennebacher commented 1 year ago

I was thinking more of a small vignette highlighting the difference between eager and lazy evaluation, since this is new for people like me who have never used arrow and friends before, but there might be too much overlap with the "Lazy" section in the "Get started" vignette. For comparisons with other packages, I think pointing to the DuckDB benchmarks should be enough.

etiennebacher commented 1 year ago

By the way, the same operations with dplyr take about 6 seconds and consume 3.4GB (!) of memory. I know that polars is supposed to be very fast and memory efficient, but roughly 600B for the lazy query against 3.4GB for dplyr seems just insane (dplyr would use about 5 million times more memory than the lazy query).
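The dplyr code isn't shown above; presumably it was something along these lines (a sketch of the same sort-then-filter pipeline, not necessarily the exact code behind the 6 seconds / 3.4GB figures):

library(dplyr)

dplyr_res <- readr::read_csv("foo.csv") |>
  arrange(x) |>
  filter(y > 50, x %in% c("John", "Mary"))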

Is it possible that bench doesn't capture everything happening on the Rust side?

eitsupi commented 1 year ago

> Is it possible that bench doesn't capture everything happening on the Rust side?

I believe it was mentioned in bench's README or somewhere that what happens outside of R is not recorded. This also applies to the duckdb and arrow packages. We would probably see almost the same memory consumption if we performed the same operation with duckdb or arrow.

grantmcdermott commented 1 year ago

I wonder if something like cgmemtime would also work here? https://stackoverflow.com/questions/61376970/data-table-vs-dplyr-memory-use-revisited#61376971
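As a crude, Linux-only stand-in that can be run from inside R, one could read the kernel's record of the R process's peak memory from /proc, which does include allocations made on the Rust side. This is just a sketch (the peak_mem helper is hypothetical), not a substitute for a proper tool like cgmemtime.

# peak virtual memory (VmPeak) and peak resident set size (VmHWM) of this R process
peak_mem <- function() {
  status <- readLines("/proc/self/status")
  grep("^Vm(Peak|HWM)", status, value = TRUE)
}
peak_mem()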