etiennebacher closed this issue 10 months ago
This is cool. Maybe we should have a separate article with a few timing examples. It could start with a disclaimer paragraph saying that what follows are not formal benchmarks, plus a link to the DuckDB benchmarks.
I think @grantmcdermott also had a couple of impressive examples on hand.
I like the idea of a separate article with some striking examples. My own examples mostly use NYC taxi data, which is obviously well known for big-data tasks/benchmarks.
For me, perhaps the most impressive thing about Polars (and other new data wrangling frameworks like DuckDB) is the total read+wrangle time, especially for parquet-based datasets. This all comes down to lazy optimization.
Like, if you do a straight shootout between polars and data.table on some data operation---say, a grouped aggregation---the results are honestly pretty similar. But if you start to account for time spent reading from disk or setting keys, then there can be a stark divergence in time and memory overhead.
I know that including these "pre" operations has its own problems, particularly if you think they will only be one-time costs at the top of a script. But I don't think it's quite right to ignore them either. (It's a little bit like being told to ignore precompilation times in Julia pre-v1.9, despite this often taking longer than an entire benchmark script in other languages.)
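To make the read+wrangle point above concrete, here is a rough sketch (not a formal benchmark). The parquet file name and columns are hypothetical, and the exact `groupby`/`agg` method names depend on the polars version:

```r
library(polars)
library(data.table)

# Eager route: read the whole file into memory first, then aggregate.
dt <- as.data.table(arrow::read_parquet("taxi.parquet"))
res_dt <- dt[, .(mean_fare = mean(fare_amount)), by = passenger_count]

# Lazy route: scan_parquet() defers reading, so the optimizer can
# load only the two columns the query actually needs.
res_pl <- polars::scan_parquet("taxi.parquet")$
  groupby("passenger_count")$
  agg(pl$col("fare_amount")$mean())$
  collect()
```

On a large file, timing the *whole* pipeline (read + aggregate) rather than just the aggregation is where the lazy approach tends to pull ahead.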
Given that the arrow package has a significant number of articles, it may be worth linking to the arrow package. (In short, I consider arrow::open_dataset and polars::scan_parquet to be almost identical.)
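A minimal sketch of that parallel, assuming a hypothetical directory of parquet files (the dplyr verbs on an arrow dataset and the polars method chain are both lazy until collected):

```r
library(arrow)
library(dplyr)
library(polars)

# arrow: open_dataset() builds a lazy query; nothing is materialized
# until collect()
res_arrow <- arrow::open_dataset("taxi/") |>
  filter(passenger_count > 1) |>
  collect()

# polars: scan_parquet() is the analogous lazy entry point
res_polars <- polars::scan_parquet("taxi/*.parquet")$
  filter(pl$col("passenger_count") > 1)$
  collect()
```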
I was thinking more of a small vignette highlighting the difference between eager and lazy, as this is new for people like me who never used arrow and friends before. But there might be too much overlap with the "Lazy" section in the "Get started" vignette. For comparisons with other packages, I think pointing to the DuckDB benchmarks should be enough.
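The eager/lazy distinction such a vignette would cover can be sketched in a few lines (a hedged example; method names may differ slightly across polars versions):

```r
library(polars)

# Eager: each call executes immediately on an in-memory DataFrame
df <- pl$DataFrame(mtcars)
eager_res <- df$filter(pl$col("cyl") == 6)

# Lazy: calls only build a query plan; nothing runs until $collect(),
# which gives the optimizer a chance to reorder and prune work first
lazy_res <- df$lazy()$
  filter(pl$col("cyl") == 6)$
  collect()
```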
By the way, the same operations with dplyr take about 6 seconds and consume 3.4GB (!) of memory. I know that polars is supposed to be super fast and memory-efficient, but 600B for the lazy query against 3.4GB in dplyr seems just insane (basically dplyr would use about 5M times more memory than the lazy query).
Is it possible that bench doesn't capture everything happening on the Rust side?
> Is it possible that bench doesn't capture everything happening on the Rust side?
I believe it was mentioned in bench's README or somewhere that what we do outside of R is not recorded. This also applies to the duckdb and arrow packages. We would probably see almost the same memory consumption if we performed the same operation in duckdb or arrow.
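That caveat is easy to demonstrate with `bench::mark()` itself. A small sketch (the tiny `mem_alloc` figure is expected, since bench only tracks R-level allocations, not memory allocated inside the Rust engine):

```r
library(bench)
library(polars)

# mem_alloc reflects only allocations made through R's allocator.
# Memory used by the Rust engine during $collect() is invisible to
# bench, which is why a lazy polars query can report a few hundred
# bytes even when the engine is working on much more data.
bench::mark(
  polars = pl$DataFrame(mtcars)$lazy()$
    filter(pl$col("cyl") == 6)$
    collect(),
  check = FALSE
)
```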
I wonder if something like cgmemtime would also work here? https://stackoverflow.com/questions/61376970/data-table-vs-dplyr-memory-use-revisited#61376971
So far, the "Get started" vignette explains that using lazy frames can lead to large gains in speed and memory, but the example doesn't really show the extent of these gains (which is acknowledged in the vignette).
Here's a small example of how polars can reorganize the query in order to optimize it (it is much faster and more memory-efficient to do filter-then-sort rather than sort-then-filter). Should something like this be included in the vignette or elsewhere?

Created on 2023-04-26 with reprex v2.0.2