Closed by andrei-ionescu 1 year ago
Could you provide me with a docker setup for spark with the right settings so I can reproduce what you did?
What are your timings in the polars python api?
P.S. I changed the title as we do much more than only joins here.
@ritchie46: Currently Apache Spark is installed on my local machine, the same place where I ran the Polars test. The Spark version is 3.2.1.
I don't plan to use Python because I want to be as close as possible to the hardware.
Could you still get some timings in python? I know that the python bindings have the optimal settings for optimizations and allocators, so that would give me more information.
@andrei-ionescu Are you sure you compiled in release mode and with optimizations?
In [16]: file1 = 'data_fact_anonymised_mot_test_item_2021_20220620T090029Z_anonymised_mot_test_item_2021.parquet'
In [17]: file2 = 'data_dimension_vehicle_20220620T085710Z_vehicle.parquet'
In [18]: df1 = pl.scan_parquet(file1)
In [19]: df2 = pl.scan_parquet(file2)
In [20]: df3 = (
...: df2
...: .join(
...: df1,
...: left_on=[pl.col("drv_vehicle_key")],
...: right_on=[pl.col("drv_vehicle_key")],
...: how="inner",
...: )
...: .groupby([pl.col("drv_vehicle_make")])
...: .agg([pl.count().alias("cnt")])
...: .select([
...: pl.col("drv_vehicle_make"),
...: pl.col("cnt"),
...: ])
...: .sort(
...: "cnt",
...: reverse=True,
...: nulls_last=True,
...: )
...: .limit(20)
...: )
In [21]: %time df4 = df3.collect()
CPU times: user 16.4 s, sys: 2.13 s, total: 18.6 s
Wall time: 2.8 s
In [22]: df4
Out[22]:
shape: (20, 2)
┌──────────────────┬──────────┐
│ drv_vehicle_make ┆ cnt │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════════╪══════════╡
│ FORD ┆ 13551161 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VAUXHALL ┆ 9322023 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLKSWAGEN ┆ 7671689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PEUGEOT ┆ 4152401 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MINI ┆ 1586903 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ SKODA ┆ 1348856 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MAZDA ┆ 1318925 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLVO ┆ 1234986 │
└──────────────────┴──────────┘
(I changed "cat" to "cnt" in the select statement to get it running.)
@ghuls: Could you provide the Rust counterpart of that code? And what does "compiled in release mode and with optimizations" mean? Is there a command or some options that I need to use?
@andrei-ionescu could you run the code provided by @ghuls and report the time it takes on your machine?
@ritchie46, @ghuls: I just ran the Python code above in Python 3 and the result is somewhere around 5 s.
It seems that there is something missing on the Rust side.
There are several things you can do, in order of ease:
- enable the "performant" and "chunked_ids" feature gates
The Python Polars bindings have all of these except your target CPU flags.
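For reference, a hedged sketch of what those two knobs could look like in practice. The feature names are taken from this thread; the polars version string and the choice of `target-cpu=native` are assumptions, not an official recipe:

```shell
# Sketch only. In Cargo.toml, enable the feature gates mentioned above, e.g.:
#   polars = { version = "...", features = ["lazy", "parquet", "performant", "chunked_ids"] }
# Then build an optimized binary tuned for the local CPU:
RUSTFLAGS="-C target-cpu=native" cargo build --release
```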
@ritchie46: Is there any documentation on how to get this on par with Python? If it works that well in Python, what options does Python use? I'm asking because rustc has a lot of parameters to toggle; I'm interested in the ones used when building Polars for Python.
Even with just cargo build --release:
❯ timeit polars_lazy_scan_parquet/target/release/polars_lazy_scan_parquet
[src/main.rs:54] df3 = shape: (20, 2)
┌──────────────────┬──────────┐
│ drv_vehicle_make ┆ cnt │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════════╪══════════╡
│ FORD ┆ 13551161 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VAUXHALL ┆ 9322023 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLKSWAGEN ┆ 7671689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PEUGEOT ┆ 4152401 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MINI ┆ 1586903 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ SKODA ┆ 1348856 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MAZDA ┆ 1318925 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLVO ┆ 1234986 │
└──────────────────┴──────────┘
Total time 3.284089366s
Time output:
------------
* Command: polars_lazy_scan_parquet/target/release/polars_lazy_scan_parquet
* Elapsed wall time: 0:03.33 = 3.33 seconds
* Elapsed CPU time:
- User: 19.42
- Sys: 2.44
* CPU usage: 655%
* Context switching:
- Voluntarily (e.g.: waiting for I/O operation): 29911
- Involuntarily (time slice expired): 5716
* Maximum resident set size (RSS: memory) (kiB): 4752296
* Number of times the process was swapped out of main memory: 0
* Filesystem:
- # of inputs: 0
- # of outputs: 0
* Exit status: 0
@ghuls: In my case it is not working. This is the code I use:
use polars::chunked_array::object::SortOptions;
use polars::prelude::{LazyFrame, ScanArgsParquet, PolarsError, JoinType, ParallelStrategy};
use polars::lazy::dsl::*;
use std::time::Instant;

fn main() -> Result<(), PolarsError> {
    let time_pass = Instant::now();

    let file1 = "data/parquet/data-fact-anonymised_mot_test_item-2021-20220620T090029Z-anonymised_mot_test_item_2021.parquet".to_string();
    let file2 = "data/parquet/data-dimension-vehicle-20220620T085710Z-vehicle.parquet".to_string();

    let df1 = LazyFrame::scan_parquet(file1, scan_args()).unwrap();
    let df2 = LazyFrame::scan_parquet(file2, scan_args()).unwrap();

    let df3 = df2
        .join(
            df1,
            [col("drv_vehicle_key")],
            [col("drv_vehicle_key")],
            JoinType::Inner,
        )
        .groupby([col("drv_vehicle_make")])
        .agg([count().alias("cnt")])
        .select([
            col("drv_vehicle_make"),
            col("cnt"),
        ])
        .sort(
            "cnt",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .limit(20)
        .collect()
        .unwrap();

    dbg!(df3);
    println!("Total time {} seconds", time_pass.elapsed().as_secs());
    Ok(())
}

fn scan_args() -> ScanArgsParquet {
    ScanArgsParquet {
        n_rows: None,
        cache: true,
        parallel: ParallelStrategy::Auto,
        rechunk: true,
        row_count: None,
    }
}
This is the Cargo.toml:
[package]
name = "my-app"
version = "0.1.0"
edition = "2021"
[dependencies]
polars = { path = "/.../polars/polars", features = [ "polars-io", "parquet", "lazy", "dtype-struct" ] }
polars-sql = { path = "/.../polars/polars-sql" }
This is what I do to build:
$ cargo build --release --target=x86_64-apple-darwin
I'm on a macOS Monterey 12.3.1.
This is how I run it:
$ cargo run
The output is this:
[src/main.rs:48] df3 = shape: (20, 2)
┌──────────────────┬──────────┐
│ drv_vehicle_make ┆ cnt │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════════╪══════════╡
│ FORD ┆ 13551161 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VAUXHALL ┆ 9322023 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLKSWAGEN ┆ 7671689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PEUGEOT ┆ 4152401 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MINI ┆ 1586903 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ SKODA ┆ 1348856 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MAZDA ┆ 1318925 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLVO ┆ 1234986 │
└──────────────────┴──────────┘
Total time 47 seconds
So...
I am on Linux.
Invoke the binary directly, or use cargo run --release --target=x86_64-apple-darwin. You are running the debug version.
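A minimal sketch (not part of the thread's benchmark code) of how a program could flag this pitfall itself: `cfg!(debug_assertions)` is on for plain `cargo build`/`cargo run` and off for `--release`, so a benchmark binary can warn when it is accidentally run as a debug build.

```rust
// Sketch: detect at compile time whether this binary is a debug or release
// build, so misleading debug-build timings can be flagged before benchmarking.
fn build_profile() -> &'static str {
    // `debug_assertions` is enabled by default in the dev profile
    // and disabled in the release profile.
    if cfg!(debug_assertions) { "debug" } else { "release" }
}

fn main() {
    if cfg!(debug_assertions) {
        eprintln!("warning: debug build; timings will be misleading");
    }
    println!("running a {} build", build_profile());
}
```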
With just cargo run it takes 51.650563074 seconds.
@ghuls: Yes! Hooray! That was it. Thank you so much for bearing with me through this whole use case. This looks very promising for what I'm trying to do.
[src/main.rs:48] df3 = shape: (20, 2)
┌──────────────────┬──────────┐
│ drv_vehicle_make ┆ cnt │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════════╪══════════╡
│ FORD ┆ 13551161 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VAUXHALL ┆ 9322023 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLKSWAGEN ┆ 7671689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PEUGEOT ┆ 4152401 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MINI ┆ 1586903 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ SKODA ┆ 1348856 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ MAZDA ┆ 1318925 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ VOLVO ┆ 1234986 │
└──────────────────┴──────────┘
Total time 4 seconds
Maybe it would be a good idea to specify this somewhere in the documentation.
@ritchie46, @jorgecarleitao, @ghuls: WDYT?
This can be assumed common knowledge among Rust users. Have you finished reading the Rust book?
I agree, @andrei-ionescu, that it is a bit confusing. cargo build ... essentially creates the binary; to run it, use ./target/{release,debug}/{program} .... Alternatively, you can use cargo run ..., which is basically an alias for the above. I think cargo's documentation is the place to cover that, as @ritchie46 wrote.
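The workflow described above, as a short sketch (the binary name is illustrative):

```shell
cargo build --release            # compiles an optimized binary into ./target/release/
./target/release/my_program      # run the optimized binary directly
cargo run --release              # or build and run it in one step
```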
An idea: maybe we could request that users filing issues here with Rust Polars also paste the shell command they use to run the example (besides the code)? It would make things easier for everyone involved and does not really hurt (they already have to copy-paste the example).
Yeap, that might save some time for everybody involved, I shall update the issue template.
Ok. So you're suggesting adding a new section to the issue template where people reporting Rust issues write their build and execution commands. I think that would be a good start. But it still won't spare you from having to check those commands every time an issue is created 😄.
Giving people the tools to check for themselves is a better approach, in my opinion, and minimises the probability of new issues like this one being created.
What language are you using?
Rust
Which feature gates did you use?
"polars-io", "parquet", "lazy", "dtype-struct"
Have you tried latest version of polars?
What version of polars are you using?
latest, git master
What operating system are you using polars on?
macOS Monterey 12.3.1
What language version are you using?
Describe your bug.
The processing time in the case of Polars is 4x longer.
What are the steps to reproduce the behavior?
Given the following two datasets from OpenDataBlendIO:
- 978 MB, 82,426,973 rows
- 59 MB, 1,144,650 rows
Comparing the processing time of this Polars DataFrame API query:
with the similar one in Spark:
The results are:
All timings are in milliseconds (ms). Each Spark run was started from scratch with a new Spark instance.
What is the actual behavior?
The processing time in the case of Polars is 4x longer.
What is the expected behavior?
The processing time should be better in the case of Polars.