vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Performance issues while working with chunked data #2063

Open wckdman opened 2 years ago

wckdman commented 2 years ago

Description

The method to_pandas_df works significantly more slowly on chunked data than on a single file. The experimental data consist of 100k rows and 800 columns, stored in two versions: a single file and multiple chunked files.

I apply a set of operations such as sort and sample, followed by to_pandas_df. These operations run about 5 times slower on the chunked data. I also noticed that performance on CSV files is better.

Here's a code snippet:

import vaex

# Open all Parquet chunks as a single DataFrame
df = vaex.open("data/*.parquet")
# Randomly sample 20k of the 100k rows
df = df.sample(n=20_000, random_state=42)
# Convert to pandas; this step is much slower on chunked data
pdf = df.to_pandas_df()

Software information

JovanVeljanoski commented 2 years ago

Yes, that is expected, I believe. This is why we recommend using a single file for optimal performance.

Especially if you do things like sample: then you are randomly accessing rows, which is the least efficient access pattern in vaex. I don't know what your use case is (whether you are exploring and need to see different parts of the data, or whether sampling is part of your computational process), but in general you want to avoid it.

Sometimes, if we need to shuffle or sort, we perform that operation once and then export the result to disk, so that subsequent read access is sequential and faster.

Perhaps @maartenbreddels can provide more info here.