vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Performance issues while working with chunked data #2063

Open wckdman opened 2 years ago

wckdman commented 2 years ago

Description

The method to_pandas_df works significantly more slowly on chunked data than on a single file. The experimental data consist of 100k rows and 800 columns, stored in two versions: a single file and multiple chunked files.

I apply a set of operations such as sort and sample, followed by to_pandas_df. These operations run about 5 times slower on the chunked data. I also noticed that performance on CSV files is better.

Here's a code snippet:

import vaex

# Open all Parquet chunks as a single DataFrame
df = vaex.open("data/*.parquet")
# Randomly sample 20k of the 100k rows
df = df.sample(n=20_000, random_state=42)
# Convert to pandas; this step is much slower on chunked data
pdf = df.to_pandas_df()

Software information

JovanVeljanoski commented 2 years ago

Yes, that is expected, I believe. This is why we recommend using a single file for optimal performance.

Especially if you do things like sample: then you are randomly accessing rows, which is the least efficient access pattern in vaex. I don't know what your use case is (whether you are exploring and need to see different parts of the data, or whether sampling is part of your computational process), but in general you want to avoid it.

Sometimes, if we need to shuffle or sort, we perform that operation once and then export the result to disk, so that subsequent read access is sequential and faster.

Perhaps @maartenbreddels can provide more info here.