Open wckdman opened 2 years ago
Yes, that is expected, I believe, which is why we recommend using a single file for optimal performance.
Especially if you do things like `sample`: then you are randomly accessing rows, which is the least efficient thing to do in vaex. I don't know what your use case is (whether you are exploring and need to see different bits of the data, or whether it is part of your computational process), but you want to avoid that in general.
When we need to shuffle or sort, we sometimes do that operation once and then export the result to disk, so that later read access is sequential and faster.
Perhaps @maartenbreddels can provide more info here.
Description
Method `to_pandas_df` works significantly slower on chunked data than on one file. Experimental data consist of 100k rows and 800 columns stored in two versions:
I apply a set of operations such as `sort` and `sample` followed by `to_pandas_df`. These operations perform 5 times slower on chunked data. Also I noticed that performance on csv files is better.
Here's a code snippet:
Software information