vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Difference in time taken when slicing an offset differs based on sort order. #2341

Open vignesh-bungee opened 1 year ago

vignesh-bungee commented 1 year ago

Hi Vaex Team,
We are experiencing an issue with the sort function in Vaex. When we sort our dataset (shape: (13160951, 77)) by the estimaterevenue column (dtype: float64) in descending order, slicing at the lower offset is slow, while slicing at the higher offset is relatively fast. When we sort the same column in ascending order, the pattern reverses: slicing at the lower offset is relatively fast, while slicing at the higher offset is slow. This looks like a bug to us. Could you please look into it and confirm?
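For anyone trying to reproduce this, here is a minimal timing sketch of the measurement described above. It is not the code from the attached notebook; the helper names `timed` and `time_sorted_slices` are made up for illustration, and it assumes `df` is a vaex DataFrame that supports `sort(by, ascending=...)`, slicing, and `Expression.tolist()`.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_sorted_slices(df, column, ascending, n=10):
    """Time the sort, then a slice at the lowest and at the highest offset.

    Assumption: df behaves like a vaex DataFrame. Only sort(), len(),
    slicing and Expression.tolist() are used.
    """
    sorted_df, sort_s = timed(lambda: df.sort(column, ascending=ascending))
    total = len(sorted_df)
    # .tolist() is called so that each slice is actually materialized
    low, low_s = timed(lambda: sorted_df[0:n][column].tolist())
    high, high_s = timed(lambda: sorted_df[total - n:total][column].tolist())
    return {
        "sort_s": sort_s,
        "low_s": low_s,
        "high_s": high_s,
        "low": low,
        "high": high,
    }
```

Calling `time_sorted_slices(df, "estimaterevenue", ascending=False)` and then again with `ascending=True` should show whether `low_s` and `high_s` swap depending on the sort order, as reported.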

Software information
• Vaex version: 4.14.0
• Vaex was installed via: pip
• OS: Linux

Additional information
Jupyter notebook screenshot image attached


Ben-Epstein commented 1 year ago

@vignesh-bungee I believe the first access is simply applying the sort and the second one hits the cache, but I'm not 100% certain.

Try doing the entire thing under a

with vaex.cache.off():
    ...  # run the sort and both slices here

to see the results without the cache.

Oddly enough, I'm seeing something different from you: the sort itself takes some time, but then (because it's cached) both the bottom and top accessors are extremely fast.

I am using a newer version of vaex, but I don't think what I'm seeing is particularly expected.

image

chitranshubungee commented 1 year ago

@Ben-Epstein Since our dataset is very large and we perform many operations on it, including slicing and dicing, using vaex.cache.off() can degrade performance. We tried it in our Jupyter notebook anyway and found that it yields similar results with or without the cache.

Column information