vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] `tolist` is much slower than `to_numpy().tolist()` #2325

Open Ben-Epstein opened 1 year ago

Ben-Epstein commented 1 year ago


import vaex
df = vaex.example()
df.export("file.arrow")
df2 = vaex.open("file.arrow")

# time this
with vaex.cache.off():
    df2.id.tolist()

# vs this
with vaex.cache.off():
    df2.id.to_numpy().tolist()
maartenbreddels commented 1 year ago

Interesting. It seems that this is due to Arrow's .to_pylist(). Can you see if you can reproduce this using arrow only? If so, this is an arrow performance issue.

Ben-Epstein commented 1 year ago

@maartenbreddels yes, it's happening in arrow as well


When the column is a numpy array within vaex, it is fast


Maybe vaex could detect when a column can be converted to a numpy array and take that path automatically? I will also open an issue in pyarrow.

Ben-Epstein commented 1 year ago

Created https://github.com/apache/arrow/issues/34354