[BUG-REPORT] `tolist` is much slower than `to_numpy().tolist()`

Ben-Epstein commented 1 year ago

Software information

Vaex version (import vaex; vaex.__version__): {'vaex-core': '4.16.0', 'vaex-hdf5': '0.12.2'}
Vaex was installed via: pip / conda-forge / from source
OS:

Additional information

import vaex
df = vaex.example()
df.export("file.arrow")
df2 = vaex.open("file.arrow")

# time this
with vaex.cache.off():
    df2.id.tolist()

# vs this
with vaex.cache.off():
    df2.id.to_numpy().tolist()

maartenbreddels commented 1 year ago

Interesting. It seems that this is due to Arrow's .to_pylist(). Can you see if you can reproduce this using arrow only? If so, this is an arrow performance issue.

Ben-Epstein commented 1 year ago

@maartenbreddels yes, it's happening in arrow as well
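
For reference, an Arrow-only check along these lines could look like the sketch below. It is not the exact script used here: the helper names (`timed`, `compare_arrow_tolist`) and the array size are illustrative assumptions, and it assumes `pyarrow` and `numpy` are installed.

```python
import time


def timed(fn):
    """Run fn once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn()
    return time.perf_counter() - start, result


def compare_arrow_tolist(n=1_000_000):
    """Time Array.to_pylist() against to_numpy().tolist() on n integers."""
    # Imported here so the timing helper above has no third-party dependency.
    import numpy as np
    import pyarrow as pa

    # Illustrative data: a plain int64 array without nulls.
    arr = pa.array(np.arange(n, dtype="int64"))

    t_pylist, via_pylist = timed(arr.to_pylist)
    t_numpy, via_numpy = timed(lambda: arr.to_numpy().tolist())

    # Both paths should produce the same list of Python ints.
    assert via_pylist == via_numpy
    return t_pylist, t_numpy
```

Calling `compare_arrow_tolist()` and printing the two timings shows whether the gap appears without vaex involved at all.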

When the column is a numpy array within vaex, it is fast

Maybe vaex could detect when a column can be converted to a numpy array, and take that path automatically? I will also open an issue in pyarrow

Ben-Epstein commented 1 year ago

Created https://github.com/apache/arrow/issues/34354

vaexio / vaex

[BUG-REPORT] `tolist` is much slower than `to_numpy().tolist()` #2325