vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Will vaex support arrow tables with more than 1 chunk? #716

Closed nugend closed 3 years ago

nugend commented 4 years ago

Not sure if this is a bug or an intentional restriction, but it's not spelled out in the documentation.

You can convert these to concatenated DataFrames, but at the moment they break badly when you actually try to use them. I think some fixes for concatenated DataFrames are currently in the pipeline for the next release?

import vaex, pyarrow as pa
from vaex.dataframe import DataFrameConcatenated
from pyarrow.dataset import dataset
from pyarrow.fs import LocalFileSystem

def convert_multi_chunk_arrow_table_to_vaex(t):
    """Wrap each record batch of `t` in its own vaex DataFrame, then concatenate them."""
    vts = []
    for i, b in enumerate(t.to_batches()):
        # Each batch becomes a single-chunk Arrow table that vaex can ingest.
        vt = vaex.from_arrow_table(pa.Table.from_batches([b]))
        # vaex expects tables to have a name and path when concatenated? bug?
        vt.name = f'name-{i}'
        vt.path = f'path-{i}'
        vts.append(vt)
    print("Converted to Arrow Table List")
    return DataFrameConcatenated(vts)

t = dataset('/home/dnugent/feather_demo',
            format="feather",
            filesystem=LocalFileSystem(use_mmap=True)).to_table()
df = convert_multi_chunk_arrow_table_to_vaex(t)

Anyway, wasn't really sure if this was a bug or known limit. Thanks.

JovanVeljanoski commented 4 years ago

Hi @nugend

Indeed, in the past, concatenated DataFrames were not our focus and had many limitations (bugs). Recently there have been many improvements, and I think all reported bugs regarding concatenated DataFrames have been resolved.

This is all in the master branch (in case you are comfortable with a dev install).

Otherwise, we expect to make a new release in the coming 1-2 weeks, which will include all the work done on the concatenated DataFrames.

Regarding your main question, we do plan on supporting chunked Arrow tables soon as well (i.e. reading them easily in one line).

JovanVeljanoski commented 3 years ago

I believe this is now supported. Please reopen if the issue persists.