vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Cannot export large_string type to arrow file #2217

Closed Ben-Epstein closed 1 year ago

Ben-Epstein commented 1 year ago


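import vaex
import pyarrow as pa

# Build a small DataFrame with an integer column and a string column.
df = vaex.from_arrays(
    id=list(range(100)),
    text=[f"test{i}" for i in range(100)]
)

# Replace the string column with an Arrow large_string (64-bit offsets) array.
df["text"] = df["text"].to_arrow().cast(pa.large_string())
# Exporting to an Arrow file raises the error shown in the traceback below.
df.export("file.arrow")

~.venv/lib/python3.7/site-packages/vaex/dataframe.py in write(writer)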
   6725                 with vaex.progress.tree(progress, title="export(arrow)") as progressbar:
   6726                     for i1, i2, table in self.to_arrow_table(chunk_size=chunk_size, parallel=parallel, reduce_large=reduce_large):
-> 6727                         writer.write_table(table)
   6728                         progressbar(i2/N)
   6729                     progressbar(1.)

~.venv/lib/python3.7/site-packages/pyarrow/ipc.pxi in pyarrow.lib._CRecordBatchWriter.write_table()

~.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Tried to write record batch with different schema

Ben-Epstein commented 1 year ago

Ah, my vaex version was out of date. On the newer version (4.13) it works.
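
In case it helps anyone else hitting this, here is a rough sketch of a version guard to run before exporting. It assumes that "4.13" refers to the `vaex` distribution on PyPI and that anything older may be affected; the thread only confirms that 4.13 works, so treat the threshold as a guess. The helper names `major_minor` and `installed_at_least` are made up for this example, and `importlib.metadata` needs Python 3.8 or newer (the traceback above is from 3.7).

```python
import re
from importlib import metadata


def major_minor(version_string):
    """Return (major, minor) as ints from a version string like '4.13.0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def installed_at_least(package, minimum):
    """True if `package` is installed and its (major, minor) is >= `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return major_minor(installed) >= minimum


# (4, 13) is the version reported as working in this thread; it is an
# assumption that older versions are the ones affected.
if not installed_at_least("vaex", (4, 13)):
    print("vaex is missing or older than 4.13; consider upgrading before exporting")
```

Comparing tuples of integers avoids the string-comparison trap where "4.9" would sort after "4.13".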

maartenbreddels commented 1 year ago

Good to hear!