Open Ark-kun opened 3 years ago
Hi! This is because we are actually using the Arrow streaming format. We plan to switch to the Arrow IPC file format. More info at #1933
Not sure if this was resolved, but I am getting a similar error when trying to load a dataset.arrow file directly: ArrowInvalid: Not an Arrow file
Since we're using the streaming format, you need to use open_stream:
import pyarrow as pa


def in_memory_arrow_table_from_file(filename: str) -> pa.Table:
    in_memory_stream = pa.input_stream(filename)
    opened_stream = pa.ipc.open_stream(in_memory_stream)
    pa_table = opened_stream.read_all()
    return pa_table


def memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
    memory_mapped_stream = pa.memory_map(filename)
    opened_stream = pa.ipc.open_stream(memory_mapped_stream)
    pa_table = opened_stream.read_all()
    return pa_table
Thank you very much for providing the code that reads an Arrow file into a pa.Table and finally into a dict. How can the reverse process be implemented, i.e. how can a dict be written back to an Arrow file?
Describe the bug
Steps to reproduce the bug
Expected results
I expect that the saved dataset can be read by the official Apache Arrow methods.
Actual results
Environment info
datasets version: datasets-1.6.2