huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

https://huggingface.co/docs/datasets

Apache License 2.0

ArrowDataset.save_to_disk produces files that cannot be read using pyarrow.feather #2377

Open Ark-kun opened 3 years ago

Ark-kun commented 3 years ago

Describe the bug

The dataset.arrow file written by save_to_disk cannot be opened with pyarrow.feather.read_table; the call fails with ArrowInvalid.

Steps to reproduce the bug

from datasets import load_dataset
from pyarrow import feather

dataset = load_dataset('imdb', split='train')
dataset.save_to_disk('dataset_dir')
table = feather.read_table('dataset_dir/dataset.arrow')

Expected results

I expect that the saved dataset can be read by the official Apache Arrow methods.

Actual results

  File "/usr/local/lib/python3.7/site-packages/pyarrow/feather.py", line 236, in read_table
    reader.open(source, use_memory_map=memory_map)
  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherReader.open
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file

Environment info

datasets version: datasets-1.6.2
Platform: Linux
Python version: 3.7
PyArrow version: 0.17.1, also 2.0.0

lhoestq commented 3 years ago

Hi! This is because save_to_disk currently writes the Arrow IPC streaming format, whereas feather.read_table expects the Arrow IPC file format (also known as Feather V2). We plan to switch to the IPC file format. More info at #1933

ijmiller2 commented 1 year ago

Not sure if this was resolved, but I am getting a similar error when trying to load a dataset.arrow file directly: ArrowInvalid: Not an Arrow file

lhoestq commented 1 year ago

Since we're still using the streaming format, you need to read the file with pa.ipc.open_stream:

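import pyarrow as pa

def in_memory_arrow_table_from_file(filename: str) -> pa.Table:
    """Read a streaming-format Arrow file fully into memory."""
    in_memory_stream = pa.input_stream(filename)
    opened_stream = pa.ipc.open_stream(in_memory_stream)
    # read_all() consumes every record batch and returns one Table.
    pa_table = opened_stream.read_all()
    return pa_table

def memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
    """Read a streaming-format Arrow file through a memory map."""
    # The table's buffers point into the mapping instead of being copied.
    memory_mapped_stream = pa.memory_map(filename)
    opened_stream = pa.ipc.open_stream(memory_mapped_stream)
    pa_table = opened_stream.read_all()
    return pa_table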

wangfan120 commented 10 months ago

(Quoting lhoestq's reply above: "Since we're using the streaming format, you need to use open_stream", followed by the same two functions.)

Thank you very much for the code that reads an arrow file into a pa.Table and finally into a dict. How can I do the reverse, that is, write a dict back to an arrow file?