vega / vega-datasets

Common repository for example datasets used by Vega-related projects

Unable to process .arrow file in the datasets #545

Closed ramses-lee closed 7 months ago

ramses-lee commented 7 months ago

A general demonstration is outlined in this Google Colab notebook: https://colab.research.google.com/drive/1oKhivD5T9Yi1gMl0_7dUwqVFqiNfD43k?usp=sharing

The 'flights-200k.arrow' file produces an error every time I try to read it with the pandas package.

domoritz commented 7 months ago

Can you try reading it both as a file and as a stream? Maybe try pyarrow directly.

ramses-lee commented 7 months ago

Not exactly sure what you meant, but I tested both the parquet read_table() function and the pyarrow memory_map() function, and both gave me an error.

domoritz commented 7 months ago

Ahh, I fixed it. The file wasn't closed properly.

This works now.

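import pyarrow as pa

# Read the whole Arrow IPC file into memory.
with open('data/flights-200k.arrow', 'rb') as f:
    buf = f.read()

# Parse the bytes as an Arrow IPC file and convert to a pandas DataFrame.
with pa.ipc.open_file(buf) as reader:
    df = reader.read_pandas()

print(df)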
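As a possible quick check for this kind of problem: the Arrow IPC file format begins and ends with the magic bytes ARROW1, and the trailing copy is written only when the writer is closed. One plausible reading of "wasn't closed properly" is that this trailer was missing, though the thread does not confirm the exact cause. The sketch below, with a hypothetical helper name, tests only for the two magic markers and does not validate the rest of the file.

```python
ARROW_MAGIC = b"ARROW1"


def has_arrow_file_magic(data: bytes) -> bool:
    """Return True if data starts and ends with the Arrow IPC file magic.

    Hypothetical helper: a cheap sanity check, not a full validation.
    """
    return (
        len(data) >= 2 * len(ARROW_MAGIC)
        and data.startswith(ARROW_MAGIC)
        and data.endswith(ARROW_MAGIC)
    )


# Synthetic byte strings standing in for a complete and a truncated file.
complete = ARROW_MAGIC + b"\x00\x00" + b"payload" + ARROW_MAGIC
truncated = ARROW_MAGIC + b"\x00\x00" + b"payload"

complete_ok = has_arrow_file_magic(complete)
truncated_ok = has_arrow_file_magic(truncated)
```

To run it against a real file, read the file in 'rb' mode and pass the bytes to the helper. A False result suggests the file was cut off or is in the stream layout instead.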