pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_feather doesn't work when columns are shuffled #33878

Closed. Benjamin15 closed this 4 years ago

Benjamin15 commented 4 years ago

Code Sample, a copy-pastable example

# Your code here
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": ["x", "y"],
    "C": [True, False]
})
df.to_feather("./test_data.feather")

df2 = pd.read_feather("./test_data.feather", columns=['B', 'A'])

Error message

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-4-1e23cf201732> in <module>
     15 
     16 
---> 17 df2 = pd.read_feather("/misc/labshare/datasets3/rating/data/preprocessing/tests/test_data.feather", columns=['B', 'A'])

~/.conda/envs/venv/lib/python3.6/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads)
    101     path = stringify_path(path)
    102 
--> 103     return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    206     """
    207     _check_pandas_version()
--> 208     return (read_table(source, columns=columns, memory_map=memory_map)
    209             .to_pandas(use_threads=use_threads))
    210 

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240 
    241     column_type_names = [t.__name__ for t in column_types]

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/.conda/envs/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
B: string
A: int64
vs
A: int64
B: string

Problem description

We don't always know the order in which the columns are stored in a file. The problem appeared after we updated pyarrow to 0.17.0.

This line works fine:

df2 = pd.read_feather("./test_data.feather", columns=['A', 'B'])

Should we apply a fix here or in the pyarrow repository?

Expected Output

df2 = pd.DataFrame({ "A": [1, 2], "B": ["x", "y"], })

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : 0.29.15
pytest : 5.3.2
hypothesis : 5.5.4
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.2.3
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.48.0

jorisvandenbossche commented 4 years ago

@Benjamin15 Thanks a lot for the report! This is indeed a regression. I opened an issue for this on the Arrow side (since the bug is in the latest pyarrow 0.17 release): https://issues.apache.org/jira/browse/ARROW-8641

jorisvandenbossche commented 4 years ago

Closed by #34883