pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_parquet converts pyarrow list type to numpy dtype #53011

Open danielhanchen opened 1 year ago

danielhanchen commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # *** FAIL

data_object = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = object),
})
data_object.to_parquet("data.parquet")
pyarrow_internal = pa.parquet.read_table("data.parquet") # SUCCESS with type list[string]
pyarrow_internal.to_pandas() # SUCCESS, but the column comes back as object

pd.Series(pd.arrays.ArrowExtensionArray(pyarrow_internal["Pyarrow"])) # SUCCESS - data-type also correct!

Issue Description

Great work on extending Arrow to Pandas! Using pd.ArrowDtype(pa.list_(pa.string())) or any other variation works when saving to Parquet, but fails when reading the Parquet file back.

In fact, if there is a Pandas Series of pure lists of strings, e.g. ["a"], ["a", "b"], Parquet saves it internally as a list[string] type. When Pandas reads the parquet file, it then converts it to an object type.

Is there a way during the reading step to either:

  1. Convert the data-type to an object type, like in the pure-list case, OR
  2. pd.Series(pd.arrays.ArrowExtensionArray(x)) seems to actually work! Maybe during the conversion from the internal Pyarrow representation into Pandas, we can use pd.Series(pd.arrays.ArrowExtensionArray(x)) on columns which had errors? OR
  3. Somehow support these new types?

Expected Behavior

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # SUCCESS

Installed Versions

INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.11.3.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 30 Stepping 5, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Australia.1252 pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.7.2 pip : 23.1.2 Cython : 0.29.34 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader: None bs4 : 4.12.2 bottleneck : None brotli : fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : 0.57.0rc1 numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : 2.0.1 zstandard : 0.21.0 tzdata : 2023.3 qtpy : None pyqt5 : None
danielhanchen commented 1 year ago

I found that during the pyarrow conversion, if you pass in a types_mapper and set ignore_metadata to True, it works!

mapping = {schema.type : pd.ArrowDtype(schema.type) for schema in data.schema}
data.to_pandas(types_mapper = mapping.get, ignore_metadata = True)
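
For reference, a self-contained sketch of that workaround (assuming the data.parquet file written in the example above) could look like this:

import pandas as pd
import pyarrow.parquet as pq

# Read the file with pyarrow directly, then map every Arrow type to an
# ArrowDtype and ignore the stored pandas metadata, so the nested type is
# never round-tripped through its string form.
table = pq.read_table("data.parquet")
df = table.to_pandas(types_mapper=pd.ArrowDtype, ignore_metadata=True)
print(df.dtypes)  # Pyarrow    list<item: string>[pyarrow]
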
mroeschke commented 1 year ago

From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an ArrowDtype here

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    809     table = _add_any_metadata(table, pandas_metadata)
    810     table, index = _reconstruct_index(table, index_descriptors,
    811                                       all_columns)
--> 812     ext_columns_dtypes = _get_extension_dtypes(
    813         table, all_columns, types_mapper)
    814 else:
    815     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    860 dtype = col_meta['numpy_type']
    862 if dtype not in _pandas_supported_numpy_types:
    863     # pandas_dtype is expensive, so avoid doing this for types
    864     # that are certainly numpy dtypes
--> 865     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    866     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    867         if hasattr(pandas_dtype, "__from_arrow__"):

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
   1621     with warnings.catch_warnings():
   1622         # GH#51523 - Series.astype(np.integer) doesn't show
   1623         # numpy deprecation warning of np.integer
   1624         # Hence enabling DeprecationWarning
   1625         warnings.simplefilter("always", DeprecationWarning)
-> 1626         npdtype = np.dtype(dtype)
   1627 except SyntaxError as err:
   1628     # np.dtype uses `eval` which can raise SyntaxError
   1629     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: string>[pyarrow]' not understood
danielhanchen commented 1 year ago

Hmm, so I looked at the Pandas code, and I'm not sure using pd.ArrowDtype(dtype) will work.

The issue is that data.schema.pandas_metadata['columns'][7]["numpy_type"] is a str and not an actual type object, and pd.ArrowDtype does not accept strings.

e.g.:

dt = A.schema.pandas_metadata['columns'][7]["numpy_type"]

returns:

'list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]'

and using

pd.ArrowDtype(dt)

fails since it's a string.

I think the better approach would be to pass in not just data.schema.pandas_metadata['columns'][j]["numpy_type"] but also data.schema.types, since the latter has the actual type objects, which can be converted into a pd.ArrowDtype.
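
A rough sketch of that idea (hypothetical, reusing the data.parquet file from the original example): pair each pandas-metadata column entry with the real Arrow field from the schema, so the actual type object is available instead of its string form.

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
columns_metadata = table.schema.pandas_metadata["columns"]
# Zip the metadata entries with the schema fields (the same zip used in the
# patch suggested further down) and build pd.ArrowDtype from the real type.
for col_meta, field in zip(columns_metadata, table.schema):
    print(col_meta["numpy_type"], "->", pd.ArrowDtype(field.type))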

phofl commented 1 year ago

I think this behaves as expected. You can pass dtype_backend="pyarrow" to keep the list dtype

danielhanchen commented 1 year ago

@phofl

Oh oops, I forgot to mention I tried pd.read_parquet(..., dtype_backend = "pyarrow"), and the TypeError still exists. The error is exactly the same, since the dtype string is still passed to np.dtype.

danielhanchen commented 1 year ago

Confirmed it still fails:

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet", dtype_backend = "pyarrow") # *** FAIL
phofl commented 1 year ago

Interesting,

This one works:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.to_parquet("data.parquet")
pd.read_parquet("data.parquet", dtype_backend = "pyarrow")
danielhanchen commented 1 year ago

Ye that works since it's an object - Pyarrow indeed saves the data inside the parquet file as list[string].

The issue is that if you explicitly pass list[string] directly, it does not work.

Ie:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.dtypes

returns

Pyarrow    object
dtype: object
danielhanchen commented 1 year ago

In fact the object schema is converted:

pa.parquet.read_table("data.parquet")

returns

pyarrow.Table
Pyarrow: list<item: string>
  child 0, item: string
----
Pyarrow: [[["a"],["a","b"]]]
danielhanchen commented 1 year ago

Maybe a try/except, so as not to break other parts of the Pandas repo?

https://github.com/apache/arrow/blob/a77aab07b02b7d0dd6bd9c9a11c4af067d26b674/python/pyarrow/pandas_compat.py#L855

    # infer the extension columns from the pandas metadata
>>for col_meta, field in zip(columns_metadata, table.schema):
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
>>       try:
>>           pandas_dtype = _pandas_api.pandas_dtype(dtype)
>>       except:
>>           pandas_dtype = pd.ArrowDtype(field.type)

            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype
takacsd commented 1 year ago

Run into the same issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS

df.to_parquet('test.parquet', store_schema=False)  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

I think the last case was not mentioned so far.

danielhanchen commented 1 year ago

@takacsd oh interesting - so it's possible it's the schema-storing component that's wrong?

takacsd commented 1 year ago

@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created with something else (e.g. AWS Athena), pandas can read it just fine.

pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

pq.write_table(pa.Table.from_pandas(df), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS

This is the pandas metadata btw:

{'column_indexes': [{'field_name': None,
                     'metadata': {'encoding': 'UTF-8'},
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'a',
              'metadata': None,
              'name': 'a',
              'numpy_type': 'list<item: string>[pyarrow]',   # <---- this causes the error
              'pandas_type': 'list[unicode]'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range',
                    'name': None,
                    'start': 0,
                    'step': 1,
                    'stop': 2}],
 'pandas_version': '2.0.1'}

In the case of a simple 'numpy_type': 'int64[pyarrow]' type everything works, so I suspect the _pandas_api.pandas_dtype(dtype) doesn't support complex types (yet).
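
A quick way to see that asymmetry from the pandas side (a sketch, assuming pandas 2.x with pyarrow installed):

import pandas as pd

# Simple pyarrow-backed dtype strings resolve through the extension registry...
pd.api.types.pandas_dtype("int64[pyarrow]")               # ArrowDtype(int64)

# ...but the nested type string stored in the parquet metadata does not, and
# eventually falls through to np.dtype(), which raises TypeError.
pd.api.types.pandas_dtype("list<item: string>[pyarrow]")  # TypeError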

danielhanchen commented 1 year ago

@takacsd oh yep, your reasoning sounds right - so I think adding a simple try/except might be an easy fix? Try the pandas dtype lookup first, and if it fails, fall back to pd.ArrowDtype.

danielhanchen commented 1 year ago

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type. Due to the infinite nature of possible Arrow datatypes, I guess it's not feasible to update the dictionary, so maybe the try/except solution is the only reasonable one?

Just my two cents.

takacsd commented 1 year ago

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type.

It seems a little more complicated than that: pandas_dtype looks up a registry. This iterates through the registered ExtensionDtypes and tries to make sense of the string. The ExtensionDtype should understand it but doesn't, because pa.type_for_alias(base_type) only understands basic types.
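
A minimal illustration of that limitation (a sketch, nothing thread-specific assumed):

import pyarrow as pa

pa.type_for_alias("int64")               # DataType(int64) - basic aliases resolve
pa.type_for_alias("list<item: string>")  # ValueError - nested types have no alias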

We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...

danielhanchen commented 1 year ago

@takacsd The issue, though, is that timestamps are reasonably easy to construct from text.

The below could all be possible though:

list[list[struct[int, float]]]
list[int]
struct[list[datetime]]

Constructing Arrow dtypes from that could be potentially problematic.

I guess in theory one can iterate through the string and create a string which you can then call eval on, i.e.:

list[struct[int32, string]] becomes pa.list_(pa.struct((pa.int32(), pa.string()))), and then you can eval it.

I think a wiser approach would be to use the Arrow dtype from data.schema.types then call pd.ArrowDtype on it

takacsd commented 1 year ago

@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it.

Properly parsing the string is obviously harder, but I still think it is the better solution...

danielhanchen commented 1 year ago

@takacsd agreed parsing the metadata string is the correct way.

I thought about how one would go about doing it. E.g. take: list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]

You'd have to first find the innermost enclosed <...>, then continuously parse outwards. I.e. a string converter has to find the inner-most enclosed data-type, convert it, then expand outwards, all wrapped in a while loop.

The body of that while loop looks something like this:

left_pointer = 0
bracket_end   = dt.find(">") + 1
bracket_start = dt.rfind("<", left_pointer, bracket_end)
bracket_start = dt.rfind(" ", left_pointer, bracket_start) + 1
partial_dt = dt[bracket_start : bracket_end]
partial_dt = _partial_convert_dt(partial_dt)

and

import re

def _partial_convert_dt(partial_dt):
    if partial_dt.startswith("dictionary"):
        value_type = re.findall(r'values=([^,]{1,})',  partial_dt)[0]
        index_type = re.findall(r'indices=([^,]{1,})', partial_dt)[0]
        ordered    = re.findall(r'ordered=([\d])',     partial_dt)[0]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        if not index_type.startswith("pa."): index_type = f"pa.{index_type}()"
        partial_dt = f"pa.dictionary(value_type = {value_type}, index_type = {index_type}, ordered = {ordered})"
    elif partial_dt.startswith("list"):
        value_type = partial_dt[partial_dt.find(" ")+1 : -1]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        partial_dt = f"pa.list_({value_type})"
    elif partial_dt.startswith("struct"):
        struct_part = partial_dt[len("struct<"):-1]
        all_structs = struct_part.split(", ")
        converted = []
        for struct in all_structs:
            name, type = struct.split(": ")
            if not type.startswith("pa."): type = f"pa.{type}()"
            converted.append(f"('{name}', {type})")
        partial_dt = f"pa.struct(({', '.join(converted)}))"
    return partial_dt
pass

The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types.

The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.

danielhanchen commented 1 year ago

Actually a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

However, this doesn't work with struct data-types, since struct also keeps note of each field name.

This means a struct field could have dictionary as its name, which means using eval will fail.

This probably means string parsing won't work for structs, but works for everything else. I still believe a try/except is the simplest solution. Python 3.11 now has zero-cost exceptions, which means that if the conversion fails and reaches the except branch, it'll be slower. The alternative is refactoring the code to pass in the actual Arrow data-type, and if the string data-type does not exist in the registry, we output the Arrow data-type.

takacsd commented 1 year ago

Yeah, after some experimenting, I think we need to give up on parsing the type string:

These two:

pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))

both have the following type string: struct<a: int64, b: int64>[pyarrow].

But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexps, but I gave up. We need a balancing matcher or a recursive pattern to match the nested <> pairs properly, but neither is supported by the built-in regexp module. And I don't feel like we should write a full-blown recursive descent parser for this one.

The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...

takacsd commented 1 year ago

I was bored:

import re
from typing import NamedTuple

import pyarrow as pa


class ParseFail(Exception):
    pass

class Parsed(NamedTuple):
    type: pa.DataType
    end: int

class TypeStringParser:

    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]' to support weird names in struct

    def __init__(self, type_str: str) -> None:
        self.type_str = type_str

    def parse(self) -> pa.DataType:
        try:
            parsed = self.type(0)
        except ParseFail:
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        if parsed.end != len(self.type_str):
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        return parsed.type

    def type(self, pos: int) -> Parsed:
        try:
            return self.basic_type(pos)
        except ParseFail:
            pass

        try:
            return self.timestamp(pos)
        except ParseFail:
            pass

        try:
            return self.list(pos)
        except ParseFail:
            pass

        try:
            return self.dictionary(pos)
        except ParseFail:
            pass

        try:
            return self.struct(pos)
        except ParseFail:
            pass

        raise ParseFail()

    def basic_type(self, pos: int) -> Parsed:
        match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def timestamp(self, pos: int) -> Parsed:
        match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def list(self, pos: int) -> Parsed:
        pos = self.accept('list<', pos)
        match = self.NAME_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        pos = self.accept(': ', match.end(0))
        item = self.type(pos)
        pos = self.accept('>', item.end)
        return Parsed(pa.list_(item.type), pos)

    def dictionary(self, pos: int) -> Parsed:
        pos = self.accept('dictionary<values=', pos)
        values = self.type(pos)
        pos = self.accept(', indices=', values.end)
        indices = self.type(pos)
        pos = self.accept(', ordered=', indices.end)
        try:
            pos = self.accept('0', pos)
            ordered = False
        except ParseFail:
            pos = self.accept('1', pos)
            ordered = True
        pos = self.accept('>', pos)
        return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)

    def struct(self, pos: int) -> Parsed:
        pos = self.accept('struct<', pos)
        elements = []
        while self.type_str[pos] != '>':
            match = self.NAME_MATCHER.match(self.type_str, pos)
            if match is None:
                raise ParseFail()
            element_name = match.group(0)
            pos = self.accept(': ', match.end(0))
            element_type = self.type(pos)
            pos = element_type.end
            if self.type_str[pos] != '>':
                pos = self.accept(', ', pos)
            elements.append((element_name, element_type.type))
        pos = self.accept('>', pos)
        return Parsed(pa.struct(elements), pos)

    def accept(self, term: str, pos: int) -> int:
        if self.type_str.startswith(term, pos):
            return pos + len(term)
        raise ParseFail()

Probably not the prettiest recursive descent parser in existence, but it does parse arbitrarily nested types. The only restriction I know of is that the names in the structs need to be alphanumeric.
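
Assuming the class above is in scope, usage on bare type strings (without the [pyarrow] suffix) would look roughly like this:

import pyarrow as pa

assert TypeStringParser("list<item: string>").parse() == pa.list_(pa.string())
assert TypeStringParser("struct<rank: uint8, caption: string>").parse() == pa.struct(
    {"rank": pa.uint8(), "caption": pa.string()}
)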

danielhanchen commented 1 year ago

@takacsd Nice work on the parser! :) Ye, struct is the biggest issue, since it can contain arbitrary column names. It gets worse if something like struct<struct : uint8> exists - yikes, that'll be a pain.

Also, I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does parse Arrow dtypes! The issue is that the code before it breaks, so execution never gets there.

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

The issue is https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:

    # infer the extension columns from the pandas metadata
    for col_meta in columns_metadata:
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            pandas_dtype = _pandas_api.pandas_dtype(dtype)               >>>>>>>> BREAKS (A)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

I think I might have fixed it WITHOUT using try except

    # infer the extension columns from the pandas metadata
    schema = table.schema                                                    <<<<
    for col_meta, field in zip(columns_metadata, schema):                    <<<<
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            if dtype.endswith("[pyarrow]"):                                  <<<< (1)
                pandas_dtype = pd.ArrowDtype(field.type)
            elif dtype == "string":                                          <<<< (2)
                pandas_dtype = pd.ArrowDtype(pa.string())
            else:                                                            <<<< (3)
                pandas_dtype = _pandas_api.pandas_dtype(dtype)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

We push the original call (A) down to line (3); then, if a data-type string ends with [pyarrow], we build pd.ArrowDtype from the schema field's type (1). For plain strings, we skip the pandas parser and handle them on the fast path (2).

This also means

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

can be deleted - it's redundant, since we folded that logic into the loop above.

bretttully commented 1 year ago

I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend.

Here is another minimal example to add to the mix that fails on reading df2.

import io

import numpy as np
import pandas as pd
import pyarrow as pa

def main():
    df0 = pd.DataFrame(
        [
            {"foo": {"bar": True, "baz": np.float32(1)}},
            {"foo": {"bar": True, "baz": None}},
        ],
    )
    schema = pa.schema(
        [
            pa.field(
                "foo",
                pa.struct(
                    [
                        pa.field("bar", pa.bool_(), nullable=False),
                        pa.field("baz", pa.float32(), nullable=True),
                    ],
                ),
            ),
        ],
    )
    print(schema)
    with io.BytesIO() as stream0, io.BytesIO() as stream1:
        kwargs = {
            "engine": "pyarrow",
            "compression": "zstd",
            "schema": schema,
            "row_group_size": 2_000,
        }
        print("Writing df0")
        df0.to_parquet(stream0, **kwargs)

        print("Reading df1")
        stream0.seek(0)
        df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")

        print("Writing df1")
        df1.to_parquet(stream1, **kwargs)

        print("Reading df2")
        stream1.seek(0)
        df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")

if __name__ == "__main__":
    main()

Using df2 = pq.read_table(stream1).to_pandas(ignore_metadata=True) works for all of the reasons mentioned in the thread.

giftculture commented 9 months ago

I'm running into this issue as well:

(screenshot: pd.read_parquet(..., dtype_backend="pyarrow") raising the same TypeError)

giftculture commented 9 months ago

INSTALLED VERSIONS

commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457 python : 3.11.7.final.0 python-bits : 64 OS : Linux OS-release : 4.13.9-300.fc27.x86_64 Version : #1 SMP Mon Oct 23 13:41:58 UTC 2017 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.2.0 numpy : 1.26.3 pytz : 2023.4 dateutil : 2.8.2 setuptools : 69.0.3 pip : 23.3.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.9 jinja2 : 3.1.3 IPython : 8.20.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat : None fastparquet : 2023.10.1 fsspec : 2023.12.2 gcsfs : None matplotlib : 3.8.2 numba : 0.58.1 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 15.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.12.0 sqlalchemy : 2.0.25 tables : None tabulate : None xarray : 2024.1.1 xlrd : None zstandard : None tzdata : 2023.4 qtpy : 2.4.1 pyqt5 : None

jborman-stonex commented 7 months ago

The only "workaround" at the pandas-level I've found is to set df.to_parquet(..., store_schema=False) for a df containing complex/nested types like a categorical. Are there any plans to get successful dtype roundtripping going forward? Is there anything in the pyarrow library we can leverage here?

Versions:

``` INSTALLED VERSIONS ------------------ commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2 python : 3.11.6.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United States.1252 pandas : 2.2.1 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 65.5.0 pip : 23.2.1 Cython : None pytest : 8.1.1 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.9 jinja2 : None IPython : 8.23.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 15.0.2 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None ```
phofl commented 7 months ago

I would recommend using dtype_backend="pyarrow"

giftculture commented 7 months ago

I would recommend using dtype_backend="pyarrow"

@phofl, not sure if you saw from my screenshot, but I did apply dtype_backend="pyarrow" to the read_parquet method and it still fails, unless I am misunderstanding your suggestion

bretttully commented 7 months ago

The only workaround I have found so far is the following (which works in all cases I have thought of, except round-tripping an empty dataframe with a struct or list type, setting the schema, and not using dtype_backend="pyarrow" when reading back in).

Would definitely welcome suggested improvements to this workaround!

Obviously you can write this differently if you don't want a byte string returned, but for us that's what we want.

    def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
        """see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
        with io.BytesIO() as stream:

            schema = kwargs.pop("schema", None)
            all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
            # An empty dataframe may use default dtypes that are incompatible with the schema.
            # In this case, first cast to object, as the schema can always convert that to the correct type.
            if len(data) == 0 and schema is not None and not all_arrow_types:
                data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
            table = pa.Table.from_pandas(data, schema=schema)

            # drop pandas from the schema metadata to work around the bug where you can't read struct columns with
            # pandas metadata
            # see https://github.com/pandas-dev/pandas/issues/53011
            metadata = table.schema.metadata
            if b"pandas" in metadata and b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]:
                del metadata[b"pandas"]
                table = table.replace_schema_metadata(metadata)

            pq.write_table(table, stream, **kwargs)
            return stream.getvalue()
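
A matching read-side helper (hypothetical, not part of the snippet above) could ignore the stored metadata on the way back in, as noted earlier in the thread:

import io

import pandas as pd
import pyarrow.parquet as pq

def deserialize(blob: bytes) -> pd.DataFrame:
    """Read parquet bytes back, bypassing the stored pandas metadata."""
    with io.BytesIO(blob) as stream:
        table = pq.read_table(stream)
    return table.to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)
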
kinianlo commented 6 months ago

This bug is very similar to #57411 where the data type is list[int] instead of list[str].

judahrand commented 2 months ago

Actually a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

Additionally, list<element: can also be list<item: depending on how the Parquet was written out (see the use_compliant_nested_type argument to pyarrow.parquet.ParquetWriter). And within Arrow itself the containing name can be arbitrary (i.e. pyarrow.list_(pyarrow.field('some_name', pyarrow.int64())) is a valid type). So the replace would have to be a regex, something like r'list<(.+):', with the capturing group grabbing the name of the field.
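
A small illustration of that point (hypothetical field name):

import re

import pyarrow as pa

# The list element can carry an arbitrary field name, so only a regex capture
# of everything before the colon recovers it.
t = pa.list_(pa.field("some_name", pa.int64()))
print(str(t))                                    # list<some_name: int64>
print(re.match(r"list<(.+):", str(t)).group(1))  # some_name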

bretttully commented 1 week ago

Matching arrow ticket: https://github.com/apache/arrow/issues/39914 and potential PR: https://github.com/apache/arrow/pull/44720

jorisvandenbossche commented 1 week ago

Only seeing this long ticket now... FWIW, I think this is another good reason why it would be good for pandas to have tighter control over the pandas<->arrow conversion (https://github.com/pandas-dev/pandas/issues/59780)