Open danielhanchen opened 1 year ago
I found that during the PyArrow conversion, if you pass in a types_mapper and set ignore_metadata to True, it works!
mapping = {schema.type : pd.ArrowDtype(schema.type) for schema in data.schema}
data.to_pandas(types_mapper = mapping.get, ignore_metadata = True)
From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an ArrowDtype here:
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
809 table = _add_any_metadata(table, pandas_metadata)
810 table, index = _reconstruct_index(table, index_descriptors,
811 all_columns)
--> 812 ext_columns_dtypes = _get_extension_dtypes(
813 table, all_columns, types_mapper)
814 else:
815 index = _pandas_api.pd.RangeIndex(table.num_rows)
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
860 dtype = col_meta['numpy_type']
862 if dtype not in _pandas_supported_numpy_types:
863 # pandas_dtype is expensive, so avoid doing this for types
864 # that are certainly numpy dtypes
--> 865 pandas_dtype = _pandas_api.pandas_dtype(dtype)
866 if isinstance(pandas_dtype, _pandas_api.extension_dtype):
867 if hasattr(pandas_dtype, "__from_arrow__"):
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()
File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
1621 with warnings.catch_warnings():
1622 # GH#51523 - Series.astype(np.integer) doesn't show
1623 # numpy deprecation warning of np.integer
1624 # Hence enabling DeprecationWarning
1625 warnings.simplefilter("always", DeprecationWarning)
-> 1626 npdtype = np.dtype(dtype)
1627 except SyntaxError as err:
1628 # np.dtype uses `eval` which can raise SyntaxError
1629 raise TypeError(f"data type '{dtype}' not understood") from err
TypeError: data type 'list<item: string>[pyarrow]' not understood
Hmm, so I looked at the Pandas code, and I'm not sure if using pd.ArrowDtype(dtype) will work.
The issue is that data.schema.pandas_metadata['columns'][7]["numpy_type"] is a str and not an actual type object, and pd.ArrowDtype does not accept strings.
E.g.:
dt = A.schema.pandas_metadata['columns'][7]["numpy_type"]
returns:
'list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]'
and using pd.ArrowDtype(dt) fails since it's a string.
I think the better approach would be to not just pass in data.schema.pandas_metadata['columns'][j]["numpy_type"], but also data.schema.types, since the latter has the actual type objects which can be converted into pd.ArrowDtype objects.
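For example, a minimal sketch of that idea (assuming data is the pyarrow.Table read back from the data.parquet file in the repro; this mirrors the types_mapper workaround from the first comment):
import pandas as pd
import pyarrow.parquet as pq

# Build the ArrowDtype objects from the real schema types instead of the
# stringified "numpy_type" pandas metadata, then bypass that metadata.
data = pq.read_table("data.parquet")
type_map = {field.type: pd.ArrowDtype(field.type) for field in data.schema}
df = data.to_pandas(types_mapper=type_map.get, ignore_metadata=True)
print(df.dtypes)  # the list column keeps its Arrow-backed dtype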
I think this behaves as expected. You can pass dtype_backend="pyarrow" to keep the list dtype.
@phofl
Oh oops, I forgot to mention that I tried pd.read_parquet(..., dtype_backend="pyarrow"), and the TypeError still exists. The error is exactly the same, since it passes the dtype to np.dtype.
Confirmed it still fails:
import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
"Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet", dtype_backend = "pyarrow") # *** FAIL
Interesting, this one works:
data = pd.DataFrame({
"Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.to_parquet("data.parquet")
pd.read_parquet("data.parquet", dtype_backend = "pyarrow")
Ye that works since it's an object - Pyarrow indeed saves the data inside the parquet file as list[string].
The issue is that if you explicitly specify list[string] directly, it does not work. I.e.:
data = pd.DataFrame({
"Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.dtypes
returns
Pyarrow object
dtype: object
In fact the object
schema is converted:
pa.parquet.read_table("data.parquet")
returns
pyarrow.Table
Pyarrow: list<item: string>
child 0, item: string
----
Pyarrow: [[["a"],["a","b"]]]
Maybe a try/except, so as not to break other parts of the Pandas repo?
# infer the extension columns from the pandas metadata
>>for col_meta, field in zip(columns_metadata, table.schema):
try:
name = col_meta['field_name']
except KeyError:
name = col_meta['name']
dtype = col_meta['numpy_type']
if dtype not in _pandas_supported_numpy_types:
# pandas_dtype is expensive, so avoid doing this for types
# that are certainly numpy dtypes
>> try:
>> pandas_dtype = _pandas_api.pandas_dtype(dtype)
>> except:
>> pandas_dtype = pd.ArrowDtype(field.type)
if isinstance(pandas_dtype, _pandas_api.extension_dtype):
if hasattr(pandas_dtype, "__from_arrow__"):
ext_columns[name] = pandas_dtype
Ran into the same issue:
df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})
df.to_parquet('test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # *** FAIL
df.to_parquet('test.parquet') # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype) # SUCCESS
df.to_parquet('test.parquet', store_schema=False) # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS
I think the last case was not mentioned so far.
@takacsd oh interesting - so it's possible it's the schema-storing component that's wrong?
@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created with something else (e.g. AWS Athena), it reads just fine.
pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS
pq.write_table(pa.Table.from_pandas(df), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # *** FAIL
pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS
This is the pandas metadata btw:
{'column_indexes': [{'field_name': None,
'metadata': {'encoding': 'UTF-8'},
'name': None,
'numpy_type': 'object',
'pandas_type': 'unicode'}],
'columns': [{'field_name': 'a',
'metadata': None,
'name': 'a',
'numpy_type': 'list<item: string>[pyarrow]', # <---- this causes the error
'pandas_type': 'list[unicode]'}],
'creator': {'library': 'pyarrow', 'version': '11.0.0'},
'index_columns': [{'kind': 'range',
'name': None,
'start': 0,
'step': 1,
'stop': 2}],
'pandas_version': '2.0.1'}
In the case of a simple 'numpy_type': 'int64[pyarrow]' type everything works, so I suspect _pandas_api.pandas_dtype(dtype) doesn't support complex types (yet).
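A quick way to see this (a hedged check against pandas 2.x behaviour, not code from the thread):
import pandas as pd

# The flat Arrow-backed dtype string is understood...
pd.api.types.pandas_dtype("int64[pyarrow]")  # -> int64[pyarrow] (ArrowDtype)
# ...but the nested one stored in the pandas metadata is not.
pd.api.types.pandas_dtype("list<item: string>[pyarrow]")  # raises TypeError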
@takacsd oh yep your reasoning sounds right - so I think adding a simple try/except might be a reasonable fix? Try the numpy path first, and if it fails, call pd.ArrowDtype.
The main issue I think is because dtype is a string I guess.
I'm not 100% sure about how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type. Due to the infinite nature of possible Arrow datatypes, I guess it's not feasible to update the dictionary, so maybe the try/except solution is the only reasonable one?
Just my two cents.
The main issue I think is because dtype is a string I guess. I'm not 100% sure about how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type.
It seems a little more complicated than that: pandas_dtype looks up a registry. This iterates through the registered ExtensionDtypes and tries to make sense of the string. The ExtensionDtype should understand it but doesn't, because pa.type_for_alias(base_type) only understands basic types.
We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...
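A small illustration of that limitation (assuming current pyarrow behaviour):
import pyarrow as pa

pa.type_for_alias("int64")               # DataType(int64)
pa.type_for_alias("string")              # DataType(string)
pa.type_for_alias("list<item: string>")  # raises ValueError - no alias for nested types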
@takacsd The thing is, though, timestamps can be constructed from text reasonably easily. All of the below could be possible, too:
list[list[struct[int, float]]]
list[int]
struct[list[datetime]]
Constructing Arrow dtypes from those strings could be problematic.
I guess in theory one can iterate through the string and build a string which you can then call eval on, i.e.:
list[struct[int32, string]]
becomes pa.list_(pa.struct((pa.int32(), pa.string()))), and then you can eval it.
I think a wiser approach would be to use the Arrow dtype from data.schema.types and then call pd.ArrowDtype on it.
@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it.
Properly parsing the string is obviously harder, but I still think it is the better solution...
@takacsd agreed, parsing the metadata string is the correct way.
I thought about how one would go about doing it. E.g. take:
list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]
You'd have to first find the type with the first closed >, and continuously parse outwards. I.e. if one makes a string converter, it has to find the inner-most enclosed data-type, then expand outwards, wrapping the whole thing in a while loop.
The while loop looks something like this:
left_pointer = 0
bracket_end = dt.find(">") + 1
bracket_start = dt.rfind("<", left_pointer, bracket_end)
bracket_start = dt.rfind(" ", left_pointer, bracket_start) + 1
partial_dt = dt[bracket_start : bracket_end]
partial_dt = _partial_convert_dt(partial_dt)
and
def _partial_convert_dt(partial_dt):
if partial_dt.startswith("dictionary"):
value_type = re.findall('values=([^,]{1,})', partial_dt)[0]
index_type = re.findall('indices=([^,]{1,})', partial_dt)[0]
ordered = re.findall('ordered=([\d])', partial_dt)[0]
if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
if not index_type.startswith("pa."): index_type = f"pa.{index_type}()"
partial_dt = f"pa.dictionary(value_type = {value_type}, index_type = {index_type}, ordered = {ordered})"
elif partial_dt.startswith("list"):
value_type = partial_dt[partial_dt.find(" ")+1 : -1]
if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
partial_dt = f"pa.list_({value_type})"
elif partial_dt.startswith("struct"):
struct_part = partial_dt[len("struct<"):-1]
all_structs = struct_part.split(", ")
converted = []
for struct in all_structs:
name, type = struct.split(": ")
if not type.startswith("pa."): type = f"pa.{type}()"
converted.append(f"('{name}', {type})")
partial_dt = f"pa.struct(({', '.join(converted)}))"
return partial_dt
The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types.
The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.
Actually a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.
However, this doesn't work with struct data-types, since a struct also keeps note of each field name. This means a struct field could have dictionary as its name, which means using eval will fail.
This probably means string parsing won't work for structs, but works for everything else. I still believe a try/except is the simplest solution. Python 3.11 now has zero-cost exceptions, although if the conversion fails and execution reaches the except branch, it'll still be slower. The alternative would be to refactor the code to pass the Arrow data-type in, and if the data-type does not exist in the registry, output the Arrow data-type directly.
Yeah, after some experimenting, I think we need to give up on parsing the type string:
These two:
pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))
both have the following type string: struct<a: int64, b: int64>[pyarrow].
But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexps, but I gave up. We need a balancing matcher or a recursive pattern to match the nested <> pairs properly, but neither is supported by the built-in regexp module. And I don't feel like we should write a full-blown recursive descent parser for this one.
The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...
I was bored:
import re
from typing import NamedTuple

import pyarrow as pa

class ParseFail(Exception):
    pass
class Parsed(NamedTuple):
type: pa.DataType
end: int
class TypeStringParser:
BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
NAME_MATCHER = re.compile(r'\w+') # this can be r'[^:]' to support weird names in struct
def __init__(self, type_str: str) -> None:
self.type_str = type_str
def parse(self) -> pa.DataType:
try:
parsed = self.type(0)
except ParseFail:
raise ValueError(f"Can't parse '{self.type_str}' as a type.")
if parsed.end != len(self.type_str):
raise ValueError(f"Can't parse '{self.type_str}' as a type.")
return self.type(0).type
def type(self, pos: int) -> Parsed:
try:
return self.basic_type(pos)
except ParseFail:
pass
try:
return self.timestamp(pos)
except ParseFail:
pass
try:
return self.list(pos)
except ParseFail:
pass
try:
return self.dictionary(pos)
except ParseFail:
pass
try:
return self.struct(pos)
except ParseFail:
pass
raise ParseFail()
def basic_type(self, pos: int) -> pa.DataType:
match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
if match is None:
raise ParseFail()
try:
return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
except ValueError:
pass
raise ParseFail()
def timestamp(self, pos: int) -> pa.DataType:
match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
if match is None:
raise ParseFail()
try:
return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
except ValueError:
pass
raise ParseFail()
def list(self, pos: int) -> pa.DataType:
pos = self.accept('list<', pos)
match = self.NAME_MATCHER.match(self.type_str, pos)
if match is None:
raise ParseFail()
pos = self.accept(': ', match.end(0))
item = self.type(pos)
pos = self.accept('>', item.end)
return Parsed(pa.list_(item.type), pos)
def dictionary(self, pos: int) -> pa.DataType:
pos = self.accept('dictionary<values=', pos)
values = self.type(pos)
pos = self.accept(', indices=', values.end)
indices = self.type(pos)
pos = self.accept(', ordered=', indices.end)
try:
pos = self.accept('0', pos)
ordered = False
except ParseFail:
pos = self.accept('1', pos)
ordered = True
pos = self.accept('>', pos)
return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)
def struct(self, pos: int) -> pa.DataType:
pos = self.accept('struct<', pos)
elements = []
while self.type_str[pos] != '>':
match = self.NAME_MATCHER.match(self.type_str, pos)
if match is None:
raise ParseFail()
element_name = match.group(0)
pos = self.accept(': ', match.end(0))
element_type = self.type(pos)
pos = element_type.end
if self.type_str[pos] != '>':
pos = self.accept(', ', pos)
elements.append((element_name, element_type.type))
pos = self.accept('>', pos)
return Parsed(pa.struct(elements), pos)
def accept(self, term: str, pos: int) -> int:
if self.type_str.startswith(term, pos):
return pos + len(term)
raise ParseFail()
Probably not the prettiest recursive descent parser in existence, but it does parse arbitrarily nested types. The only restriction that I know of is that the names in the structs need to be alphanumeric.
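For what it's worth, a hypothetical usage of the sketch above (it assumes the class definition plus the pyarrow/pandas imports; the "[pyarrow]" suffix from the pandas metadata has to be stripped before parsing):
type_str = "list<item: string>[pyarrow]"
parsed = TypeStringParser(type_str.removesuffix("[pyarrow]")).parse()
assert parsed == pa.list_(pa.string())
dtype = pd.ArrowDtype(parsed)  # list<item: string>[pyarrow]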
@takacsd Nice work on the parser! :) Ye, struct is the biggest issue since it can contain arbitrary column names. It gets worse if something like struct<struct : uint8> exists - yikes, that'll be painful.
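To illustrate that ambiguity (a quick hypothetical check, not from the thread): a struct field that happens to be named struct produces a type string that reads like a nested type:
import pandas as pd
import pyarrow as pa

dtype = pd.ArrowDtype(pa.struct({"struct": pa.uint8()}))
print(dtype)  # struct<struct: uint8>[pyarrow] - field name vs. type name is ambiguous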
Also, I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does in fact parse Arrow dtypes! The issue is that the code before it breaks, so it never gets there.
for field in table.schema:
typ = field.type
if isinstance(typ, pa.BaseExtensionType):
try:
pandas_dtype = typ.to_pandas_dtype()
except NotImplementedError:
pass
else:
ext_columns[field.name] = pandas_dtype
The issue is https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:
# infer the extension columns from the pandas metadata
for col_meta in columns_metadata:
try:
name = col_meta['field_name']
except KeyError:
name = col_meta['name']
dtype = col_meta['numpy_type']
if dtype not in _pandas_supported_numpy_types:
# pandas_dtype is expensive, so avoid doing this for types
# that are certainly numpy dtypes
pandas_dtype = _pandas_api.pandas_dtype(dtype) >>>>>>>> BREAKS (A)
if isinstance(pandas_dtype, _pandas_api.extension_dtype):
if hasattr(pandas_dtype, "__from_arrow__"):
ext_columns[name] = pandas_dtype
I think I might have fixed it WITHOUT using try/except:
# infer the extension columns from the pandas metadata
schema = table.schema <<<<
for col_meta, field in zip(columns_metadata, schema): <<<<
try:
name = col_meta['field_name']
except KeyError:
name = col_meta['name']
dtype = col_meta['numpy_type']
if dtype not in _pandas_supported_numpy_types:
# pandas_dtype is expensive, so avoid doing this for types
# that are certainly numpy dtypes
if dtype.endswith("[pyarrow]"): pandas_dtype = pd.ArrowDtype(field.type) <<<< (1)
elif dtype == "string": pandas_dtype = pd.ArrowDtype(pa.string()) <<<< (2)
else: pandas_dtype = pd_dtype(dtype) <<<< (3)
if isinstance(pandas_dtype, _pandas_api.extension_dtype):
if hasattr(pandas_dtype, "__from_arrow__"):
ext_columns[name] = pandas_dtype
We push the original call (A) down to line (3). Then, if a string data-type has [pyarrow] as its suffix, we use pd.ArrowDtype. For strings, we ignore the pandas parser and just handle them in the fast path.
This also means
for field in table.schema:
typ = field.type
if isinstance(typ, pa.BaseExtensionType):
try:
pandas_dtype = typ.to_pandas_dtype()
except NotImplementedError:
pass
else:
ext_columns[field.name] = pandas_dtype
can be deleted - it's redundant, since we folded it into the code above.
I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend.
Here is another minimal example to add to the mix that fails on reading df2.
import io
import numpy as np
import pandas as pd
import pyarrow as pa
def main():
df0 = pd.DataFrame(
[
{"foo": {"bar": True, "baz": np.float32(1)}},
{"foo": {"bar": True, "baz": None}},
],
)
schema = pa.schema(
[
pa.field(
"foo",
pa.struct(
[
pa.field("bar", pa.bool_(), nullable=False),
pa.field("baz", pa.float32(), nullable=True),
],
),
),
],
)
print(schema)
with io.BytesIO() as stream0, io.BytesIO() as stream1:
kwargs = {
"engine": "pyarrow",
"compression": "zstd",
"schema": schema,
"row_group_size": 2_000,
}
print("Writing df0")
df0.to_parquet(stream0, **kwargs)
print("Reading df1")
stream0.seek(0)
df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")
print("Writing df1")
df1.to_parquet(stream1, **kwargs)
print("Reading df2")
stream1.seek(0)
df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")
if __name__ == "__main__":
main()
Using df2 = pq.read_table(stream1).to_pandas(ignore_metadata=True) works, for all of the reasons mentioned in the thread.
I'm running into this issue as well:
commit : f538741432edf55c6b9fb5d0d496d2dd1d7c2457
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.13.9-300.fc27.x86_64
Version : #1 SMP Mon Oct 23 13:41:58 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.4
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : 2023.10.1
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : 2024.1.1
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : 2.4.1
pyqt5 : None
The only "workaround" at the pandas level I've found is to set df.to_parquet(..., store_schema=False) for a df containing complex/nested types like a categorical. Are there any plans to get successful dtype round-tripping going forward? Is there anything in the pyarrow library we can leverage here?
I would recommend using dtype_backend="pyarrow"
I would recommend using dtype_backend="pyarrow"
@phofl, not sure if you saw from my screenshot, but I did apply dtype_backend="pyarrow" to the read_parquet method and it still fails, unless I am misunderstanding your suggestion.
The only workaround I have found so far is the following (which works in all cases I have thought of, except round-tripping an empty dataframe with a struct or list type, setting the schema, and not using dtype_backend="pyarrow" when reading back in).
Would def welcome suggested improvements to this workaround!
Obv you can write this differently if you don't want a byte string returned, but for us that's what we want.
def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
"""see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
with io.BytesIO() as stream:
schema = kwargs.pop("schema", None)
all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
# An empty dataframe may use default dtypes that are incompatible with the schema.
# In this case, first cast to object, as the schema can always convert that to the correct type.
if len(data) == 0 and schema is not None and not all_arrow_types:
data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
table = pa.Table.from_pandas(data, schema=schema)
# drop pandas from the schema metadata to work around the bug where you can't read struct columns with
# pandas metadata
# see https://github.com/pandas-dev/pandas/issues/53011
metadata = table.schema.metadata
if b"pandas" in metadata and (b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]):  # parenthesised so the check short-circuits when there is no pandas metadata
del metadata[b"pandas"]
table = table.replace_schema_metadata(metadata)
pq.write_table(table, stream, **kwargs)
return stream.getvalue()
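For anyone who wants to test the round-trip without the class, here is a self-contained sketch of the same idea (it simply drops the schema metadata entirely, as in the replace_schema_metadata() example earlier in the thread):
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"a": pd.Series([["a"], ["a", "b"]], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

with io.BytesIO() as stream:
    table = pa.Table.from_pandas(df)
    table = table.replace_schema_metadata()  # drop all metadata, including the pandas block
    pq.write_table(table, stream)
    stream.seek(0)
    restored = pd.read_parquet(stream, dtype_backend="pyarrow")

print(restored.dtypes)  # a    list<item: string>[pyarrow]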
This bug is very similar to #57411 where the data type is list[int] instead of list[str].
Actually a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.
Additionally, list<element: can also be list<item: depending on how the Parquet was written out (see the use_compliant_nested_type argument to pyarrow.parquet.ParquetWriter). And within Arrow itself the containing field name can be arbitrary (i.e. pyarrow.list_(pyarrow.field('some_name', pyarrow.int64())) is a valid type). So the replace would have to be a regex, something like r'list<(.+):' with the capturing group grabbing the name of the field.
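A tiny sketch of that regex idea (hedged; I made the group non-greedy so nested lists don't over-capture):
import re

LIST_FIELD = re.compile(r"list<(.+?):")

LIST_FIELD.match("list<item: string>").group(1)      # 'item'
LIST_FIELD.match("list<element: string>").group(1)   # 'element'
LIST_FIELD.match("list<some_name: int64>").group(1)  # 'some_name'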
Matching arrow ticket: https://github.com/apache/arrow/issues/39914 and potential PR: https://github.com/apache/arrow/pull/44720
Only seeing this long ticket now... FWIW I think this is another good reason why it would be good if pandas had tighter control over the pandas<->arrow conversion (https://github.com/pandas-dev/pandas/issues/59780)
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Great work on extending Arrow to Pandas! Using pd.ArrowDtype(pa.list_(pa.string())) or any other alteration works in the Parquet saving mode, but fails during the reading of the parquet file.
In fact, if there is a Pandas Series of pure lists of strings, e.g. ["a"], ["a", "b"], Parquet saves it internally as a list[string] type. When Pandas reads the parquet file, it then converts it to an object type.
Is there a way during the reading step to either:
1. keep it as an object type, OR
2. use pd.Series(pd.arrays.ArrowExtensionArray(x)), which seems to actually work! Maybe during the conversion from the internal Pyarrow representation into Pandas, we can use pd.Series(pd.arrays.ArrowExtensionArray(x)) on columns which had errors? OR
Expected Behavior
Installed Versions