pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Problem in assigning multiple items to ListType array members #53178

Open martinschi opened 1 year ago

martinschi commented 1 year ago

https://github.com/pandas-dev/pandas/blob/3827cafb2390824a05bc81d1bcc045c67e27606d/pandas/core/arrays/arrow/array.py#L1420

^^^ This check guards against assigning list-like objects to a single array element. However, if the array's own type is a ListType, the code should allow it; otherwise there seems to be no way to assign multiple items to a single array position.

E.g.,

import pyarrow as pa
import pandas as pd

list_str_type = pa.list_(pa.string())
ser = pd.Series([['foo'], ['bar'], ['foobar']], dtype=pd.ArrowDtype(list_str_type))

a = pa.array(['hello', 'world'], type=pa.string())
s = pa.scalar(['hello', 'world'], type=list_str_type)

# following fails with ValueError("Length of indexer and values mismatch")
ser[1] = a

# following also fails with ValueError("Length of indexer and values mismatch")
ser[1] = s

pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 37ea63d540fd27274cad6585082c91b1283f963d
python : 3.11.3.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 21:00:17 PST 2023; root:xnu-8796.101.5~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.1
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

topper-123 commented 1 year ago

thanks for the report, @martinschi.

Yes, the issue is that is_list_like(s) returns True.

I am not familiar with this data type (a list that is considered a scalar?), and it seems a bit unusual, being both scalar and list. Can you explain how it is used? Do you think this data type should always be treated like a scalar in pandas, i.e. is_list_like(s) should return False and is_scalar(s) should return True?

martinschi commented 1 year ago

Well, I'm new to this, so not sure ;-). I'm just trying to re-implement my legacy Pandas code with pyarrow-based series.

The legacy code assigns one or more string values to each series member, such as:

In [132]: ser = pd.Series([['foo'], ['bar'], ['foobar']])

In [133]: ser
Out[133]:
0       [foo]
1       [bar]
2    [foobar]
dtype: object

In [134]: ser[1] = ['hello','world']

In [135]: ser
Out[135]:
0             [foo]
1    [hello, world]
2          [foobar]
dtype: object

I'm trying to do the same with a pyarrow-backed series, but I can't figure out how to assign multiple values to a single series member. Whatever I try fails, including the two methods mentioned above and the plain Python list shown below:

In [136]: list_str_type = pa.list_(pa.string())
     ...: ser = pd.Series([['foo'], ['bar'], ['foobar']], dtype=pd.ArrowDtype(list_str_type))

In [137]: ser
Out[137]:
0       ['foo']
1       ['bar']
2    ['foobar']
dtype: list<item: string>[pyarrow]

In [138]: ser[1] = ['hello','world']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/series.py:1159, in Series.__setitem__(self, key, value)
   1158 try:
-> 1159     self._set_with_engine(key, value)
   1160 except KeyError:
   1161     # We have a scalar (or for MultiIndex or object-dtype, scalar-like)
   1162     #  key that is not present in self.index.

File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/series.py:1225, in Series._set_with_engine(self, key, value)
   1224 # this is equivalent to self._values[key] = value
-> 1225 self._mgr.setitem_inplace(loc, value)

File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/internals/managers.py:2042, in SingleBlockManager.setitem_inplace(self, indexer, value)
   2040     self._cache.clear()
-> 2042 super().setitem_inplace(indexer, value)

File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/internals/base.py:189, in SingleDataManager.setitem_inplace(self, indexer, value)
    187     value = np_can_hold_element(arr.dtype, value)
--> 189 arr[indexer] = value

File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:1371, in ArrowExtensionArray.__setitem__(self, key, value)
   1370 if is_list_like(value):
-> 1371     raise ValueError("Length of indexer and values mismatch")
   1372 elif isinstance(value, pa.Scalar):

Obviously, when a series has a "list of something" type, the value you assign to a series member will naturally be list-like.

topper-123 commented 1 year ago

Ok, thanks @martinschi.

Maybe this can be fixed by changing the last lines in your traceback to:

if isinstance(value, pa.Scalar):
    value = value.as_py()
if is_list_like(value):
    raise ValueError("Length of indexer and values mismatch")

However, this situation probably boils down to pa.list_(pa.string()) not being sufficiently supported in pandas. We could probably use some fixes here + tests.
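
To make the behaviour requested in this thread concrete, here is a toy model in plain Python. It is not pandas code: `toy_is_list_like`, `toy_setitem`, and the `elements_are_lists` flag are hypothetical names, and the guard is only an assumption about what a dtype-aware check could look like. In a real fix, the flag might be derived from something like `pa.types.is_list` on the array's Arrow type.

```python
def toy_is_list_like(obj):
    """Rough stand-in for a list-like check: iterable, but not str/bytes."""
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))


def toy_setitem(data, key, value, *, elements_are_lists):
    """Assign `value` at integer position `key` (hypothetical guard logic)."""
    if toy_is_list_like(value) and not elements_are_lists:
        # Same error the thread reports: a list-like value cannot be
        # stored in a single slot of a non-list array.
        raise ValueError("Length of indexer and values mismatch")
    # For a list-typed array, a list-like value is exactly one element.
    data[key] = list(value) if elements_are_lists else value


list_data = [["foo"], ["bar"], ["foobar"]]
toy_setitem(list_data, 1, ["hello", "world"], elements_are_lists=True)

flat_data = ["foo", "bar", "foobar"]
try:
    toy_setitem(flat_data, 1, ["hello", "world"], elements_are_lists=False)
    rejected = False
except ValueError:
    rejected = True
```

The point of the sketch is that the list-like check alone cannot decide whether to raise; the element type of the target array has to be consulted as well.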