Open martinschi opened 1 year ago
thanks for the report, @martinschi.
Yes, the issue is that is_list_like(s) returns True.
I am not familiar with this data type (a list that is considered a scalar?), and it seems a bit unusual, being both scalar and list. Can you explain how it is used? Do you think this data type should always be treated like a scalar in pandas, i.e. is_list_like(s) should return False and is_scalar(s) should return True?
Well, I'm new to this, so not sure ;-). I'm just trying to re-implement my legacy Pandas code with pyarrow-based series.
The legacy code assigns one or more string values to each series member, such as:
In [132]: ser = pd.Series([['foo'], ['bar'], ['foobar']])
In [133]: ser
Out[133]:
0 [foo]
1 [bar]
2 [foobar]
dtype: object
In [134]: ser[1] = ['hello','world']
In [135]: ser
Out[135]:
0 [foo]
1 [hello, world]
2 [foobar]
dtype: object
I'm trying to do the same with a pyarrow-based series, but I can't figure out how to assign multiple values to a single series member. Whatever I try fails, including the two methods mentioned above and the plain Python list below:
In [136]: list_str_type = pa.list_(pa.string())
...: ser = pd.Series([['foo'], ['bar'], ['foobar']], dtype=pd.ArrowDtype(list_str_type))
In [137]: ser
Out[137]:
0 ['foo']
1 ['bar']
2 ['foobar']
dtype: list<item: string>[pyarrow]
In [138]: ser[1] = ['hello','world']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/series.py:1159, in Series.__setitem__(self, key, value)
1158 try:
-> 1159 self._set_with_engine(key, value)
1160 except KeyError:
1161 # We have a scalar (or for MultiIndex or object-dtype, scalar-like)
1162 # key that is not present in self.index.
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/series.py:1225, in Series._set_with_engine(self, key, value)
1224 # this is equivalent to self._values[key] = value
-> 1225 self._mgr.setitem_inplace(loc, value)
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/internals/managers.py:2042, in SingleBlockManager.setitem_inplace(self, indexer, value)
2040 self._cache.clear()
-> 2042 super().setitem_inplace(indexer, value)
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/internals/base.py:189, in SingleDataManager.setitem_inplace(self, indexer, value)
187 value = np_can_hold_element(arr.dtype, value)
--> 189 arr[indexer] = value
File ~/.pyenv/versions/3.11.3/envs/hydro-v2/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:1371, in ArrowExtensionArray.__setitem__(self, key, value)
1370 if is_list_like(value):
-> 1371 raise ValueError("Length of indexer and values mismatch")
1372 elif isinstance(value, pa.Scalar):
Obviously, when you have a "list of something" series, the value you try to assign to a series member will naturally be list-like.
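A quick way to see why the check in the traceback fires: a plain Python list is list-like by pandas' own definition, regardless of the dtype of the series it is being assigned into. The snippet below is only an illustration using the public helpers in pandas.api.types; the variable name is made up for the example.

```python
from pandas.api.types import is_list_like, is_scalar

# The value from the failing assignment above.
value = ["hello", "world"]

print(is_list_like(value))    # True: any non-string iterable counts as list-like
print(is_scalar(value))       # False: a list is not a pandas scalar
print(is_list_like("hello"))  # False: strings are deliberately excluded
```

So an `is_list_like(value)` check alone cannot tell "one list element" apart from "several values to assign".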
Ok, thanks @martinschi.
Maybe this can be fixed by changing the last lines in your traceback to:
if isinstance(value, pa.Scalar):
    value = value.as_py()
if is_list_like(value):
    raise ValueError("Length of indexer and values mismatch")
However, this situation probably boils down to pa.list_(pa.string()) not being sufficiently supported in pandas. We could probably use some fixes here + tests.
https://github.com/pandas-dev/pandas/blob/3827cafb2390824a05bc81d1bcc045c67e27606d/pandas/core/arrays/arrow/array.py#L1420
^^^ This guards against assigning list-like objects to an array member. However, if the array itself is an array of ListType members, the code should allow it; otherwise there is seemingly no way to assign multiple items to a single array position.
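The relaxed guard described here could look roughly like the following. This is a standalone sketch in plain Python, not pandas' actual code: the function name normalize_setitem_value and its parameters key_is_scalar and element_is_list are hypothetical, chosen only to make the decision logic explicit.

```python
def normalize_setitem_value(value, n_positions, key_is_scalar, element_is_list):
    """Hypothetical guard: return one value per targeted position.

    value           -- what the caller wants to assign
    n_positions     -- how many array positions the key selects
    key_is_scalar   -- True if the key addresses exactly one position
    element_is_list -- True if each array element is itself a list
    """
    value_is_list_like = isinstance(value, (list, tuple))

    # Proposed relaxation: a list assigned to a single position of a
    # list-typed array is the element itself, not a sequence of elements.
    if key_is_scalar and element_is_list and value_is_list_like:
        return [value]

    # Otherwise keep the existing behaviour: lengths must match.
    if value_is_list_like:
        if len(value) != n_positions:
            raise ValueError("Length of indexer and values mismatch")
        return list(value)

    # A true scalar is broadcast to every targeted position.
    return [value] * n_positions


# A list-typed array accepts a list for a single position...
print(normalize_setitem_value(["hello", "world"], 1, True, True))

# ...while a non-list array still rejects it.
try:
    normalize_setitem_value(["hello", "world"], 1, True, False)
except ValueError as exc:
    print(exc)
```

The point of the sketch is that the array's element type, not just the shape of the value, has to take part in the decision.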
E.g.,