Open erik-hasse opened 3 years ago
I see the same issue when using pandas 1.3 from within Goldencheetah https://groups.google.com/g/golden-cheetah-developers/c/lpHNWKdNICY
The issue seems to arise from #41592
cc @jbrockmendel
I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "iter") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.
Would PySequence_Check(obj)
return True for this object?
I can confirm that it does. The following prints True
from cpython.sequence cimport PySequence_Check
from data_pb2 import Data
proto_data = Data(values=range(3))
print(PySequence_Check(proto_data.values))
When I change the check from hasattr(obj, "__iter__")
to PySequence_Check(obj)
we fail to identify zip
objects as listlike. We could make the check hasattr(obj, "__iter__") or PySequence_Check(obj)
. I don't know enough about PySequence_Check to be confident it won't produce false-positives. @erik-hasse any idea?
@jbrockmendel I came across this while looking for any issues that may be related to #43373. Modifying is_list_like
to look like this should work:
cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
cdef object iterfunc = getattr(obj, "__iter__", ...)
if iterfunc is None: # Explicitly not iterable
return False
return (
(iterfunc is not ... or PySequence_Check(obj))
and not isinstance(obj, type)
# we do not count strings/unicode/bytes as list-like
and not isinstance(obj, (str, bytes))
# exclude zero-dimensional numpy arrays, effectively scalars
and not cnp.PyArray_IsZeroDim(obj)
# exclude sets if allow_sets is False
and not (allow_sets is False and isinstance(obj, abc.Set))
)
Edit: Realized I forgot to explain myself. Using hasattr(obj, "__iter__") or PySequence_Check(obj)
or getattr(obj, "__iter__", None) is not None or PySequence_Check(obj)
would result in false positives when __iter__
is explicitly set to None
and __getitem__
is implemented. Therefore, we need to explicitly verify that __iter__
is not None
before proceeding. Since we needed a new sentinel value, I chose ...
but maybe that isn't the best way.
When using the builtin iter
function, this is taken care of in slot_tp_iter
when called (indirectly) by PyObject_GetIter
.
Modifying
is_list_like
to look like this should work
For what it's worth, the change proposed in https://github.com/pandas-dev/pandas/issues/42549#issuecomment-913012862 does fix my Goldencheetah use case.
is_listlike has been changed to use getattr(obj, "__iter__", None) is not None and ...
does this solve the original issue @erik-hasse ?
Unfortunately no, because the object in the original issue does not have an __iter__
attribute. I still see the results in the issue on the current master branch 1.4.0.dev0+839.gab727bfee4
.
As far as I can tell, PySequence_Check
and try/except-ing list(obj)
are the only ways to actually detect objects like this. But I definitely see the downsides of both.
@jbrockmendel That change is for when __iter__
is present but explicitly set to None. My usage for it is actually the opposite of @erik-hasse, where I had an object which implemented __getitem__
but was not iterable through indexing 0-N. His use case is an object that is only iterable through __getitem__
and doesn't have __iter__
implemented at all. The code I posted previously should solve both problems correctly.
@erik-hasse we'll need to check the perf impact but it sounds like PySequence_Check is the way to go here (definitely not list(obj)
). PR welcome.
@aiudirog it is not obvious to me that we should see an item as listlike if item.__iter__ is None
You shouldn't. It's explicitly not list like if __iter__
is None. That's how you explicitly disable iteration while still implementing __getitem__
.
Here you can see the source code for collectiona.abc.Iterable
which performs the check:
https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L265
https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L83
changing milestone to 1.3.5
good to fix but not needed for 1.3.x
Code Sample, a copy-pastable example
This requires both pandas and grpcio-tools.
as an MRE maybe...
import pandas as pd
print(pd.__version__)
class MySequence:
def __getitem__(self, key):
return range(3)[key]
def __len__(self):
return 3
my_sequence = MySequence()
df = pd.DataFrame(index=range(3), data={"a": my_sequence})
print(df)
on 1.2.5
1.2.5
a
0 0
1 1
2 2
and on pandas 1.3.x/master
1.4.0.dev0+1286.g9db55a7524
a
0 (0, 1, 2)
1 (0, 1, 2)
2 (0, 1, 2)
Modifying
is_list_like
to look like this should workFor what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.
tbc the code diff for the proposed change would be...
diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx
index f527882a9d..ecd2906744 100644
--- a/pandas/_libs/lib.pyx
+++ b/pandas/_libs/lib.pyx
@@ -1098,9 +1098,12 @@ def is_list_like(obj: object, allow_sets: bool = True) -> bool:
cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
+ cdef object iterfunc = getattr(obj, "__iter__", ...)
+ if iterfunc is None: # Explicitly not iterable
+ return False
return (
- # equiv: `isinstance(obj, abc.Iterable)`
- getattr(obj, "__iter__", None) is not None and not isinstance(obj, type)
+ (iterfunc is not ... or PySequence_Check(obj))
+ and not isinstance(obj, type)
# we do not count strings/unicode/bytes as list-like
and not isinstance(obj, (str, bytes))
# exclude zero-dimensional numpy arrays, effectively scalars
but this causes an infinite loop when testing with 1 core and node down for multicore for test_pprint_pathological_object
All other tests pass.
good to fix but not needed for 1.3.x
moving off milestone
The issue seems to arise from #41592.
first bad commit: [cec2f5fa639f707b9e8a1ad47b0e0f276c0a3679] REF: handle non-list_like cases upfront in sanitize_array (#38563)
In 1.2.x this object was handled by
_try_cast
in this else clause. After that change, it's handled byconstruct_1d_arraylike_from_scalar
becauseis_list_like(data)
returns False.
so it appears that the regression could have been caused by the change from lib.is_scalar(data) and index is not None and dtype is not None:
to elif not is_list_like(data):
causing the change from handled by _try_cast
to handled by construct_1d_arraylike_from_scalar
as identified. (i've not yet confirmed this)
also tbc is_list_like(my_sequence)
returned False
for the MRE from https://github.com/pandas-dev/pandas/issues/42549#issuecomment-984620539 in 1.2.5 also.
good to fix but not needed for 1.3.x
moving off milestone
tbc i've moved off the milestone since IMO the proposed change is not suitable for backport since not only is tested behavior changed (i.e. test_pprint_pathological_object
) but also the change is not directly related to the change that caused the regression. The proposed change could, assuming the recursion issue is sorted, be included for 1.4.
If we want to fix this for 1.3.x, we could probably consider re-instating the data = list(data)
materialization in sanitize_array
from #41592 or reverting the is_list_like
back to the is_scalar
check, the direct cause of the regression and not make any changes to is_list_like
in a patch release.
PRs welcome either way.
What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?
What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?
It is still an issue with pandas 2.0.0. It was never fixed.
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[x] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
This requires both pandas and grpcio-tools.
Problem description
On 1.3.0 and the master branch this code prints:
The issue seems to arise from #41592. In 1.2.x this object was handled by
_try_cast
in this else clause. After that change, it's handled byconstruct_1d_arraylike_from_scalar
becauseis_list_like(data)
returns False. Note that RepeatedScalarContainer implementsPyTypeObject.tp_as_sequence
but notPyTypeObject.tp_iter
, solist(proto_data.values)
works fine, but thehasattr(obj, "__iter__")
check inis_list_like
is False.Based on all this, I suspect that this same issue will occur on any object which implements
PyTypeObject.tp_as_sequence
but notPyTypeObject.tp_iter
, however this protobuf object the only example I have right now so I can't test further.I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of
obj
inis_list_like
? If so, that function could be amended to checkhasattr(obj, "__iter__") or hasattr(obj, "tp_as_sequence")
. If not I think the logic ofsanitize_array
needs to be amended.Expected Output
On 1.2.x this prints:
Output of
pd.show_versions()