pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.75k stars 17.96k forks source link

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

Open erik-hasse opened 3 years ago

erik-hasse commented 3 years ago

Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

import subprocess
import sys

import pandas as pd

proto = """
syntax="proto3";

message Data {
  repeated float values = 1;
}
"""

with open('data.proto', 'w') as f:
    f.write(proto)

subprocess.run([
    sys.executable, '-m', 'grpc_tools.protoc', '-I.',
    '--python_out=.', '--grpc_python_out=.', 'data.proto'
])

from data_pb2 import Data

proto_data = Data(values=range(3))
print(proto_data.values)
print(type(proto_data.values))
df = pd.DataFrame(index=range(3), data={'a': proto_data.values})
print(df)

Problem description

On 1.3.0 and the master branch this code prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
                 a
0  [0.0, 1.0, 2.0]
1  [0.0, 1.0, 2.0]
2  [0.0, 1.0, 2.0]

The issue seems to arise from #41592. In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False. Note that RepeatedScalarContainer implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, so list(proto_data.values) works fine, but the hasattr(obj, "__iter__") check in is_list_like is False.

Based on all this, I suspect that this same issue will occur on any object which implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, however this protobuf object the only example I have right now so I can't test further.

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "__iter__") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Expected Output

On 1.2.x this prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
     a
0  0.0
1  1.0
2  2.0

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : f00ed8f47020034e752baf0250483053340971b0 python : 3.8.6.final.0 python-bits : 64 OS : Darwin OS-release : 20.3.0 Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.3.0 numpy : 1.20.3 pytz : 2021.1 dateutil : 2.8.1 pip : 21.0.1 setuptools : 53.0.0 Cython : 0.29.22 pytest : 6.2.2 hypothesis : 6.3.4 sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.3 IPython : 7.20.0 pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : 3.3.4 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 3.0.0 pyxlsb : None s3fs : None scipy : 1.6.0 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None
MeisterP commented 3 years ago

I see the same issue when using pandas 1.3 from within Goldencheetah https://groups.google.com/g/golden-cheetah-developers/c/lpHNWKdNICY

simonjayhawkins commented 3 years ago

The issue seems to arise from #41592

cc @jbrockmendel

jbrockmendel commented 3 years ago

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "iter") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Would PySequence_Check(obj) return True for this object?

erik-hasse commented 3 years ago

I can confirm that it does. The following prints True

from cpython.sequence cimport PySequence_Check
from data_pb2 import Data

proto_data = Data(values=range(3))

print(PySequence_Check(proto_data.values))
jbrockmendel commented 3 years ago

When I change the check from hasattr(obj, "__iter__") to PySequence_Check(obj) we fail to identify zip objects as listlike. We could make the check hasattr(obj, "__iter__") or PySequence_Check(obj). I don't know enough about PySequence_Check to be confident it won't produce false-positives. @erik-hasse any idea?

aiudirog commented 3 years ago

@jbrockmendel I came across this while looking for any issues that may be related to #43373. Modifying is_list_like to look like this should work:

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    cdef object iterfunc = getattr(obj, "__iter__", ...)
    if iterfunc is None:  # Explicitly not iterable
        return False
    return (
        (iterfunc is not ... or PySequence_Check(obj))
        and not isinstance(obj, type)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not cnp.PyArray_IsZeroDim(obj)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

Edit: Realized I forgot to explain myself. Using hasattr(obj, "__iter__") or PySequence_Check(obj) or getattr(obj, "__iter__", None) is not None or PySequence_Check(obj) would result in false positives when __iter__ is explicitly set to None and __getitem__ is implemented. Therefore, we need to explicitly verify that __iter__ is not None before proceeding. Since we needed a new sentinel value, I chose ... but maybe that isn't the best way.

When using the builtin iter function, this is taken care of in slot_tp_iter when called (indirectly) by PyObject_GetIter.

MeisterP commented 3 years ago

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in https://github.com/pandas-dev/pandas/issues/42549#issuecomment-913012862 does fix my Goldencheetah use case.

jbrockmendel commented 3 years ago

is_listlike has been changed to use getattr(obj, "__iter__", None) is not None and ... does this solve the original issue @erik-hasse ?

erik-hasse commented 3 years ago

Unfortunately no, because the object in the original issue does not have an __iter__ attribute. I still see the results in the issue on the current master branch 1.4.0.dev0+839.gab727bfee4.

As far as I can tell, PySequence_Check and try/except-ing list(obj) are the only ways to actually detect objects like this. But I definitely see the downsides of both.

aiudirog commented 3 years ago

@jbrockmendel That change is for when __iter__ is present but explicitly set to None. My usage for it is actually the opposite of @erik-hasse, where I had an object which implemented __getitem__ but was not iterable through indexing 0-N. His use case is an object that is only iterable through __getitem__ and doesn't have __iter__ implemented at all. The code I posted previously should solve both problems correctly.

jbrockmendel commented 3 years ago

@erik-hasse we'll need to check the perf impact but it sounds like PySequence_Check is the way to go here (definitely not list(obj)). PR welcome.

@aiudirog it is not obvious to me that we should see an item as listlike if item.__iter__ is None

aiudirog commented 3 years ago

You shouldn't. It's explicitly not list like if __iter__ is None. That's how you explicitly disable iteration while still implementing __getitem__.

Here you can see the source code for collectiona.abc.Iterable which performs the check: https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L265

https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L83

simonjayhawkins commented 3 years ago

changing milestone to 1.3.5

jreback commented 2 years ago

good to fix but not needed for 1.3.x

simonjayhawkins commented 2 years ago

Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

as an MRE maybe...

import pandas as pd

print(pd.__version__)

class MySequence:
    def __getitem__(self, key):
        return range(3)[key]

    def __len__(self):
        return 3

my_sequence = MySequence()

df = pd.DataFrame(index=range(3), data={"a": my_sequence})
print(df)

on 1.2.5

1.2.5
   a
0  0
1  1
2  2

and on pandas 1.3.x/master

1.4.0.dev0+1286.g9db55a7524
           a
0  (0, 1, 2)
1  (0, 1, 2)
2  (0, 1, 2)
simonjayhawkins commented 2 years ago

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.

tbc the code diff for the proposed change would be...

diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx
index f527882a9d..ecd2906744 100644
--- a/pandas/_libs/lib.pyx
+++ b/pandas/_libs/lib.pyx
@@ -1098,9 +1098,12 @@ def is_list_like(obj: object, allow_sets: bool = True) -> bool:

 cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
+    cdef object iterfunc = getattr(obj, "__iter__", ...)
+    if iterfunc is None:  # Explicitly not iterable
+        return False
     return (
-        # equiv: `isinstance(obj, abc.Iterable)`
-        getattr(obj, "__iter__", None) is not None and not isinstance(obj, type)
+        (iterfunc is not ... or PySequence_Check(obj))
+        and not isinstance(obj, type)
         # we do not count strings/unicode/bytes as list-like
         and not isinstance(obj, (str, bytes))
         # exclude zero-dimensional numpy arrays, effectively scalars

but this causes an infinite loop when testing with 1 core and node down for multicore for test_pprint_pathological_object

All other tests pass.

simonjayhawkins commented 2 years ago

good to fix but not needed for 1.3.x

moving off milestone

simonjayhawkins commented 2 years ago

The issue seems to arise from #41592.

first bad commit: [cec2f5fa639f707b9e8a1ad47b0e0f276c0a3679] REF: handle non-list_like cases upfront in sanitize_array (#38563)

In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False.

so it appears that the regression could have been caused by the change from lib.is_scalar(data) and index is not None and dtype is not None: to elif not is_list_like(data): causing the change from handled by _try_cast to handled by construct_1d_arraylike_from_scalar as identified. (i've not yet confirmed this)

also tbc is_list_like(my_sequence) returned False for the MRE from https://github.com/pandas-dev/pandas/issues/42549#issuecomment-984620539 in 1.2.5 also.

```python import pandas as pd print(pd.__version__) class MySequence: def __getitem__(self, key): return range(3)[key] def __len__(self): return 3 my_sequence = MySequence() print(pd._libs.lib.is_list_like(my_sequence)) ``` ``` 1.2.5 False ```

good to fix but not needed for 1.3.x

moving off milestone

tbc i've moved off the milestone since IMO the proposed change is not suitable for backport since not only is tested behavior changed (i.e. test_pprint_pathological_object) but also the change is not directly related to the change that caused the regression. The proposed change could, assuming the recursion issue is sorted, be included for 1.4.

If we want to fix this for 1.3.x, we could probably consider re-instating the data = list(data) materialization in sanitize_array from #41592 or reverting the is_list_like back to the is_scalar check, the direct cause of the regression and not make any changes to is_list_like in a patch release.

PRs welcome either way.

bkyryliuk commented 1 year ago

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

MeisterP commented 1 year ago

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

It is still an issue with pandas 2.0.0. It was never fixed.