pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.77k stars 17.97k forks source link

BUG: _repr_*_ methods should not iterate over Sequence when dtype=object #44799

Open randolf-scholz opened 2 years ago

randolf-scholz commented 2 years ago

Reproducible Example

import pandas as pd
d = {0: range(3), 1: range(100)}
s = pd.Series(d, dtype=object)
print(s)

Gives

0                                            (0, 1, 2)
1    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
dtype: object

Instead of

0      range(0, 3)
1    range(0, 100)
dtype: object

Issue Description

When displaying a Series/DataFrame of dtype=object whose content conform to the collections.abc.Sequence protocol, pandas tries to iterate over these objects.

To see why this is generally a bad idea, consider the following example:

import pandas as pd
from time import sleep

class MyIterator:
    def __len__(self):
        return 100

    def __getitem__(self, key):
        print(f"computing {key=} ...")
        sleep(3)

d = {0: MyIterator(), 1: range(100)}
s = pd.Series(d, dtype=object)
print(s)

Where would this occur in practice? Well in my case I tried to store some torch.utils.data.DataLoader objects in a Series in order to leverage the powerful pandas Multi-Indexing over hierarchical cross-validation splits. In this case, printing the Series in a Jupyter Notebook would take 5+ minutes, whereas instantiating it was practically instantaneous. This is especially problematic when using Jupyter with %config InteractiveShell.ast_node_interactivity='last_expr_or_assign' mode.

Expected Behavior

When dtype=object then pandas should use the repr method of the object in order to get a string representation and not try to do something fancy. Possibly one can make some exception / special cases for python bulit-ins such as tuple and list. (I presume the current behaviour is the way it is to deal with these two when they hold lots of items)

Installed Versions

``` INSTALLED VERSIONS ------------------ commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.9.7.final.0 python-bits : 64 OS : Linux OS-release : 5.11.0-41-generic Version : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.3.4 numpy : 1.21.4 pytz : 2021.3 dateutil : 2.8.2 pip : 21.3.1 setuptools : 59.4.0 Cython : None pytest : 6.2.5 hypothesis : None sphinx : 4.3.1 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 7.30.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : None fsspec : 2021.11.1 fastparquet : 0.7.2 gcsfs : None matplotlib : 3.5.0 numexpr : 2.7.3 odfpy : None openpyxl : None pandas_gbq : None pyarrow : 6.0.1 pyxlsb : None s3fs : None scipy : 1.7.3 sqlalchemy : 1.4.27 tables : 3.6.1 tabulate : 0.8.9 xarray : 0.20.1 xlrd : None xlwt : None numba : 0.53.1 ```
attack68 commented 2 years ago

Just noting that for display in Jupyter Notebook there are differences between DataFrame.to_html, which uses the DataFrameFormatter, and Styler.to_html, which uses the native str method for objects that it renders via a jinja2 template, i.e. a difference between range objects and numpy arrays.

Eg.

DataFrame

Screen Shot 2021-12-07 at 09 43 34

Styler

Screen Shot 2021-12-07 at 09 44 19

mroeschke commented 2 years ago

I think changing the repr behavior with Sequences would be a large breaking change since this has been the behavior for a while. Moreover, generally storing Sequence-like objects in pandas is an anti-pattern and discouraged.

randolf-scholz commented 2 years ago

I think changing the repr behavior with Sequences would be a large breaking change since this has been the behavior for a while.

Well, the current behaviour makes pandas quite unusable in certain circumstances as I pointed out. What is better: Changing the way a pretty printing routine works, or have it not work at all?

Moreover, generally storing Sequence-like objects in pandas is an anti-pattern and discouraged.

I very strongly disagree. Why is it an anti-pattern? Shouldn't it be a design goal to have proper support for dtype=object, even if the objects happen to be sequence-like?

jreback commented 2 years ago

I very strongly disagree. Why is it an anti-pattern? Shouldn't it be a design goal to have proper support for dtype=object, even if the objects happen to be sequence-like?

this was never a design goal. could it / should it. maybe, with enough community support. proper column typing is the key here, e.g. a List or Dict or JSON extension types could go a long way here.

randolf-scholz commented 2 years ago

I identified the following if-else code from pandas.io.formats.printing.pprint_thing as being responsible for the current behaviour:

if hasattr(thing, "__next__"):
    return str(thing)
elif isinstance(thing, dict) and _nest_lvl < get_option(
    "display.pprint_nest_depth"
):
    result = _pprint_dict(
        thing, _nest_lvl, quote_strings=True, max_seq_items=max_seq_items
    )
elif is_sequence(thing) and _nest_lvl < get_option("display.pprint_nest_depth"):
    result = _pprint_seq(
        thing,
        _nest_lvl,
        escape_chars=escape_chars,
        quote_strings=quote_strings,
        max_seq_items=max_seq_items,
    )
elif isinstance(thing, str) and quote_strings:
    result = f"'{as_escaped_string(thing)}'"
else:
    result = as_escaped_string(thing)
Traced from - [`pandas.io.formats.format.GenericArrayFormatter`](https://github.com/pandas-dev/pandas/blob/39b7a6d5614dc2a3e0ce9d799fa1dc98c1e3cfd5/pandas/io/formats/format.py#L1321) - [`pandas.io.formats.format.SeriesFormatter`](https://github.com/pandas-dev/pandas/blob/39b7a6d5614dc2a3e0ce9d799fa1dc98c1e3cfd5/pandas/io/formats/format.py#L263) - [`pandas.core.series.Series.__repr__`](https://github.com/pandas-dev/pandas/blob/39b7a6d5614dc2a3e0ce9d799fa1dc98c1e3cfd5/pandas/core/series.py#L1514)

and then pandas.io.formats.printing._pprint_seq takes care of the sequence formatting. A couple of observations:

As we can see from the function body of [`pandas.io.formats.printing._pprint_seq`](https://github.com/pandas-dev/pandas/blob/ad190575aa75962d2d0eade2de81a5fe5a2e285b/pandas/io/formats/printing.py#L98), there is quite some arbitrary formatting going on. ```python def _pprint_seq( seq: Sequence, _nest_lvl: int = 0, max_seq_items: int | None = None, **kwds ) -> str: """ internal. pprinter for iterables. you should probably use pprint_thing() rather than calling this directly. bounds length of printed sequence, depending on options """ if isinstance(seq, set): fmt = "{{{body}}}" else: fmt = "[{body}]" if hasattr(seq, "__setitem__") else "({body})" if max_seq_items is False: nitems = len(seq) else: nitems = max_seq_items or get_option("max_seq_items") or len(seq) s = iter(seq) # handle sets, no slicing r = [ pprint_thing(next(s), _nest_lvl + 1, max_seq_items=max_seq_items, **kwds) for i in range(min(nitems, len(seq))) ] body = ", ".join(r) if nitems < len(seq): body += ", ..." elif isinstance(seq, tuple) and len(seq) == 1: body += "," return fmt.format(body=body) ```

My proposal would be to do replace this if-else check with the following:

if is_builtin(thing):
    # possibly special formatting for builtin / otherwise known objects
   result = format_builtin(thing)
elif hasattr(thing, "__repr__"):
   # Do not reinvent the wheel, use the formatting provided by the object!
   body = repr(thing)
   # strip `\n`, limit to max_characters and escape string
   result =  apply_formatting(body)
else:
    # some fallback option
    result = as_escaped_string(thing)

Additionally, one could think about adding special treatment for large objects, e.g. if len(thing) is much larger than max_characters, there seems to be little point for iterating over the whole thing.

randolf-scholz commented 2 years ago

Also here are a couple more reasons why calling __iter__ on custom data types (or even builtin data types) stored in the Series/DataFrame might be a bad idea: