Open randolf-scholz opened 2 years ago
Just noting that for display in Jupyter Notebook there are differences between DataFrame.to_html, which uses the DataFrameFormatter, and Styler.to_html, which uses the object's native str method when rendering via a jinja2 template; i.e., range objects and numpy arrays end up displayed differently.
E.g.:
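The discrepancy can be sketched without pandas; a minimal illustration for a range object, assuming the Styler path falls back on str() while the formatter path iterates and rebuilds a tuple-style string:

```python
# Two rendering paths for the same object (assumed behaviour, pandas calls omitted):
# - Styler.to_html uses the object's own str()
# - DataFrameFormatter's pprint_thing iterates anything sequence-like
r = range(3)

styler_like = str(r)                                     # "range(0, 3)"
formatter_like = "(%s)" % ", ".join(str(x) for x in r)   # "(0, 1, 2)"

print(styler_like)
print(formatter_like)
```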
I think changing the repr behavior with Sequences would be a large breaking change since this has been the behavior for a while. Moreover, generally storing Sequence-like objects in pandas is an anti-pattern and discouraged.
> I think changing the repr behavior with Sequences would be a large breaking change since this has been the behavior for a while.
Well, the current behaviour makes pandas quite unusable in certain circumstances as I pointed out. What is better: Changing the way a pretty printing routine works, or have it not work at all?
> Moreover, generally storing Sequence-like objects in pandas is an anti-pattern and discouraged.
I very strongly disagree. Why is it an anti-pattern? Shouldn't it be a design goal to have proper support for dtype=object, even if the objects happen to be sequence-like?
> I very strongly disagree. Why is it an anti-pattern? Shouldn't it be a design goal to have proper support for dtype=object, even if the objects happen to be sequence-like?
this was never a design goal. could it / should it? maybe, with enough community support. proper column typing is the key here, e.g. List or Dict or JSON extension types could go a long way here.
I identified the following if-else code from `pandas.io.formats.printing.pprint_thing` as being responsible for the current behaviour:
```python
if hasattr(thing, "__next__"):
    return str(thing)
elif isinstance(thing, dict) and _nest_lvl < get_option(
    "display.pprint_nest_depth"
):
    result = _pprint_dict(
        thing, _nest_lvl, quote_strings=True, max_seq_items=max_seq_items
    )
elif is_sequence(thing) and _nest_lvl < get_option("display.pprint_nest_depth"):
    result = _pprint_seq(
        thing,
        _nest_lvl,
        escape_chars=escape_chars,
        quote_strings=quote_strings,
        max_seq_items=max_seq_items,
    )
elif isinstance(thing, str) and quote_strings:
    result = f"'{as_escaped_string(thing)}'"
else:
    result = as_escaped_string(thing)
```
and then `pandas.io.formats.printing._pprint_seq` takes care of the sequence formatting. A couple of observations:
- `if hasattr(thing, "__next__"): return str(thing)` looks suspicious and should probably be replaced with `isinstance(thing, collections.abc.Iterator)`.
- `isinstance(thing, dict)` should probably be replaced by `isinstance(thing, collections.abc.Mapping)`.
- `is_sequence(thing)` should probably be replaced by `isinstance(thing, collections.abc.Iterable) and isinstance(thing, collections.abc.Sized)`; `pandas.core.dtypes.inference.is_sequence` just tests for `__iter__` and `__len__`.
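The difference between the attribute checks and the ABC checks can be demonstrated with two small examples (the `HasNextOnly` class is a hypothetical illustration, not pandas code):

```python
import collections.abc
import types

class HasNextOnly:
    # defines __next__ but not __iter__, so the hasattr check fires
    # even though it is not an Iterator per the ABC protocol
    def __next__(self):
        return 0

obj = HasNextOnly()
print(hasattr(obj, "__next__"))                    # True
print(isinstance(obj, collections.abc.Iterator))   # False

gen = (x for x in range(3))                        # a real iterator passes both
print(isinstance(gen, collections.abc.Iterator))   # True

# likewise, a Mapping that is not a dict fails isinstance(thing, dict)
# but passes the Mapping ABC check
proxy = types.MappingProxyType({"a": 1})
print(isinstance(proxy, dict))                     # False
print(isinstance(proxy, collections.abc.Mapping))  # True
```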
My proposal would be to replace this if-else check with the following:
```python
if is_builtin(thing):
    # possibly special formatting for builtin / otherwise known objects
    result = format_builtin(thing)
elif hasattr(thing, "__repr__"):
    # Do not reinvent the wheel, use the formatting provided by the object!
    body = repr(thing)
    # strip `\n`, limit to max_characters and escape string
    result = apply_formatting(body)
else:
    # some fallback option
    result = as_escaped_string(thing)
```
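A runnable sketch of this proposal, with `is_builtin`, `format_builtin`, and `apply_formatting` stubbed out as hypothetical helpers (these names come from the pseudocode above; they are not real pandas functions):

```python
def is_builtin(thing):
    # hypothetical helper: only containers pandas would format specially
    return isinstance(thing, (list, tuple, set, dict, str, bytes))

def format_builtin(thing, max_seq_items=3):
    # hypothetical helper: truncated formatting for known containers
    if isinstance(thing, (list, tuple, set)):
        items = list(thing)[:max_seq_items]
        body = ", ".join(repr(x) for x in items)
        if len(thing) > max_seq_items:
            body += ", ..."
        return f"[{body}]" if isinstance(thing, list) else f"({body})"
    return repr(thing)

def apply_formatting(body, max_characters=80):
    # strip newlines and limit to max_characters
    body = body.replace("\n", " ")
    return body if len(body) <= max_characters else body[: max_characters - 3] + "..."

def pprint_proposed(thing):
    if is_builtin(thing):
        return format_builtin(thing)
    # do not reinvent the wheel: use the object's own repr
    return apply_formatting(repr(thing))

print(pprint_proposed(list(range(10))))  # [0, 1, 2, ...]
print(pprint_proposed(range(10**9)))     # range(0, 1000000000) -- no iteration
```

Note that every Python object has `__repr__`, so the `hasattr` branch in the pseudocode is always true and the final fallback collapses into the repr path in this sketch.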
Additionally, one could think about adding special treatment for large objects: e.g., if `len(thing)` is much larger than `max_characters`, there seems to be little point in iterating over the whole thing.
Also, here are a couple more reasons why calling `__iter__` on custom data types (or even builtin data types) stored in the Series/DataFrame might be a bad idea:

- `__iter__` might actually change the state of the object. That's a pretty bad one! This could happen for instance if the object behaves like a generator without implementing `__next__`. For example, when storing a bunch of random samplers in a Series, printing the Series would change the state of the PRNG. This cannot be desired behavior!
- If `len(object)` is much larger than `max_characters`, then iterating over the whole object is very wasteful. One could instead do an `isinstance(thing, collections.abc.Sequence)` check and then only use `__getitem__` on the first few items.
- `__iter__` might actually never terminate, even if the object has a `__len__`, for example if `__iter__` cycles over the data.
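The state-mutation hazard can be demonstrated with a toy sampler whose `__iter__` draws from a PRNG (hypothetical `Sampler` class, standing in for the random samplers mentioned above):

```python
import random

class Sampler:
    """Toy sampler: each pass over __iter__ advances the underlying PRNG."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def __len__(self):
        return 3
    def __iter__(self):
        for _ in range(len(self)):
            yield self.rng.random()

s = Sampler()
state_before = s.rng.getstate()

repr(s)                                    # repr does not touch the PRNG
assert s.rng.getstate() == state_before

list(s)                                    # but iterating (as pprint_thing would) does
assert s.rng.getstate() != state_before
```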
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the master branch of pandas.
### Reproducible Example
Gives
Instead of
### Issue Description

When displaying a Series/DataFrame of `dtype=object` whose contents conform to the `collections.abc.Sequence` protocol, pandas tries to iterate over these objects. To see why this is generally a bad idea, consider the following example:
Where would this occur in practice? Well, in my case I tried to store some `torch.utils.data.DataLoader` objects in a Series in order to leverage the powerful pandas multi-indexing over hierarchical cross-validation splits. In this case, printing the Series in a Jupyter Notebook would take 5+ minutes, whereas instantiating it was practically instantaneous. This is especially problematic when using Jupyter with `%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'` mode.

### Expected Behavior
When `dtype=object`, pandas should use the `repr` method of the object to get a string representation and not try to do something fancy. Possibly one can make some exceptions / special cases for Python built-ins such as `tuple` and `list`. (I presume the current behaviour is the way it is to deal with these two when they hold lots of items.)

### Installed Versions