ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

possible bug: Table.__repr__ sometimes produces non-ASCII characters #10516

Open edschofield opened 1 day ago

edschofield commented 1 day ago

What happened?

In interactive mode, the __repr__ for a Table differs depending on the run-time environment. The __repr__ includes Unicode characters and ANSI escape codes for colours when repr(table) is called from a Jupyter notebook but not when called from the Python or IPython interpreter.

For example, this code:

import ibis
ibis.options.interactive = True

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins = ibis.read_csv(url)
print(len(repr(penguins)))

outputs 3262 when run from Jupyter but 1206 when run from IPython or as a regular Python script.
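One way to see what accounts for the difference in length is to count the ANSI escape sequences and non-ASCII characters in each string. The following is a minimal sketch using only the standard library; `summarize_repr` is a hypothetical helper name, and the sample string is made up for illustration rather than taken from actual Ibis output.

```python
import re

# Matches CSI escape sequences such as "\x1b[1m" or "\x1b[0m" (colours, bold, ...).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def summarize_repr(text: str) -> dict:
    """Count the ANSI escape sequences and non-ASCII characters in a string."""
    escapes = ANSI_CSI.findall(text)
    stripped = ANSI_CSI.sub("", text)
    return {
        "length": len(text),
        "ansi_sequences": len(escapes),
        "ansi_chars": sum(len(e) for e in escapes),
        "non_ascii_chars": sum(1 for ch in stripped if ord(ch) > 127),
        "is_ascii": text.isascii(),
    }


# A made-up sample, not actual Ibis output.
sample = "\x1b[1mspecies\x1b[0m ┃ island\nAdelie  │ Torgersen"
print(summarize_repr(sample))
```

Calling `summarize_repr(repr(penguins))` in each environment should show how much of the extra length comes from escape codes and how much from box-drawing characters.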

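I find both of these facts surprising: that the __repr__ output differs depending on the run-time environment, and that it can contain non-ASCII characters and ANSI escape codes.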

I expect that both are likely to cause problems in various workflows in ways that are hard to anticipate. But one simple example is when rendering notebooks via LaTeX. If the following cell appears in a notebook called ibis_str.ipynb along with its output:

import ibis
ibis.options.interactive = True

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins = ibis.read_csv(url)
print(penguins)

then converting it to PDF as follows fails with a LaTeX error due to the use of Unicode characters:

jupyter nbconvert --to latex ibis_str.ipynb
pdflatex ibis_str.tex

Another very surprising effect of the different code paths taken for __repr__ depending on the run-time environment is that this code:

import ibis
import polars as pl
ibis.options.interactive = True

url = "https://raw.githubusercontent.com/PythonCharmers/PythonCharmersData/refs/heads/master/palmerpenguins.csv"
penguins_pl = pl.read_csv(url)

penguins = ibis.memtable(penguins_pl)
output = repr(penguins)

currently fails with a very different exception when run in Python / IPython:

ParserException: Parser Error: zero-length delimited identifier at or near """"

versus the ValueError raised in a Jupyter notebook (as reported in issue #10514):

ValueError: Target schema's field names are not matching the table's field names: ...

I believe the standard approach would be for the Table class to have a single code path for __repr__ that produces the same ASCII string regardless of the run-time environment, and to define IPython-compatible methods such as _repr_pretty_, _repr_html_, and _repr_latex_ for richer output in IPython and Jupyter.
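As a rough illustration of that layout, here is a minimal sketch. `TinyTable` and `_to_ascii` are hypothetical names invented for this example and are not part of Ibis; the real Table class would need to execute the expression first. The `_repr_pretty_(p, cycle)` and `_repr_html_()` signatures are the ones IPython looks for. A `_repr_latex_` method would follow the same pattern and is omitted here for brevity.

```python
from html import escape


def _to_ascii(value) -> str:
    """Render a value as text, backslash-escaping any non-ASCII characters."""
    return str(value).encode("ascii", "backslashreplace").decode("ascii")


class TinyTable:
    """Hypothetical stand-in for a table expression; not Ibis's Table class."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def __repr__(self):
        # Single code path: plain ASCII, no escape codes, same in every environment.
        header = " | ".join(_to_ascii(c) for c in self.columns)
        body = [" | ".join(_to_ascii(v) for v in row) for row in self.rows]
        return "\n".join([header, "-" * len(header), *body])

    def _repr_pretty_(self, p, cycle):
        # Called by IPython's pretty printer in the terminal.
        # Richer terminal output could go here; this sketch reuses __repr__.
        p.text("TinyTable(...)" if cycle else repr(self))

    def _repr_html_(self):
        # Called by Jupyter for rich display; Unicode is fine in HTML.
        head = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        rows = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{rows}</table>"


table = TinyTable(["species", "island"], [["Adélie", "Torgersen"]])
print(repr(table))
```

With this split, `repr(table)` stays safe for logs and LaTeX pipelines, while notebooks still get a formatted table through `_repr_html_`.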

If you agree that this would be an improvement, I can volunteer a PR as my first contribution to the project.

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

No response

Relevant log output

No response


gforsyth commented 13 hours ago

Hey @edschofield -- thanks for the thorough report!

I haven't delved into the repr lately. It would probably be worth first clarifying which of these behaviors are coming from rich, which we use for table formatting, and which are coming from Ibis or from Ibis' (possibly odd) use of rich.

Also, I don't view the use of Unicode as a bug. Even if our display skeleton didn't have any Unicode characters, a great number of our backends will emit them in result sets.