pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes

arturdaraujo commented 1 year ago

Problem description

The dataframe printing format should be densed. It does not make sense to make the full verbose format.

Here is an implementation suggestion using only built-in packages and their respective processing times below:

if sys.platform.startswith("linux"):  # could be "linux", "linux2", "linux3", ...
    pl.Config.set_tbl_formatting("ASCII_BORDERS_ONLY_CONDENSED")
elif sys.platform == "darwin":
     pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")
elif os.name == "not":
    pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")
    # Windows, Cygwin, etc. (either 32-bit or 64-bit)

%timeit os.name
32 ns ± 0.285 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit platform.system()
124 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit sys.platform
33.1 ns ± 0.542 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I also created an issue for differentiating systems OS on https://github.com/pola-rs/polars/issues/5941

The current default prints like this:

import polars as pl
from datetime import datetime

df_stock = pl.DataFrame(
    {
        "date_time": pl.date_range(
            low=datetime(2000, 1, 1, 0, 0),
            high=datetime(2023, 1, 1, 0, 0),
            interval="1d",
        ).shuffle(seed=0)
    }
)

shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2000-11-24 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2005-08-09 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2009-09-04 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-06-30 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-08-18 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-02-06 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-04-16 00:00:00 │
└─────────────────────┘

But the default should be this:

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")

print(df_stock)
shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
│ 2000-11-24 00:00:00 │
│ 2005-08-09 00:00:00 │
│ 2009-09-04 00:00:00 │
│ ...                 │
│ 2010-06-30 00:00:00 │
│ 2010-08-18 00:00:00 │
│ 2010-02-06 00:00:00 │
│ 2010-04-16 00:00:00 │
└─────────────────────┘

arturdaraujo commented 1 year ago

Thanks for the try @dandxy89. I don't know what went wrong with your PR

alexander-beedie commented 1 year ago

This seems like it could be an issue with your specific distribution/terminal/font - I don't have any problems with the UTF8 formats on Linux.

Can you provide a screenshot and some distro/terminal/font details for your Linux machine, so we can at least see what's happening for you? (We're not going to want to change the default to ASCII there, given that there are no other reports of problems).

arturdaraujo commented 1 year ago

If you change the default to UTF8_FULL_CONDENSED the problems will appear on bigger dataframes (more columns). However, it is often only small mismatches. UTF8_FULL_CONDENSED is certainly good enough but the ideal would be to check the system for appropriate formatting. UTF8_FULL as default is just a mess on Linux.

The current default:

The UTF8_FULL_CONDENSED formatting:

---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) 
[GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.2

arturdaraujo commented 1 year ago

I just tested it on my Linux Mint 21.1 and it's working fine with utf8 formatting, the problem only occurs on WSL (Windows Subsystem for Linux). Still, I believe that the condensed utf8 formatting should be the default.

stinodego commented 1 year ago

Personally, I also prefer the look of the condensed formatting. I am not familiar with the intricacies of the various operating systems, but to me it just looks better.

dandxy89 commented 1 year ago

Will update my branch this evening once I'm back from the bouldering gym.

Why did I close it? I realised that the Python test suite wasn't running locally.

arturdaraujo commented 1 year ago

@stinodego is not only the looks, it is around 40% less lines and it makes it Polars DataFrame much easier to work with. Pandas, which is the "default" library for DataFrame, is even more minimal.

stinodego commented 1 year ago

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

ritchie46 commented 1 year ago

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I have no strong opinions on the format, so a tight format is fine by me. Only on changing the docstrings.

alexander-beedie commented 1 year ago

The line-contraction is definitely a font/kerning issue, rather than a polars issue (@arturdaraujo: you could try switching your font under WSL to something like JetBrains Mono, or one of the Menlo Powerline fonts?)

As for a change in default format, I'm definitely biased towards the UTF8_FULL_CONDENSED, not least because I introduced it to comfy-table for use by polars in the first place, heh... It's a big win space-wise, and I don't think readability suffers.

@ritchie46 / @stinodego: if you want to go ahead, I'll volunteer for updating this and the associated mass docstrings update; how about in conjunction with https://github.com/pola-rs/polars/issues/5513, which would also require a global docstring update? (It's a different issue, but could kill two birds with one stone).

regular_expressions

stinodego commented 1 year ago

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I doubt it would be worth it. The docstrings right now are all flat text, you'd have to find some way to auto-generate the example output. I think regex magic is probably a way easier method of going about this.

if you want to go ahead, I'll volunteer for updating this

Let's do it! Regarding #5513, I'll leave a comment in that issue. I'm not as certain about that one.

cmdlineluser commented 1 year ago

If it's for the python files - the ast module can be used to isolate the docstrings which can be helpful for performing modifications.

import ast
import re

def remove_row_sep_from_docstring(
    text,
    col_sep = "┼",
    dash = "╌",
    df_start = "┌",
    df_end = "┘",
    row_start = "├",
    row_end = "┤",
    node_types = (ast.ClassDef, ast.FunctionDef)
):

    examples_re = r"(?s)\s*Examples\n\s*-+.+"
    dataframe_re = (
        rf"(?s)\s*shape: [(]\d+, \d+[)]\s*"
        rf"{df_start}.+?{df_end}"
        rf"(?=]?\n)"
    )
    row_sep_re = (
        rf"\n\s*{row_start}{dash}+"
        rf"(?:{col_sep}{dash}+)*{row_end}"
        rf"(?=\n)"
    )

    tree = ast.parse(text)

    for node in ast.walk(tree):
        if isinstance(node, node_types):
            doc = ast.get_docstring(node, clean=False) or ""
            has_examples = re.search(examples_re, doc)
            if has_examples:
                old_examples = has_examples.group()
                new_examples = old_examples
                for df in re.findall(dataframe_re, old_examples):
                    new_examples = new_examples.replace(
                        df,
                        re.sub(row_sep_re, "", df)
                    )
                doc = doc.replace(old_examples, new_examples)
                # https://github.com/python/cpython/blob/3.11/Lib/ast.py#L294
                node.body[0].value.value = doc

    return ast.unparse(tree)

The result could then be passed to black to fix any lost formatting.

ritchie46 commented 1 year ago

Does unparse do what I think it does?! 👀 That would be amazing.

alexander-beedie commented 1 year ago

Just submitted a corresponding ASCII_FULL_CONDENSED to comfy-table; bit of an oversight on my part not doing that when I added UTF8_FULL_CONDENSED ... (Don't think we'll need to wait for it though; should be possible to declare the preset locally, then swap it out later).

zundertj commented 1 year ago

FYI, I have looked into alternatives before to auto-update the docstring output, and could only find https://github.com/max-sixty/pytest-accept, but does not seem maintained, and relies on pytest, which does not pick up our custom output checker (which we need for IGNORE_RESULT. We have been through this before, and at some point turned it off, only to find later that we had very quickly broken some examples. So yes it is a pain, but having working examples in docstrings seems key to me.

The suggestion by @cmdlineluser to use the ast module is a good one. Note that Python's built-in doctest does not use the ast module, although a popular alternative xdoctest (insofar doctests are popular) does. Not sure if that has any consequences. I have been tempted before to wrap what we have in our script as a separate package, maybe it becomes worth it if we have auto-update as well. Although I realize that is far off from the very targeted use case here.

Not sure how black should be part of this, you are only changing the outputs, right?

cmdlineluser commented 1 year ago

ast.parse doesn't retain the original formatting:

print(ast.unparse(ast.parse("""
def func(
    foo=1,
    bar=2
) -> DataFrame:
    ...
""")))

Output:

def func(foo=1, bar=2) -> DataFrame:
    ...

Also, I just realized it doesn't retain comments either - which makes it a non-solution - d'oh.

cmdlineluser commented 1 year ago

Looks like Parso or LibCST are the recommended tools that retain formatting/comments.

import parso
import re

def remove_row_sep_from_docstring(
    text,
    row_start = "├",
    row_end = "┤"
):
    examples_re = r"(?s)[^\S\n]*Examples\n\s*-+.+"

    tree = parso.parse(text)
    nodes = []
    for cls in tree.iter_classdefs():
        nodes.append(cls)
        nodes.extend(cls.iter_funcdefs())
    nodes.extend(tree.iter_funcdefs())

    for node in nodes:
        docnode = node.get_doc_node()
        if not docnode: 
            continue
        docstring = docnode.value
        has_examples = re.search(examples_re, docstring)
        if has_examples:
            old_examples = has_examples.group()
            new_examples = []
            for line in old_examples.splitlines():
                stripped = line.strip().strip("#").strip()
                if stripped.startswith(row_start) and stripped.endswith(row_end):
                    continue
                new_examples.append(line)
            docnode.value = docstring.replace(
                old_examples,
                "\n".join(new_examples)
            )

    return tree.get_code()

alexander-beedie commented 1 year ago

It's a New Year's gift! comfy-table made a fresh release just for the ASCII_FULL_CONDENSED preset I submitted last night. Will pull the latest code, brew a large cup of tea, and make the update; with a little luck it should be quite painless...

alexander-beedie commented 1 year ago

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

ritchie46 commented 1 year ago

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

Wow, that's fast!

alexander-beedie commented 1 year ago

Done - unit/doctests all passed locally, so let's see how it does through CI ... ;)

pola-rs / polars

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

Problem description