Closed arturdaraujo closed 1 year ago
Thanks for the try @dandxy89. I don't know what went wrong with your PR
This seems like it could be an issue with your specific distribution/terminal/font - I don't have any problems with the UTF8 formats on Linux.
Can you provide a screenshot and some distro/terminal/font details for your Linux machine, so we can at least see what's happening for you? (We're not going to want to change the default to ASCII there, given that there are no other reports of problems).
If you change the default to UTF8_FULL_CONDENSED the problems will appear on bigger dataframes (more columns). However, it is often only small mismatches. UTF8_FULL_CONDENSED is certainly good enough but the ideal would be to check the system for appropriate formatting. UTF8_FULL as default is just a mess on Linux.
The current default:
The UTF8_FULL_CONDENSED formatting:
---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35)
[GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.2
I just tested it on my Linux Mint 21.1 and it's working fine with utf8 formatting, the problem only occurs on WSL (Windows Subsystem for Linux). Still, I believe that the condensed utf8 formatting should be the default.
Personally, I also prefer the look of the condensed formatting. I am not familiar with the intricacies of the various operating systems, but to me it just looks better.
Will update my branch this evening once I'm back from the bouldering gym.
Why did I close it? I realised that the Python test suite wasn't running locally.
@stinodego is not only the looks, it is around 40% less lines and it makes it Polars DataFrame much easier to work with. Pandas, which is the "default" library for DataFrame, is even more minimal.
It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.
It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.
Do you think we can automate such a change? This would be beneficial many times down the road probably.
I have no strong opinions on the format, so a tight format is fine by me. Only on changing the docstrings.
The line-contraction is definitely a font/kerning issue, rather than a polars issue (@arturdaraujo: you could try switching your font under WSL to something like JetBrains Mono, or one of the Menlo Powerline fonts?)
As for a change in default format, I'm definitely biased towards the UTF8_FULL_CONDENSED
, not least because I introduced it to comfy-table for use by polars in the first place, heh... It's a big win space-wise, and I don't think readability suffers.
@ritchie46 / @stinodego: if you want to go ahead, I'll volunteer for updating this and the associated mass docstrings update; how about in conjunction with https://github.com/pola-rs/polars/issues/5513, which would also require a global docstring update? (It's a different issue, but could kill two birds with one stone).
Do you think we can automate such a change? This would be beneficial many times down the road probably.
I doubt it would be worth it. The docstrings right now are all flat text, you'd have to find some way to auto-generate the example output. I think regex magic is probably a way easier method of going about this.
if you want to go ahead, I'll volunteer for updating this
Let's do it! Regarding #5513, I'll leave a comment in that issue. I'm not as certain about that one.
If it's for the python files - the ast module can be used to isolate the docstrings which can be helpful for performing modifications.
import ast
import re
def remove_row_sep_from_docstring(
text,
col_sep = "┼",
dash = "╌",
df_start = "┌",
df_end = "┘",
row_start = "├",
row_end = "┤",
node_types = (ast.ClassDef, ast.FunctionDef)
):
examples_re = r"(?s)\s*Examples\n\s*-+.+"
dataframe_re = (
rf"(?s)\s*shape: [(]\d+, \d+[)]\s*"
rf"{df_start}.+?{df_end}"
rf"(?=]?\n)"
)
row_sep_re = (
rf"\n\s*{row_start}{dash}+"
rf"(?:{col_sep}{dash}+)*{row_end}"
rf"(?=\n)"
)
tree = ast.parse(text)
for node in ast.walk(tree):
if isinstance(node, node_types):
doc = ast.get_docstring(node, clean=False) or ""
has_examples = re.search(examples_re, doc)
if has_examples:
old_examples = has_examples.group()
new_examples = old_examples
for df in re.findall(dataframe_re, old_examples):
new_examples = new_examples.replace(
df,
re.sub(row_sep_re, "", df)
)
doc = doc.replace(old_examples, new_examples)
# https://github.com/python/cpython/blob/3.11/Lib/ast.py#L294
node.body[0].value.value = doc
return ast.unparse(tree)
The result could then be passed to black
to fix any lost formatting.
Does unparse do what I think it does?! 👀 That would be amazing.
Just submitted a corresponding ASCII_FULL_CONDENSED
to comfy-table; bit of an oversight on my part not doing that when I added UTF8_FULL_CONDENSED
... (Don't think we'll need to wait for it though; should be possible to declare the preset locally, then swap it out later).
FYI, I have looked into alternatives before to auto-update the docstring output, and could only find https://github.com/max-sixty/pytest-accept, but does not seem maintained, and relies on pytest, which does not pick up our custom output checker (which we need for IGNORE_RESULT
. We have been through this before, and at some point turned it off, only to find later that we had very quickly broken some examples. So yes it is a pain, but having working examples in docstrings seems key to me.
The suggestion by @cmdlineluser to use the ast module is a good one. Note that Python's built-in doctest does not use the ast module, although a popular alternative xdoctest (insofar doctests are popular) does. Not sure if that has any consequences. I have been tempted before to wrap what we have in our script as a separate package, maybe it becomes worth it if we have auto-update as well. Although I realize that is far off from the very targeted use case here.
Not sure how black should be part of this, you are only changing the outputs, right?
ast.parse
doesn't retain the original formatting:
print(ast.unparse(ast.parse("""
def func(
foo=1,
bar=2
) -> DataFrame:
...
""")))
Output:
def func(foo=1, bar=2) -> DataFrame:
...
Also, I just realized it doesn't retain comments either - which makes it a non-solution - d'oh.
Looks like Parso or LibCST are the recommended tools that retain formatting/comments.
import parso
import re
def remove_row_sep_from_docstring(
text,
row_start = "├",
row_end = "┤"
):
examples_re = r"(?s)[^\S\n]*Examples\n\s*-+.+"
tree = parso.parse(text)
nodes = []
for cls in tree.iter_classdefs():
nodes.append(cls)
nodes.extend(cls.iter_funcdefs())
nodes.extend(tree.iter_funcdefs())
for node in nodes:
docnode = node.get_doc_node()
if not docnode:
continue
docstring = docnode.value
has_examples = re.search(examples_re, docstring)
if has_examples:
old_examples = has_examples.group()
new_examples = []
for line in old_examples.splitlines():
stripped = line.strip().strip("#").strip()
if stripped.startswith(row_start) and stripped.endswith(row_end):
continue
new_examples.append(line)
docnode.value = docstring.replace(
old_examples,
"\n".join(new_examples)
)
return tree.get_code()
It's a New Year's gift! comfy-table
made a fresh release just for the ASCII_FULL_CONDENSED
preset I submitted last night. Will pull the latest code, brew a large cup of tea, and make the update; with a little luck it should be quite painless...
Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)
Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)
Wow, that's fast!
Done - unit/doctests all passed locally, so let's see how it does through CI ... ;)
Problem description
The dataframe printing format should be densed. It does not make sense to make the full verbose format.
Here is an implementation suggestion using only built-in packages and their respective processing times below:
I also created an issue for differentiating systems OS on https://github.com/pola-rs/polars/issues/5941
The current default prints like this:
But the default should be this: