Open cbrnr opened 2 months ago
Strongly agree that not having quotes is visually more appealing.
I agree that it's visually more appealing BUT I'm so used to them now that I find them nice. For example, when I want to filter by some value, copy/paste includes the quotes, which is a slight convenience.
As an aside, if you want to make it look nice for consumption then try great_tables.
That's a great package! However, this issue is not about creating publication-quality tables, but providing a consistent¹ and visually appealing² default HTML representation. Sure it can be convenient in some cases, but I'd think that these are very likely less frequently encountered than the reasons that speak for removing them.
¹consistent not only with the normal Polars repr, but also with other packages like pandas, DataFrames.jl, Tibble, etc.
²visually appealing also implies making it easy for users to extract relevant information from the output, and quotes around every string item certainly make it harder to do that IMO
PS: Maybe a compromise here could be to include quotes for a Series, but to not have them in a DataFrame?
I don't have a strong opinion on this, but I just want to mention that quotation marks are useful to distinguish strings that contain only whitespace but in different amounts. For example, if you remove quotation marks, can you still distinguish "" and " " in the HTML repr?
For reference, the tidyverse print methods in R only add quotation marks when at least one string is empty or contains only whitespace:
tibble::tibble(a = c("a", "b", "c"))
#> # A tibble: 3 × 1
#> a
#> <chr>
#> 1 a
#> 2 b
#> 3 c
tibble::tibble(a = c("", " ", "a"))
#> # A tibble: 3 × 1
#> a
#> <chr>
#> 1 ""
#> 2 " "
#> 3 "a"
tibble::tibble(a = c("", "", "a"))
#> # A tibble: 3 × 1
#> a
#> <chr>
#> 1 ""
#> 2 ""
#> 3 "a"
I like the Tidyverse way, I didn't know that! Whatever Polars ends up deciding, I think it's also important to have a consistent repr and html_repr, which is currently not the case.
Note that by "empty" strings the Tidyverse really means "whitespace-only" strings, so this also adds quotes:
> tibble::tibble(a = c(" ", "b", "c"))
#> # A tibble: 3 × 1
#> a
#> <chr>
#> 1 " "
#> 2 "b"
#> 3 "c"
Furthermore, the quotes are printed in gray (and not in black) so that they don't distract that much, which I also like!
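To make the rule discussed above concrete, here is a minimal Python sketch of it as described in this thread: quote every item in a column only if at least one item is empty or whitespace-only. The function name `format_str_column` is hypothetical; this is neither tibble's actual implementation nor anything that exists in Polars.

```python
def format_str_column(values):
    """Format a column of strings, adding quotes only when needed.

    Assumed rule (taken from the tibble output shown above): if any item
    is empty or consists solely of whitespace, wrap ALL items in double
    quotes; otherwise show the items bare.
    """
    # "".strip() and "   ".strip() both give "", so this catches both cases.
    needs_quotes = any(v.strip() == "" for v in values)
    if needs_quotes:
        return [f'"{v}"' for v in values]
    return list(values)


print(format_str_column(["a", "b", "c"]))  # no quotes needed
print(format_str_column(["", " ", "a"]))   # every item gets quoted
```

The decision is made per column rather than per item, which matches the tibble examples where "a" is also quoted as soon as one whitespace-only item is present.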
I indeed like the quoted style since it's easier to tell it's a string. Otherwise it's harder to distinguish "1" vs. 1. Check this example:
But taking a step back, I do think it's important to make the style consistent across str and html formats, either one way (with quote) or the other (without quote).
If quotes are used, I think that single quotes look a bit better when reading strings. This is how python shows strings:
>>> "hello"
'hello'
Yeah, that's just what Python chose as its default display, which many people hate (and which is inconsistent with almost every other major programming language out there). I think Polars uses double quotes pretty much everywhere, and that's what Polars chose. I think it's the right choice. PyArrow also uses double quotes, and so do the default settings of popular code formatters (ruff, black). But taking a step back, I still want to emphasize consistency: if Polars chooses single quotes, that's fine, but then it needs to be single quotes everywhere (displays, Python manuals, documentation, user guides, etc.). Mixing quote styles is a bad habit imo.
Sorry, I'm not talking about using double quotes in code as a string--there I prefer double quotes. But when displaying output, I find double quotes to be a little bit "noisy" when there are a lot of them. But that's just me and I don't have much vested interest one way or another.
I think displaying items without quotes is much more readable, consistent with the regular repr, and consistent with other dataframe implementations (pandas, DataFrames.jl, data.frame, Tibble). Disambiguating a numeric from a str column is still pretty easy because the data type of each column is shown. And if that's not enough, why not go the Tibble route and only show the quotes if there's at least one whitespace-only string item in the column?
I believe the issue is that it's hard to tell if a string has whitespace at the edges. For example, abc with and without surrounding whitespace will both look the same.
One option is as you suggest there, but perhaps only use quotes if the string has whitespace at the edges.
Another option is to use · to render whitespace, which is fairly common, and which I think may work best, as in:
| A | B |
|---|---|
| hi·there | ··space·on·left |
| nope | test |
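As a rough illustration of the middle-dot idea, here is a small Python sketch that swaps each space for · before display. The helper name `show_spaces` is hypothetical and not part of Polars; it also only handles plain spaces, not tabs or other whitespace.

```python
MIDDLE_DOT = "\u00b7"  # the "·" character


def show_spaces(value):
    """Return `value` with every space replaced by a visible middle dot."""
    return value.replace(" ", MIDDLE_DOT)


for item in ["hi there", "  space on left", "nope", "test"]:
    print(show_spaces(item))
```

One caveat with this approach: a string that already contains a literal · becomes indistinguishable from one that contains a space at that position, so a real implementation would probably need to style the marker differently (for example in gray).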
Yes, why not, then quotes are never needed I guess? I'd still follow what other packages have already implemented, the Tibble solution is both unintrusive and seems to work well.
> Disambiguating a numeric from a str column is still pretty easy because the data type of each column is shown.
By the same logic I could say "we should remove the .0 from float output since the type already says it's a float, not an int, and the extra .0 is just noise". This is not a good argument: the type in the header is not "local" information, and you shouldn't expect users to move their eyes ten lines up to find the type in the header. It's the same reason the "public:/private:" specifiers in C++ are so bad: they are not local enough, and the reader may have to scan tens or hundreds of lines to find the specifier that determines the visibility of an entity. In contrast, Rust is much better, since you can tell whether a function is public simply by the presence of the pub keyword before its declaration. That's a bit of a digression, but the same argument applies here: if I can tell from the quotes that it's a string and not an int, why force the user to look at the header, which could be many lines above? A string is a string, an int is an int, a float is a float. Let the output be self-documenting. And that's before considering all the spacing issues mentioned above.
But again, taking a step back, I think it's crucial to be consistent, which is way more important than my argument above. One way or the other, we need to pick one. The quoting inconsistency across str and html formats does need a fix.
Yes, and the fix should be no quotes IMO.
Whatever the decision (quotes or no quotes), I'd appreciate it if it was consistent between HTML and string repr :)
Could this be a configuration setting that you can toggle with pl.Config, which already controls other aspects of how a dataframe is represented in Jupyter?
I think there's a case to keep them inconsistent between str and html. The str representation is monospaced, so having 2 quotes takes up a lot more space on the screen than in the html version, where having quotes is not so cumbersome. Just to flag this: if the str representation gets changed, then the pl.from_repr function needs to be updated.
This is a really weak argument to defend the inconsistency.
I agree that pl.from_repr also needs a change. My suggestion is that the Series/DataFrame str representations should emulate what the HTML representations do, where a Series is just treated as a 1-column DataFrame with minimal modifications (the shape tuple) (code pointers here and here). I mentioned this in another issue related to this one. Overall I think it's much easier if we can just unify the representation styles and the Series/DataFrame handling, which should help in terms of consistency and beauty, and also make the code easier to maintain.
I'm certainly biased towards the status quo, I won't deny that. If you erased my memory then I can't imagine defending inconsistency. While I acknowledge that, I like the quotes in vsc and I like being able to do print and get the monospacing output without quotes. If you told me I had to choose both with quotes or both without I don't know which I'd prefer.
Description
The normal pl.DataFrame repr shows items in a str column without quotes, which I think is a good idea because (1) it takes up less space, (2) the column type is always shown anyway, and (3) it is consistent with how other packages visualize string items.
However, the HTML representation in a Jupyter notebook wraps every str item in double quotes:
This is inconsistent with the normal repr and arguably worse for the reasons mentioned above. Therefore, I suggest to not wrap string items in quotes in the HTML repr as well.