pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Do not enclose str items in quotes in HTML repr #18102

Open cbrnr opened 2 months ago

cbrnr commented 2 months ago

Description

The normal pl.DataFrame repr shows items in a str column without quotes, which I think is a good idea because (1) it takes up less space, (2) the column type is always shown anyway, and (3) it is consistent with how other packages visualize string items.

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["A", "B", "C", "D", "E"],
    "c": [1.1, -2.8, 3.4, 4.7, -5.9],
})

df
shape: (5, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ f64  │
╞═════╪═════╪══════╡
│ 1   ┆ A   ┆ 1.1  │
│ 2   ┆ B   ┆ -2.8 │
│ 3   ┆ C   ┆ 3.4  │
│ 4   ┆ D   ┆ 4.7  │
│ 5   ┆ E   ┆ -5.9 │
└─────┴─────┴──────┘

However, the HTML representation in a Jupyter notebook wraps every str item in double quotes:

[Screenshot: HTML repr in a Jupyter notebook, with every str item wrapped in double quotes]

This is inconsistent with the normal repr and arguably worse for the reasons mentioned above. Therefore, I suggest not wrapping string items in quotes in the HTML repr either.

mcrumiller commented 2 months ago

Strongly agree that not having quotes is visually more appealing.

deanm0000 commented 2 months ago

I agree that it's visually more appealing BUT I'm so used to them now that I find them nice. One example: when I want to filter by some value, copy/paste includes the quotes, which is a slight convenience.

As an aside, if you want to make it look nice for consumption then try great_tables.

cbrnr commented 2 months ago

That's a great package! However, this issue is not about creating publication-quality tables, but about providing a consistent¹ and visually appealing² default HTML representation. Sure, the quotes can be convenient in some cases, but I think those cases come up much less often than the ones that speak for removing them.

¹ consistent not only with the normal Polars repr, but also with other packages like pandas, DataFrames.jl, Tibble, etc.

² visually appealing also implies making it easy for users to extract relevant information from the output, and quotes around every string item certainly make it harder to do that IMO

cbrnr commented 2 months ago

PS: Maybe a compromise here could be to include quotes for a Series, but to not have them in a DataFrame?

etiennebacher commented 2 months ago

I don't have a strong opinion on this, but I just want to mention that quotation marks are useful to distinguish strings that contain only whitespace but differ in length. For example, if you remove quotation marks, can you still distinguish "" and " " in the HTML repr?

For reference, the tidyverse print methods in R only add quotation marks when at least one string is empty or contains only whitespace:

tibble::tibble(a = c("a", "b", "c"))
#> # A tibble: 3 × 1
#>   a    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 c
tibble::tibble(a = c("", "  ", "a"))
#> # A tibble: 3 × 1
#>   a    
#>   <chr>
#> 1 ""   
#> 2 "  " 
#> 3 "a"
tibble::tibble(a = c("", "", "a"))
#> # A tibble: 3 × 1
#>   a    
#>   <chr>
#> 1 ""   
#> 2 ""   
#> 3 "a"

cbrnr commented 2 months ago

I like the Tidyverse way, I didn't know that! Whatever Polars ends up deciding, I think it's also important to have a consistent repr and html_repr, which is currently not the case.

cbrnr commented 2 months ago

Note that by "empty" strings the Tidyverse really means "whitespace-only" strings, so this also adds quotes:

> tibble::tibble(a = c(" ", "b", "c"))
#> # A tibble: 3 × 1
#>   a    
#>   <chr>
#> 1 " "  
#> 2 "b"  
#> 3 "c" 
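
As a rough illustration of the rule as described in this thread (quote the whole column when at least one item is empty or whitespace-only), here is a Python sketch. The function names are made up for the example, and this is not how tibble/pillar or Polars actually implement it:

```python
def needs_quotes(values: list[str]) -> bool:
    """True if any item is empty or consists only of whitespace."""
    return any(v.strip() == "" for v in values)


def format_column(values: list[str]) -> list[str]:
    """Quote every item in the column if at least one item is ambiguous."""
    if needs_quotes(values):
        return [f'"{v}"' for v in values]
    return list(values)


print(format_column(["a", "b", "c"]))   # ['a', 'b', 'c']
print(format_column(["", "  ", "a"]))   # ['""', '"  "', '"a"']
print(format_column([" ", "b", "c"]))   # ['" "', '"b"', '"c"']
```

The decision is made per column rather than per item, which matches the outputs above: once one item needs quotes, every item in that column gets them.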

Furthermore, the quotes are printed in gray (and not in black) so that they don't distract that much, which I also like!

[Screenshot: tibble output with the quotes printed in gray]

liufeimath commented 1 month ago

I indeed like the quoted style since it's easier to tell it's a string. Otherwise it's harder to distinguish "1" vs. 1. Check this example:

[Screenshot: example of the "1" vs. 1 ambiguity]
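
The ambiguity is the same one Python has between str() and repr(). A small sketch, purely as an analogy (this is not how Polars formats cells):

```python
values = [1, "1", 1.0, "1.0"]

for v in values:
    # str() drops the quotes, so 1 and "1" render identically;
    # repr() keeps them, so the type is visible in the value itself.
    print(f"str: {str(v):<4} repr: {v!r}")

ambiguous = str(1) == str("1")     # same text without quotes
distinct = repr(1) != repr("1")    # different text with quotes
```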

But taking a step back, I do think it's important to make the style consistent across the str and HTML formats, either one way (with quotes) or the other (without quotes).

mcrumiller commented 1 month ago

If quotes are used, I think that single quotes look a bit better when reading strings. This is how python shows strings:

>>> "hello"
'hello'

liufeimath commented 1 month ago

> If quotes are used, I think that single quotes look a bit better when reading strings. This is how python shows strings:
>
> >>> "hello"
> 'hello'

Yeah, that's just what Python chose as its default display, which many people hate (and which is inconsistent with almost every other major programming language out there). I think Polars uses double quotes pretty much everywhere, and that's what Polars chose. I think it's the right choice. PyArrow also uses double quotes, and the default settings of popular code formatters (Ruff, Black) go with double quotes too. But taking a step back, I still want to emphasize consistency: if Polars chooses single quotes, that's fine, but then it needs to be single quotes everywhere (displays, Python manuals, documentation, user guides, etc.). Mixing quote styles is a bad habit IMO.

mcrumiller commented 1 month ago

Sorry, I'm not talking about using double quotes in code as a string--there I prefer double quotes. But when displaying output, I find double quotes to be a little bit "noisy" when there are a lot of them. But that's just me and I don't have much vested interest one way or another.

cbrnr commented 1 month ago

I think displaying items without quotes is much more readable, consistent with the regular repr, and consistent with other dataframe implementations (pandas, DataFrames.jl, data.frame, Tibble). Disambiguating a numeric from a str column is still pretty easy because the data type of each column is shown. And if that's not enough, why not go the Tibble route and only show the quotes if there's at least one whitespace-only string item in the column?

mcrumiller commented 1 month ago

I believe the issue is that it's hard to tell if a string has whitespace at the edges. Without quotes, "abc" and "abc " (with a trailing space) will both look the same.

One option is as you suggest there, but perhaps only use quotes if the string has whitespace at the edges.

Another option is to use · to render whitespace, which is fairly common, and which I think may work best, as in:

A          B
hi·there   ··space·on·left
nope       test

cbrnr commented 1 month ago

Yes, why not, then quotes are never needed I guess? I'd still follow what other packages have already implemented; the Tibble solution is both unintrusive and seems to work well.
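
To make the two whitespace options above concrete, here is a small Python sketch: one helper renders spaces as ·, the other adds quotes only when a string has whitespace at its edges. Both helper names are hypothetical, and neither is an existing Polars feature:

```python
MIDDLE_DOT = "\u00b7"  # the "·" character


def show_spaces(value: str) -> str:
    """Replace every space with a middle dot (tabs/newlines are left alone)."""
    return value.replace(" ", MIDDLE_DOT)


def has_edge_whitespace(value: str) -> bool:
    """True if the string starts or ends with whitespace."""
    return value != value.strip()


def quote_if_edge_whitespace(value: str) -> str:
    """Add double quotes only when leading/trailing whitespace is present."""
    return f'"{value}"' if has_edge_whitespace(value) else value


print(show_spaces("hi there"))           # hi·there
print(show_spaces("  space on left"))    # ··space·on·left
print(quote_if_edge_whitespace("abc"))   # abc
print(quote_if_edge_whitespace("abc "))  # "abc "
```

Note that the edge-whitespace check alone does not flag the empty string, so "" would still render as nothing; the Tibble-style rule covers that case.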

liufeimath commented 1 month ago

> Disambiguating a numeric from a str column is still pretty easy because the data type of each column is shown.

By the same logic I could say "we should remove the .0 in the float output since the type already says it's a float, not an int, and the extra .0 is just noise". This is not a good argument: the type in the header is not "local" information, and you shouldn't expect the user to move their eyes ten lines up to look for it. It's the same reason the "public:/private:" specifiers in C++ are so bad: they are not local enough, and the reader may have to scroll tens or hundreds of lines to find the "environment variable" that determines an entity's attribute. In contrast, Rust is much better, since you can tell whether a function is public or private simply by the presence of the pub keyword before its declaration. That's a bit of a digression, but the same argument applies here: if I can tell from the quotes that it's a string and not an int, why force the user to look at the header, which could be many lines above? A string is a string, an int is an int, a float is a float. Let the output be self-documenting. Not to mention all the spacing issues described above.

But again, taking a step back, I think it's crucial to be consistent, which is way more important than my argument above. One way or the other, we need to pick one. The quoting inconsistency across str and html formats does need a fix.

cbrnr commented 1 month ago

Yes, and the fix should be no quotes IMO.

hoechenberger commented 1 month ago

Whatever the decision (quotes or no quotes), I'd appreciate it if it was consistent between HTML and string repr :)

agossard commented 1 month ago

Could this be a configuration setting that you can toggle with pl.Config, which already controls other aspects of how a dataframe is represented in Jupyter?

deanm0000 commented 1 month ago

I think there's a case to keep them inconsistent between str and html. The str representation is monospaced so having 2 quotes takes up a lot more space on the screen than the html version so it's not so cumbersome to have quotes in the html representation. Just to flag this...if the str representation gets changed then the pl.from_repr function needs to be updated.

liufeimath commented 1 month ago

> I think there's a case to keep them inconsistent between str and html. The str representation is monospaced so having 2 quotes takes up a lot more space on the screen than the html version so it's not so cumbersome to have quotes in the html representation. Just to flag this...if the str representation gets changed then the pl.from_repr function needs to be updated.

This is a really weak argument to defend the inconsistency.

I agree that pl.from_repr also needs a change. My suggestion is that the Series/DataFrame str representations should emulate what the HTML representations do, where a Series is just treated as a 1-column DataFrame with minimal modifications (the shape tuple) (code pointers here and here). I mentioned this in another issue related to this one. Overall, I think it would be much easier to unify the representation styles and the Series/DataFrame logic; that should help with consistency and appearance, and make the code easier to maintain.

deanm0000 commented 1 month ago

I'm certainly biased towards the status quo, I won't deny that. If you erased my memory then I can't imagine defending inconsistency. While I acknowledge that, I like the quotes in vsc and I like being able to do print and get the monospacing output without quotes. If you told me I had to choose both with quotes or both without I don't know which I'd prefer.