Open wesm opened 5 years ago
This is on git master
Update: this seems to only be happening inside a Docker container since it's failing to detect the console dimensions. I guess the "headless IPython" use case is pretty unusual
The workaround is to set pd.options.display.max_colwidth = 70 or something similar
Here's another one:
-In [163]: info[:5]
-Out[163]:
- description group id \
-0 Cheese, caraway Dairy and Egg Products 1008
-1 Cheese, cheddar Dairy and Egg Products 1009
-2 Cheese, edam Dairy and Egg Products 1018
-3 Cheese, feta Dairy and Egg Products 1019
-4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
- manufacturer
-0
-1
-2
-3
-4
+In [164]: info[:5]
+Out[164]:
+ description ... manufacturer
+0 Cheese, caraway ...
+1 Cheese, cheddar ...
+2 Cheese, edam ...
+3 Cheese, feta ...
+4 Cheese, mozzarella, part skim milk ...
What is the option to do the old behavior (print the overflowing columns on the next line)?
It appears I can make more of the problems go away with this (in case anyone stumbles on this thread later):
export COLUMNS=80
export LINES=50
stty cols $COLUMNS rows $LINES
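For anyone who wants to check what those variables do from inside Python, here is a minimal sketch using only the standard library. It shows that shutil.get_terminal_size() honors COLUMNS and LINES when they are set to positive values. Whether a particular pandas version reads the terminal size through this same call is an assumption, not something confirmed in this thread.

```python
import os
import shutil

# Mimic `export COLUMNS=80` / `export LINES=50` for this process only.
os.environ["COLUMNS"] = "80"
os.environ["LINES"] = "50"

# shutil.get_terminal_size() consults COLUMNS and LINES before it queries
# the terminal itself, so the values set above take precedence.
size = shutil.get_terminal_size()
print(size.columns, size.lines)
```

If the variables are unset or not positive, the function queries the terminal and finally falls back to its `fallback` argument.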
Another example of odd formatting, even though the console width is set to 80:
.....: columns='smoker', margins=True)
-Out[132]:
- size tip_pct
-smoker No Yes All No Yes All
-time day
-Dinner Fri 2.000000 2.222222 2.166667 0.139622 0.165347 0.158916
- Sat 2.555556 2.476190 2.517241 0.158048 0.147906 0.153152
- Sun 2.929825 2.578947 2.842105 0.160113 0.187250 0.166897
- Thur 2.000000 NaN 2.000000 0.159744 NaN 0.159744
-Lunch Fri 3.000000 1.833333 2.000000 0.187735 0.188937 0.188765
- Thur 2.500000 2.352941 2.459016 0.160311 0.163863 0.161301
-All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803</programlisting>
+Out[133]:
+ size ... tip_pct
+smoker No Yes ... Yes All
+time day ...
+Dinner Fri 2.000000 2.222222 ... 0.165347 0.158916
+ Sat 2.555556 2.476190 ... 0.147906 0.153152
+ Sun 2.929825 2.578947 ... 0.187250 0.166897
+ Thur 2.000000 NaN ... NaN 0.159744
+Lunch Fri 3.000000 1.833333 ... 0.188937 0.188765
+ Thur 2.500000 2.352941 ... 0.163863 0.161301
+All 2.668874 2.408602 ... 0.163196 0.160803
+[7 rows x 6 columns]</programlisting>
Another example of a dataset that looks fine in 0.20.x but not in 0.23.x:
-In [74]: data
-Out[74]:
- user_id movie_id rating timestamp gender age occupation zip \
-0 1 1193 5 978300760 F 1 10 48067
-1 2 1193 5 978298413 M 56 16 70072
-2 12 1193 4 978220179 M 25 12 32793
-3 15 1193 4 978199279 M 25 7 22903
-4 17 1193 5 978158471 M 50 1 95350
-... ... ... ... ... ... ... ... ...
-1000204 5949 2198 5 958846401 M 18 17 47901
-1000205 5675 2703 3 976029116 M 35 14 30030
-1000206 5780 2845 1 958153068 M 18 17 92886
-1000207 5851 3607 5 957756608 F 18 20 55410
-1000208 5938 2909 4 957273353 M 25 1 35401
- title genres
-0 One Flew Over the Cuckoo's Nest (1975) Drama
-1 One Flew Over the Cuckoo's Nest (1975) Drama
-2 One Flew Over the Cuckoo's Nest (1975) Drama
-3 One Flew Over the Cuckoo's Nest (1975) Drama
-4 One Flew Over the Cuckoo's Nest (1975) Drama
-... ... ...
-1000204 Modulations (1998) Documentary
-1000205 Broken Vessels (1998) Drama
-1000206 White Boys (1999) Drama
-1000207 One Little Indian (1973) Comedy|Drama|Western
-1000208 Five Wives, Three Secretaries and Me (1998) Documentary
-[1000209 rows x 10 columns]
+ <programlisting language="python" format="linespecific">In [74]: data = pd.merge(pd.merge(ratings, users), movies)
-In [75]: data.iloc[0]
+In [75]: data
Out[75]:
+ user_id ... genres
+0 1 ... Drama
+1 2 ... Drama
+2 12 ... Drama
+3 15 ... Drama
+4 17 ... Drama
+... ... ... ...
+1000204 5949 ... Documentary
+1000205 5675 ... Drama
+1000206 5780 ... Drama
+1000207 5851 ... Comedy|Drama|Western
+1000208 5938 ... Documentary
+[1000209 rows x 10 columns]
cc @jreback
I can try to dig into this, but if there's a temporary workaround it would be helpful to know. This is the last item blocking me from releasing the accumulated errata fixes for publication.
Does doing pd.options.display.max_columns = 20 not give you what you are looking for, namely the behavior before #17023?
Yes, doing that (pd.options.display.max_columns = 20) should restore the old behaviour. See also the whatsnew entry here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#better-pretty-printing-of-dataframes-in-a-terminal
That said, I have personally found the new truncating behaviour somewhat annoying on several occasions.
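A small sketch of the max_columns = 20 setting suggested above, scoped with pd.option_context so that nothing changes globally. The DataFrame is made-up illustration data (six columns of 20-character strings), chosen only because it cannot fit in 80 characters. The exact layout may differ between pandas versions.

```python
import pandas as pd

# Made-up data: six columns of 20-character strings, too wide for 80 characters.
df = pd.DataFrame({f"column_{i}": ["x" * 20] * 3 for i in range(6)})

# Scope the settings to this block instead of changing them globally.
# With a positive max_columns, overflowing columns wrap onto following lines.
with pd.option_context("display.max_columns", 20, "display.width", 80):
    wrapped = repr(df)

print(wrapped)
```

Setting pd.options.display.max_columns = 20 at the top of a session has the same effect for every later repr.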
The workaround is to set pd.options.display.max_colwidth = 70 or something similar
I don't think this should influence that
FWIW I think the new behavior is worse. The old behavior is more consistent with what other projects (like R) do IIUC
I also think the assertion that it is "better" or that the old behavior "was relatively difficult to read" is truly in the eye of the beholder / very subjective
Just today a reader was confused by the default options
I think to your point before this is a rather subjective call. From personal experience I found the new display options worse at first but changed my stance over time.
Not sure how reasonable this is but does it make sense in the book somewhere to introduce the display options and maybe set the option to 20 globally to reduce the amount of changes?
That's what I'm planning to do...
In general, I would be hesitant to describe a subjective change such as this as "better" -- it would be more accurate to say that the pandas developers' consensus was to change the formatting. Instead of saying that something is "difficult" or "easy" it would be better to say that "many users feel that..." or "some people have said that...". If other people feel differently than I do, I respect their opinion. Notice that I say "I think the behavior is worse" (my opinion) and not "The behavior is worse".
I personally often run into specific situations where the new default style is IMO quite a bit worse than before.
E.g., compare this:
In [8]: import geopandas
...: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
In [9]: pd.options.display.max_columns = 0 # default
In [17]: world.head()
Out[17]:
pop_est ...
geometry
0 28400000.0 ... POLYGON ((61.2
1081709172574 35.65007233330923,...
1 12799293.0 ... (POLYGON ((16.
32652835456705 -5.87747039146621...
2 3639453.0 ... POLYGON ((20.5
9024743010491 41.85540416113361,...
3 4798491.0 ... POLYGON ((51.5
7951867046327 24.24549713795111,...
4 40913584.0 ... (POLYGON ((-65
.50000000000003 -55.199999999999...
[5 rows x 6 columns]
with this (the old default):
In [18]: pd.options.display.max_columns = 20
In [19]: world.head()
Out[19]:
pop_est continent name iso_a3 gdp_md_est \
0 28400000.0 Asia Afghanistan AFG 22270.0
1 12799293.0 Africa Angola AGO 110300.0
2 3639453.0 Europe Albania ALB 21810.0
3 4798491.0 Asia United Arab Emirates ARE 184300.0
4 40913584.0 South America Argentina ARG 573900.0
geometry
0 POLYGON ((61.21081709172574 35.65007233330923,...
1 (POLYGON ((16.32652835456705 -5.87747039146621...
2 POLYGON ((20.59024743010491 41.85540416113361,...
3 POLYGON ((51.57951867046327 24.24549713795111,...
4 (POLYGON ((-65.50000000000003 -55.199999999999...
This might be quite specific to this use case: the 'geometry' column is often a wide (truncated) column, which quickly gives a case where the columns don't fit on one line.
I would also say that the current behaviour is a bug: the output is too wide for the specific console size I was doing this in (and this size was correctly detected), so you get an overflow, although there is a lot of unneeded whitespace between the two columns.
@pandas-dev/pandas-core @cbrnr now that the new default has been out there for a while, what are the opinions about it in general? Mostly an improvement?
(it's quite possible that the specific cases I am working with like the one above are not very representative, and that in many other cases it is an improvement)
I'd be in favor of reverting
For me the new behavior has worked quite well (standard terminal on macOS). However, it seems like there are still some rough edges in the way the number of columns is detected (and maybe in some environments this is even impossible). All counter-examples where the new behavior is worse are actually cases where automatic detection is not working correctly (or so it seems), because lines shouldn't overflow and the available columns should be used as efficiently as possible.
In cases where the number of columns cannot be detected, pandas should default to the old behavior of 20 columns. If headless IPython doesn't work and can be detected somehow, this case should be added to default to the old setting.
@wesm I agree that this is a subjective decision, and it is especially bad that the new behavior doesn't work as expected in many cases. I still think that if the new behavior works, it is a more convenient overview of a data frame. Note that one motivation of this change is that this is the default behavior in the Tidyverse - so whereas you are right that base R does behave like the old setting, Tidyverse packages (tibble) behave like the new setting (except that it seems to work more consistently because there is really only one IDE that people are using, namely RStudio).
I still think a .pandasrc file would go a long way in making this debate obsolete, because then a default setting wouldn't be that important anymore.
Note: my example above is a case where the console width is correctly detected. But apparently there are cases where we generate a repr with the new default that is still wider than that.
This is of course limited to IPython-based environments (console, notebook), but you can already put that in a startup script. And even if we have something like that, many users are not aware of the fact they can change the display behaviour, and we still need to make sure the default behaviour (what users see at first) is a good one in general.
This is of course limited to ipython-based environments (console, notebook), but you can already put that in a startup script.
I just don't want to import pandas whenever I start IPython/Jupyter because I don't always use it.
And even if we have something like that, many users are not aware of the fact they can change the display behaviour, and we still need to make sure the default behaviour (what users see at first) is a good one in general.
100% agreed. If the old setting works better for some corner cases, then the new setting should either be fixed (meaning that the repr should never exceed the console width; it looks like this could be related to how the column widths are determined), or pandas should revert to the old setting.
Another example that might be a bug, in a DataCamp integrated console:
But this one is very strange, as I don't see this locally: even if I reduce my terminal width to 80, it still shows 6 out of 10 columns (since that is what fits in a width of 80 characters). @cbrnr do you know: if we cannot detect the terminal width (which is the case here), it should default to 80, no?
I'm not sure. The behavior depends on pd.io.formats.terminal.is_terminal: this function detects whether Python is running under IPython. If not, it assumes that it is in a terminal and thus returns True (this might be problematic). If it detects IPython, it only returns True if there is no Jupyter kernel attached; otherwise it returns False. The new behavior is only activated if is_terminal returns True; otherwise the default is set to 20.
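The detection logic described in this comment can be restated as a short sketch. This is a reading of the description, not a copy of the pandas source. The names is_terminal_sketch and default_max_columns are hypothetical, and the IPython lookup is passed in as a parameter so the branches can be exercised without IPython installed.

```python
from typing import Any, Callable


def is_terminal_sketch(get_ipython: Callable[[], Any]) -> bool:
    """Hypothetical restatement of the detection logic described above."""
    try:
        ip = get_ipython()
    except NameError:
        # Not running under IPython: assume a plain terminal.
        return True
    # Under IPython, an attached Jupyter kernel means "not a terminal".
    return not hasattr(ip, "kernel")


def default_max_columns(in_terminal: bool) -> int:
    # 0 asks for columns to be fitted to the detected width; 20 is the old default.
    return 0 if in_terminal else 20


# Stand-ins for the three environments (all hypothetical).
def no_ipython() -> Any:
    raise NameError("get_ipython")


class ConsoleShell:
    """IPython console: no kernel attribute."""


class KernelShell:
    """Jupyter kernel: has a kernel attribute."""
    kernel = object()


for getter in (no_ipython, ConsoleShell, KernelShell):
    in_terminal = is_terminal_sketch(getter)
    print(getter.__name__, in_terminal, default_max_columns(in_terminal))
```

Note that the plain-Python branch returns True without ever checking that a terminal is attached, which is the "might be problematic" part.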
In the Datacamp shell, pd.io.formats.terminal.is_terminal() returns True, but pd.io.formats.terminal.get_terminal_size() is os.terminal_size(columns=0, lines=0). I guess is_terminal should take the output of get_terminal_size into account and return False if it has values of 0.
Yep, that seems to be the case. +1
In addition, even if we switch back to 20 columns in such cases, we still need a value for the number of columns. Is this currently 80? This might also lead to problems even with the old behavior if we guess the wrong value.
Normally, we fall back to pd.options.display.width, which has a default of 80 (but that is also something you can set).
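Putting the last few comments together, the proposed guard could look roughly like the sketch below. It is only an illustration of the suggestion, not pandas code. The function display_defaults and its parameters are hypothetical; 20 and 80 are the defaults mentioned in this thread.

```python
import os
from typing import Tuple


def display_defaults(
    size: os.terminal_size,
    in_terminal: bool,
    configured_width: int = 80,
) -> Tuple[int, int]:
    """Return (max_columns, width) under the proposal above (hypothetical helper)."""
    # A size of 0 in either dimension is treated as "detection failed".
    detected = size.columns > 0 and size.lines > 0
    if in_terminal and detected:
        # New behavior: fit as many columns as the detected width allows.
        return 0, size.columns
    # Old behavior: up to 20 columns, wrapped at the configured width.
    return 20, configured_width


# The case reported for the Datacamp shell: a terminal is assumed, size is 0x0.
print(display_defaults(os.terminal_size((0, 0)), in_terminal=True))
# A terminal whose size was detected correctly.
print(display_defaults(os.terminal_size((120, 40)), in_terminal=True))
```

As noted above, the fallback width is still a guess, so a wrong value could cause trouble even with the old 20-column behavior.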
Working on book updates and was surprised to see pandas do this:
before
after
Why is the middle column being hidden in a small 3x3 data frame?