pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Weird console formatting defaults #22524

Open wesm opened 5 years ago

wesm commented 5 years ago

Working on book updates and was surprised to see pandas do this:

before

 In [99]: frame
 Out[99]: 
-   Ohio  Texas  California
-a     0      1           2
-c     3      4           5
-d     6      7           8

after

+   Ohio     ...      California
+a     0     ...               2
+c     3     ...               5
+d     6     ...               8
+[3 rows x 3 columns]

Why is the middle column being hidden in a small 3x3 data frame?

wesm commented 5 years ago

This is on git master

wesm commented 5 years ago

Update: this seems to happen only inside a Docker container, where pandas fails to detect the console dimensions. I guess the "headless IPython" use case is pretty unusual.

wesm commented 5 years ago

The workaround is to set pd.options.display.max_colwidth = 70 or something similar

wesm commented 5 years ago

Here's another one:

-In [163]: info[:5]
-Out[163]: 
-                          description                   group    id  \
-0                     Cheese, caraway  Dairy and Egg Products  1008   
-1                     Cheese, cheddar  Dairy and Egg Products  1009   
-2                        Cheese, edam  Dairy and Egg Products  1018   
-3                        Cheese, feta  Dairy and Egg Products  1019   
-4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028   
-  manufacturer  
-0               
-1               
-2               
-3               
-4               
+In [164]: info[:5]
+Out[164]: 
+                          description     ...      manufacturer
+0                     Cheese, caraway     ...                  
+1                     Cheese, cheddar     ...                  
+2                        Cheese, edam     ...                  
+3                        Cheese, feta     ...                  
+4  Cheese, mozzarella, part skim milk     ...                  

Which option restores the old behavior (printing the overflowing columns on the next line)?

wesm commented 5 years ago

It appears I can make more of the problems go away with this (in case anyone stumbles on this thread later):

export COLUMNS=80
export LINES=50
stty cols $COLUMNS rows $LINES
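For anyone checking this from Python: the standard library's shutil.get_terminal_size() consults the COLUMNS and LINES environment variables before it queries the terminal, which is presumably why the exports above help. A minimal sketch, assuming the pandas version in use relies on that standard-library lookup (this thread does not confirm that for every version):

```python
import os
import shutil

# Mirror the shell exports above from inside Python.
os.environ["COLUMNS"] = "80"
os.environ["LINES"] = "50"

# shutil.get_terminal_size() checks COLUMNS / LINES first and only asks
# the terminal itself when they are missing or not positive integers.
size = shutil.get_terminal_size()
print(size.columns, size.lines)
```

The stty call covers the other path: it sets the size on the terminal device, which is what gets queried when the environment variables are absent.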

wesm commented 5 years ago

Another example of odd formatting even though the console width is set to 80

    .....:                  columns='smoker', margins=True)
-Out[132]: 
-                 size                       tip_pct                    
-smoker             No       Yes       All        No       Yes       All
-time   day                                                             
-Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
-       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
-       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
-       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
-Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
-       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
-All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803</programlisting>
+Out[133]: 
+                 size              ...      tip_pct          
+smoker             No       Yes    ...          Yes       All
+time   day                         ...                       
+Dinner Fri   2.000000  2.222222    ...     0.165347  0.158916
+       Sat   2.555556  2.476190    ...     0.147906  0.153152
+       Sun   2.929825  2.578947    ...     0.187250  0.166897
+       Thur  2.000000       NaN    ...          NaN  0.159744
+Lunch  Fri   3.000000  1.833333    ...     0.188937  0.188765
+       Thur  2.500000  2.352941    ...     0.163863  0.161301
+All          2.668874  2.408602    ...     0.163196  0.160803
+[7 rows x 6 columns]</programlisting>

wesm commented 5 years ago

Another example of a dataset that looks fine in 0.20.x and not in 0.23.x

-In [74]: data
-Out[74]: 
-         user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
-0              1      1193       5  978300760      F    1          10  48067   
-1              2      1193       5  978298413      M   56          16  70072   
-2             12      1193       4  978220179      M   25          12  32793   
-3             15      1193       4  978199279      M   25           7  22903   
-4             17      1193       5  978158471      M   50           1  95350   
-...          ...       ...     ...        ...    ...  ...         ...    ...   
-1000204     5949      2198       5  958846401      M   18          17  47901   
-1000205     5675      2703       3  976029116      M   35          14  30030   
-1000206     5780      2845       1  958153068      M   18          17  92886   
-1000207     5851      3607       5  957756608      F   18          20  55410   
-1000208     5938      2909       4  957273353      M   25           1  35401   
-                                               title                genres  
-0             One Flew Over the Cuckoo's Nest (1975)                 Drama  
-1             One Flew Over the Cuckoo's Nest (1975)                 Drama  
-2             One Flew Over the Cuckoo's Nest (1975)                 Drama  
-3             One Flew Over the Cuckoo's Nest (1975)                 Drama  
-4             One Flew Over the Cuckoo's Nest (1975)                 Drama  
-...                                              ...                   ...  
-1000204                           Modulations (1998)           Documentary  
-1000205                        Broken Vessels (1998)                 Drama  
-1000206                            White Boys (1999)                 Drama  
-1000207                     One Little Indian (1973)  Comedy|Drama|Western  
-1000208  Five Wives, Three Secretaries and Me (1998)           Documentary  
-[1000209 rows x 10 columns]
+    <programlisting language="python" format="linespecific">In [74]: data = pd.merge(pd.merge(ratings, users), movies)

-In [75]: data.iloc[0]
+In [75]: data
 Out[75]: 
+         user_id          ...                         genres
+0              1          ...                          Drama
+1              2          ...                          Drama
+2             12          ...                          Drama
+3             15          ...                          Drama
+4             17          ...                          Drama
+...          ...          ...                            ...
+1000204     5949          ...                    Documentary
+1000205     5675          ...                          Drama
+1000206     5780          ...                          Drama
+1000207     5851          ...           Comedy|Drama|Western
+1000208     5938          ...                    Documentary
+[1000209 rows x 10 columns]

gfyoung commented 5 years ago

cc @jreback

wesm commented 5 years ago

I can try to dig into this, but if there's a temporary workaround it would be helpful to know. This is the last item blocking me from releasing accumulated errata fixes for publication.

WillAyd commented 5 years ago

Does setting pd.options.display.max_columns = 20 not give you what you are looking for, namely the behavior before #17023?

jorisvandenbossche commented 5 years ago

Yes, doing that (pd.options.display.max_columns = 20) should restore the old behaviour. See also the whatsnew entry here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#better-pretty-printing-of-dataframes-in-a-terminal
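A minimal sketch of that setting, using a 3x3 frame like the one at the top of this issue. The frame contents and the explicit display.width of 80 are assumptions made here for reproducibility; option_context is used so the change is undone afterwards:

```python
import pandas as pd

# Small frame resembling the first example in this thread (values assumed).
frame = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# max_columns = 20 is the old default discussed here; 0 means
# "fit the number of columns to the detected terminal width".
with pd.option_context("display.max_columns", 20, "display.width", 80):
    text = repr(frame)

print(text)
```

To make the setting permanent for a session, assign pd.options.display.max_columns = 20 directly instead of using the context manager.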

That said, I have personally found the new truncating behaviour somewhat annoying on several occasions.

> The workaround is to set pd.options.display.max_colwidth = 70 or something similar

I don't think that option should influence this truncation.

wesm commented 5 years ago

FWIW I think the new behavior is worse. The old behavior is more consistent with what other projects (like R) do IIUC

wesm commented 5 years ago

I also think the assertion that it is "better" or that the old behavior "was relatively difficult to read" is truly in the eye of the beholder / very subjective

wesm commented 5 years ago

Just today a reader was confused by the default options

https://github.com/wesm/pydata-book/issues/95

WillAyd commented 5 years ago

To your earlier point, I think this is a rather subjective call. From personal experience, I found the new display options worse at first but changed my stance over time.

Not sure how reasonable this is, but would it make sense to introduce the display options somewhere in the book and maybe set max_columns to 20 globally to reduce the number of changes?

wesm commented 5 years ago

That's what I'm planning to do...

In general, I would be hesitant to describe a subjective change such as this as "better" -- it would be more accurate to say that the pandas developers' consensus was to change the formatting. Instead of saying that something is "difficult" or "easy" it would be better to say that "many users feel that..." or "some people have said that...". If other people feel differently than I do, I respect their opinion. Notice that I say "I think the behavior is worse" (my opinion) and not "The behavior is worse"

jorisvandenbossche commented 5 years ago

I personally often run into specific situations where the new default style is IMO quite a bit worse than before.

E.g., compare this:

In [8]: import geopandas 
   ...: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')) 

In [9]: pd.options.display.max_columns = 0   # default

In [17]: world.head()                                                           
Out[17]: 
      pop_est                        ...                                                         
          geometry
0  28400000.0                        ...                          POLYGON ((61.2
1081709172574 35.65007233330923,...
1  12799293.0                        ...                          (POLYGON ((16.
32652835456705 -5.87747039146621...
2   3639453.0                        ...                          POLYGON ((20.5
9024743010491 41.85540416113361,...
3   4798491.0                        ...                          POLYGON ((51.5
7951867046327 24.24549713795111,...
4  40913584.0                        ...                          (POLYGON ((-65
.50000000000003 -55.199999999999...

[5 rows x 6 columns]

with this (the old default):

In [18]: pd.options.display.max_columns = 20                                    

In [19]: world.head()                                                           
Out[19]: 
      pop_est      continent                  name iso_a3  gdp_md_est  \
0  28400000.0           Asia           Afghanistan    AFG     22270.0   
1  12799293.0         Africa                Angola    AGO    110300.0   
2   3639453.0         Europe               Albania    ALB     21810.0   
3   4798491.0           Asia  United Arab Emirates    ARE    184300.0   
4  40913584.0  South America             Argentina    ARG    573900.0   

                                            geometry  
0  POLYGON ((61.21081709172574 35.65007233330923,...  
1  (POLYGON ((16.32652835456705 -5.87747039146621...  
2  POLYGON ((20.59024743010491 41.85540416113361,...  
3  POLYGON ((51.57951867046327 24.24549713795111,...  
4  (POLYGON ((-65.50000000000003 -55.199999999999...  

This might be quite specific to this use case: the 'geometry' column is often wide (and truncated), so the columns quickly stop fitting on one line.

I would also say that the current behaviour is a bug: the output is too wide for the specific console size I was doing this in (and this size was correctly detected), so you get an overflow, although there is a lot of unneeded whitespace between the two columns.

jorisvandenbossche commented 5 years ago

@pandas-dev/pandas-core @cbrnr now that the new default has been out there for a while, what are your opinions about it in general? Mostly an improvement?

(it's quite possible that the specific cases I am working with like the one above are not very representative, and that in many other cases it is an improvement)

WillAyd commented 5 years ago

I'd be in favor of reverting

cbrnr commented 5 years ago

For me the new behavior has worked quite well (standard terminal on macOS). However, it seems like there are still some rough edges in the way the number of columns is detected (and maybe in some environments this is even impossible). All counter-examples where the new behavior is worse are actually cases where automatic detection is not working correctly (or so it seems), because lines shouldn't overflow and the available columns should be used as efficiently as possible.

In cases where the number of columns cannot be detected, pandas should default to the old behavior of 20 columns. If headless IPython doesn't work and can be detected somehow, this case should be added to default to the old setting.
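A rough sketch of that fallback rule, not pandas' actual implementation. The helper names detect_width and default_max_columns are hypothetical; only os.get_terminal_size is a real standard-library call:

```python
import os
import sys


def detect_width():
    """Return the terminal width in characters, or None if it is unknown."""
    try:
        columns = os.get_terminal_size(sys.__stdout__.fileno()).columns
    except (AttributeError, ValueError, OSError):
        # No usable stdout, or stdout is not attached to a terminal.
        return None
    # Some environments report a width of 0; treat that as "unknown" too.
    return columns if columns > 0 else None


def default_max_columns(width):
    """Hypothetical rule: auto-fit (0) only when the width is known."""
    return 0 if width is not None else 20


print(default_max_columns(detect_width()))
```

The point of separating detection from the decision is that an undetectable width and a reported width of 0 end up on the same path, the old default of 20.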

@wesm I agree that this is a subjective decision, and it is especially bad that the new behavior doesn't work as expected in many cases. I still think that if the new behavior works, it is a more convenient overview of a data frame. Note that one motivation of this change is that this is the default behavior in the Tidyverse - so whereas you are right that base R does behave like the old setting, Tidyverse packages (tibble) behave like the new setting (except that it seems to work more consistently because there is really only one IDE that people are using, namely RStudio).

I still think a .pandasrc file would go a long way in making this debate obsolete, because then a default setting wouldn't be that important anymore.

jorisvandenbossche commented 5 years ago

Note: my example above is a case where the console width is correctly detected. But apparently there are cases where we generate a repr with the new default that is still wider than that.

> I still think a .pandasrc file would go a long way in making this debate obsolete, because then a default setting wouldn't be that important anymore.

This is of course limited to IPython-based environments (console, notebook), but you can already put that in a startup script. And even if we had something like that: many users are not aware that they can change the display behaviour, so we still need to make sure the default behaviour (what users see at first) is a good one in general.
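For reference, such a startup script could look roughly like this. IPython runs the Python files in the profile's startup directory (~/.ipython/profile_default/startup/ for the default profile); the file name and the choice of 20 are assumptions for illustration:

```python
# Hypothetical file: ~/.ipython/profile_default/startup/00-pandas-display.py

try:
    import pandas as pd
except ImportError:
    # pandas is not installed in this environment; nothing to configure.
    pd = None

if pd is not None:
    # Restore the old default discussed in this thread.
    pd.set_option("display.max_columns", 20)
```

Note that this imports pandas on every IPython start, whether or not the session ends up using it.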

cbrnr commented 5 years ago

> This is of course limited to ipython-based environments (console, notebook), but you can already put that in a startup script.

I just don't want to import pandas whenever I start IPython/Jupyter because I don't always use it.

> And even if we have something like: many users are not aware of the fact they can change the display behaviour, and we still need to make sure the default behaviour (what users see at first) is a good one in general.

100% agreed. If the old setting works better for some corner cases, then the new setting should either be fixed (meaning that the repr should never exceed the console width; it looks like this could be related to how the column widths are determined), or pandas should revert to the old setting.

jorisvandenbossche commented 5 years ago

Another example that might be a bug, in a DataCamp integrated console:

[Screenshot: screenshot_2019-01-24 map of tree density by district 1 python]

This one is very strange, though: I don't see it locally. Even if I reduce my terminal width to 80, pandas still shows 6 out of 10 columns (since that is what fits in 80 characters). @cbrnr, do you know: if we cannot detect the terminal width (which is the case here), shouldn't it default to 80?

cbrnr commented 5 years ago

I'm not sure. The behavior depends on pd.io.formats.terminal.is_terminal, which detects whether Python is running under IPython. If not, it assumes that it is in a terminal and returns True (this might be problematic). If it detects IPython, it returns True only if no Jupyter kernel is attached; otherwise it returns False. The new behavior is only activated if is_terminal returns True; otherwise the default is set to 20.

In the DataCamp shell, pd.io.formats.terminal.is_terminal() returns True, but pd.io.formats.terminal.get_terminal_size() returns os.terminal_size(columns=0, lines=0). I guess is_terminal should take the output of get_terminal_size into account and return False if either value is 0.
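The check described above, with the proposed size test added, could be sketched as follows. This is an illustration, not the pandas source: detect_ipython is a hypothetical helper, and recognising a Jupyter kernel by a `kernel` attribute on the IPython shell object is an assumption:

```python
import os
import shutil


def detect_ipython():
    """Return the active IPython shell object, or None in plain Python."""
    try:
        return get_ipython()  # noqa: F821 -- only defined inside IPython
    except NameError:
        return None


def is_terminal(size, ip=None):
    """Sketch: True only for a terminal whose size was actually detected."""
    # Assumption: an attached Jupyter kernel shows up as a `kernel` attribute.
    if ip is not None and hasattr(ip, "kernel"):
        return False
    # Proposed addition: a reported size of 0 means detection failed.
    return size.columns > 0 and size.lines > 0


print(is_terminal(shutil.get_terminal_size(), detect_ipython()))
print(is_terminal(os.terminal_size((0, 0))))  # the DataCamp case above
```

Passing the size and the shell object in as arguments keeps the decision itself easy to exercise with fake values such as the (0, 0) size reported here.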

jorisvandenbossche commented 5 years ago

> In the Datacamp shell, pd.io.formats.terminal.is_terminal() returns True, but pd.io.formats.terminal.get_terminal_size() is os.terminal_size(columns=0, lines=0). I guess is_terminal should take the output of get_terminal_size into account and return False if it has values of 0.

Yep, that seems to be the case. +1

cbrnr commented 5 years ago

In addition, even if we switch back to 20 columns in such cases, we still need a value for the terminal width (the number of character columns). Is this currently 80? Guessing the wrong value might also lead to problems even with the old behavior.

jorisvandenbossche commented 5 years ago

Normally, we fall back to pd.options.display.width, which has a default of 80 (but that is also something you can set).
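A short sketch of inspecting and overriding that option. The value 120 is an arbitrary example, and reset_option is called first only so the printed default does not depend on earlier session settings:

```python
import pandas as pd

# Put display.width back to its built-in default before reading it.
pd.reset_option("display.width")
default_width = pd.get_option("display.width")
print(default_width)

# Temporarily use a wider fallback width; restored when the block exits.
with pd.option_context("display.width", 120):
    wide = pd.get_option("display.width")

print(wide, pd.get_option("display.width"))
```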