pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DataFrame rows printed incorrectly as ... (ellipsis) in v0.23 #21337

Open greglandrum opened 6 years ago

greglandrum commented 6 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.show_versions()
pd.set_option('display.width', 10000)
pd.set_option('display.max_colwidth', 10000)
d = [['CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C', 'Penicilline G'],
     ['CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O',
      'Tetracycline'], ['CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C', 'Ampicilline']]
antibiotics = pd.DataFrame(d, columns=['Smiles', 'Name'])
print('1---------------------')
print(antibiotics)
antibiotics2 = pd.DataFrame([(y, x) for x, y in d], columns=['Name', 'Smiles'])
print('2---------------------')
print(antibiotics2)

Problem description

Here's the output with v0.23:

glandrum@otter:/scratch/RDKit_git/rdkit/Chem$ python ~/foo.py

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
1---------------------
                                                                 Smiles           Name
0                       CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C  Penicilline G
1  CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O   Tetracycline
2                    CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C    Ampicilline
2---------------------
  ...
0 ...
1 ...
2 ...

[3 rows x 2 columns]

I believe that, since I am setting display.width and display.max_colwidth, I should not be seeing ellipses under any circumstances (I did not with v0.22, see below).

Additionally: it doesn't seem like the column order should make any difference.

Expected Output

Here's what I get with Pandas v0.22:

(py36_rdkit) glandrum@otter:/scratch/RDKit_git/rdkit/Chem$ python ~/foo.py

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
1---------------------
                                                                 Smiles           Name
0                       CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C  Penicilline G
1  CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O   Tetracycline
2                    CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C    Ampicilline
2---------------------
            Name                                                                Smiles
0  Penicilline G                       CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C
1   Tetracycline  CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O
2    Ampicilline                    CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C

Output of pd.show_versions()

see above

gfyoung commented 6 years ago

I actually cannot reproduce this on master (I get the same output as you describe for v0.22.0).

Perhaps we can add a test for this in the repository and close this out?

greglandrum commented 6 years ago

Can you reproduce it with v0.23.0?

greglandrum commented 6 years ago

hmm, I just tried on my Windows machine and it doesn't happen there in a bash shell, but does when I run it in whatever the "Anaconda Prompt" uses.

I have observed the problem under both Ubuntu 18.04 and CentOS 6 on the Linux side.

gfyoung commented 6 years ago

> hmm, I just tried on my Windows machine and it doesn't happen there. I have observed the problem under both Ubuntu 18.04 and Centos 6 on the linux side.

@greglandrum : Same type of deal here.

uds5501 commented 6 years ago

@greglandrum True that: I used a Windows machine and could not reproduce it.

Code:

image

I am using Linux now and still cannot reproduce it. Is there any other method you could suggest to triage this?

image

jorisvandenbossche commented 6 years ago

This is due to the change of the default of pandas.options.display.max_columns from 20 to 0. If you set this manually to 20, it should fix your problem. We should still find a solution, though, as not printing any data is a bit strange.
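The workaround can be sketched as follows. This is only a sketch of the suggestion above, not a fix for the underlying behaviour: it reuses the data from the report, and the helper name `render_full` is made up for illustration. It uses `pd.option_context` so that the settings apply only inside the `with` block instead of changing global state.

```python
import pandas as pd

# Data from the original report: (SMILES string, name) pairs.
d = [['CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C', 'Penicilline G'],
     ['CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O',
      'Tetracycline'],
     ['CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C', 'Ampicilline']]


def render_full(df, max_columns=20):
    """Return repr(df) with wide display settings (hypothetical helper)."""
    with pd.option_context(
        'display.width', 10000,
        'display.max_colwidth', 10000,
        # A non-zero value, as suggested in this thread.
        'display.max_columns', max_columns,
    ):
        return repr(df)


# Same column order as the second DataFrame in the report.
antibiotics2 = pd.DataFrame([(y, x) for x, y in d], columns=['Name', 'Smiles'])
text = render_full(antibiotics2)
print(text)
```

Outside the `with` block the options keep whatever values they had before, so other output in the same session is unaffected.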

greglandrum commented 6 years ago

Confirmed that setting pandas.options.display.max_columns to 20 fixes the problem for me. Thanks!

I understand setting the default to 0 so that things work nicely by default in the terminal or when using IPython, but it doesn't seem correct that the terminal-width auto-detection triggered by max_columns=0 overrides the value of pandas.options.display.width that I set.

jorisvandenbossche commented 6 years ago

If you had only used pd.set_option('display.max_colwidth', 10000), I think we could say it is the user's responsibility to also change the max_columns option appropriately (it is difficult to have a good default that also works well in combination with every other possible setting). But since you also set display.width, I would expect that to be honoured, as you say.
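When comparing machines, it may help to print the options involved next to the terminal size that Python detects. The snippet below is a small diagnostic sketch: `display_settings` is a hypothetical helper, and using `shutil.get_terminal_size` as a stand-in for whatever pandas consults internally is an assumption, not something confirmed in this thread.

```python
import shutil

import pandas as pd


def display_settings():
    """Collect the options discussed here plus the detected terminal size."""
    # get_terminal_size falls back to 80x24 by default when the size
    # cannot be determined (for example, when output is piped).
    size = shutil.get_terminal_size()
    return {
        'display.max_columns': pd.get_option('display.max_columns'),
        'display.width': pd.get_option('display.width'),
        'display.max_colwidth': pd.get_option('display.max_colwidth'),
        'terminal_columns': size.columns,
        'terminal_lines': size.lines,
    }


settings = display_settings()
for name, value in settings.items():
    print(f'{name}: {value}')
```

Running this in each environment (bash, Anaconda Prompt, a container) shows whether `display.max_columns` is 0 and how wide the terminal is reported to be.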

jorisvandenbossche commented 6 years ago

Welcome to look into it!

So the 'bug' is that with pd.options.display.max_columns = 0 it does not follow the specified pd.options.display.width:

In [8]: df = pd.DataFrame(np.random.randn(5,10))

In [9]: df   <----- default repr
Out[9]: 
          0         1         2         3    ...            6         7         8         9
0 -0.114989 -1.313691 -1.012763  1.210505    ...     1.741739 -0.293015 -0.518975 -0.046243
1 -1.001067 -0.896490 -0.106518  0.080232    ...    -0.845052  0.272609 -0.983768  0.963105
2  0.665218 -0.318560 -1.127493 -2.073078    ...    -0.421405  1.653298  0.989827  1.392743
3 -1.298340 -0.441758  1.551385 -1.389610    ...     1.113139  0.970295 -2.177596 -0.909323
4  0.206135  0.292685  1.570472 -0.065448    ...     0.780934  1.921372  0.256083 -0.499103

[5 rows x 10 columns]

In [10]: pd.options.display.width = 40

In [11]: df   <----- not honouring the width
Out[11]: 
          0         1         2         3    ...            6         7         8         9
0 -0.114989 -1.313691 -1.012763  1.210505    ...     1.741739 -0.293015 -0.518975 -0.046243
1 -1.001067 -0.896490 -0.106518  0.080232    ...    -0.845052  0.272609 -0.983768  0.963105
2  0.665218 -0.318560 -1.127493 -2.073078    ...    -0.421405  1.653298  0.989827  1.392743
3 -1.298340 -0.441758  1.551385 -1.389610    ...     1.113139  0.970295 -2.177596 -0.909323
4  0.206135  0.292685  1.570472 -0.065448    ...     0.780934  1.921372  0.256083 -0.499103

[5 rows x 10 columns]

In [12]: pd.options.display.max_columns = 20

In [13]: df   <----- now it does follow the width
Out[13]: 
          0         1         2  \
0 -0.114989 -1.313691 -1.012763   
1 -1.001067 -0.896490 -0.106518   
2  0.665218 -0.318560 -1.127493   
3 -1.298340 -0.441758  1.551385   
4  0.206135  0.292685  1.570472   

          3         4         5  \
0  1.210505  0.490016  0.990289   
1  0.080232 -0.490654  0.256616   
2 -2.073078  0.558430 -0.324658   
3 -1.389610  0.745468  0.544909   
4 -0.065448 -0.682855  0.377820   
...

cc @cbrnr
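For scripts that need wrapped output regardless of the display options, one possible way around the behaviour shown above is to call `DataFrame.to_string` with an explicit `line_width`, which takes the width as an argument. This is a sketch under that assumption; the seed and variable names are made up, and the random values will differ from the session above.

```python
import numpy as np
import pandas as pd

# Same shape as the example above; seeded so repeated runs match each other.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 10)))

# Ask for wrapping at 40 characters directly, instead of relying on
# display.width and display.max_columns.
wrapped = df.to_string(line_width=40)
print(wrapped)
```

The output is split into blocks of columns, with a trailing backslash marking each continued block, similar to the `Out[13]` output above.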

gfyoung commented 6 years ago

@jorisvandenbossche : I think we just need to test this on master, since I was unable to reproduce with master on both Windows and Linux.

jorisvandenbossche commented 6 years ago

I can reproduce this on master (and I also would not know what would have changed this since 0.23.0)

gfyoung commented 6 years ago

> I can reproduce this on master (and I also would not know what would have changed this since 0.23.0)

Huh...that's odd. I wonder if it's OS-specific? I just used Windows 10 and Ubuntu 14.04.

Marking as regression then.

jorisvandenbossche commented 6 years ago

Did you make sure to use a "small" terminal? (otherwise you don't have the problem)

gfyoung commented 6 years ago

> Did you make sure to use a "small" terminal? (otherwise you don't have the problem)

@jorisvandenbossche : Evidently not small enough 😄

jreback commented 6 years ago

note if someone really wants to block on this, pls move back to 0.23.1, but unless there is an active PR (or you really really want to delay things), pls dont

rschwiebert commented 6 years ago

More data points: @gfyoung It could be system-specific. I ran into this problem while testing an application locally on macOS 10.13.6 (where a four-column dataframe correctly displayed all columns) and in a Kubernetes pod running Ubuntu 16.04 (where it only displayed 3 of 4 columns). The code was identical for both, setting max_columns=5 and width=120. (pandas 0.23.3)