BUG: Corrupted excel output when zero in top left of DataFrame

pythonator commented 10 years ago

When DataFrame().values[0,0] is a 0 integer, the excel output is corrupted. When opening the excel file in Microsoft Excel 2010 I get:

Excel found unreadable content in 'a.xlsx'.
Do you want to recover the contents of this workbook?
If you trust the source of this workbook, click Yes.

When I click Yes, I get:

Excel was able to open the file by repairing or removing the unreadable content.
Removed Part: /xl/sharedStrings.xml part with XML error. 
(Strings) Illegal xml character. Line 1, column 95.
Removed Records: Cell information from /xl/worksheets/sheet1.xml part

The cell that should have been 0 is now empty.

Here is an example:

import pandas
import numpy as np
a = np.arange(25).reshape(5,5)
pandas.DataFrame(a).to_excel('./a.xlsx') # this is 'unreadable'
pandas.DataFrame(a.astype(float)).to_excel('./afloat.xlsx') # this is OK
a[0,0] = 1
pandas.DataFrame(a).to_excel('./a1.xlsx') # this is OK

Packages: (from 'pip list' as I couldn't find ci/print_versions.py mentioned in CONTRIBUTING) pandas (0.12.0) xlrd (0.9.2) openpyxl (1.6.2)

ghost commented 10 years ago

Thanks, I updated CONTRIBUTING.MD accordingly.

I can't reproduce this with 0.12.0, but my system has openpyxl 1.7.0. Can you upgrade and see if the problem goes away?

pythonator commented 10 years ago

Hi, I upgraded openpyxl to 1.8.0 (with pip), double checked that python was using it with openpyxl.__version__ and the excel file was still unreadable. On another note, my pandas.util package does not contain print_versions. (I double checked that its 0.12.0) I am using Windows 7 x64 if that matters. Here is my full pip list if it helps: Bottleneck (0.7.0) glumpy (0.2.1) gmpy2 (2.0.2) h5py (2.2.0) html5lib (0.99) ipython (1.1.0) Jinja2 (2.7.1) MarkupSafe (0.18) matplotlib (1.3.1) nose (1.3.0) numexpr (2.2.2) numpy (1.7.1) openpyxl (1.8.0) pandas (0.12.0) patsy (0.2.1) Pillow (2.2.1) pip (1.4.1) pycairo (1.10.0) Pygments (1.6) pyparsing (2.0.1) pyreadline (2.0) PySide (1.2.1) pytest (2.4.2) python-dateutil (2.1) pytools (2013.5.6) pyttsx (1.1) pytz (2013.7) pywin32 (218.4) pyzmq (13.1.0) requests (1.2.3) scipy (0.12.1) Scipy-stack (13.10.11) setuptools (1.1.6) simplejson (3.3.1) six (1.4.1) speech (0.5.2) statsmodels (0.5.0) sympy (0.7.3) tables (3.0.0) tornado (3.1.1) virtualenv (1.10.1) xlrd (0.9.2)

jtratner commented 10 years ago

I'm pretty sure this is an issue with openpyxl rather than pandas. It will take some time to come up with a reproducible example for openpyxl. That said, what you could do is upgrade to the newly released 0.13 and install xlsxwriter. That may resolve your issue completely.

jmcnamara commented 10 years ago

I'll take a look at this to see if I can reproduce it.

jmcnamara commented 10 years ago

This issue isn't in the GitHub master/0.13 version of Pandas with openpyxl 1.6.2 or 1.8.0, which suggests that it was probably a Pandas issue that was fixed in the 0.12-0.13 timeframe.

Upgrading Pandas should resolve this issue.

ghost commented 10 years ago

I tested it agin with 0.12 and and using openpyxl 1.6.2. I'm using openoffice rather then excel and on linux, but I still can't reproduce this.

@pythonator, any more relevent info?

jreback commented 10 years ago

closing as fixed

pandas-dev / pandas

BUG: Corrupted excel output when zero in top left of DataFrame #5865