scraperwiki / spreadsheet-download-tool

A ScraperWiki plugin for downloading data from a box as a CSV or Excel spreadsheet
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Corrupt .xlsx files generated #67

Closed frabcus closed 10 years ago

frabcus commented 10 years ago

A customer reported error "Excel found unreadable content" when opening in Excel 2010. This is for Twitter search data.

@IanHopkinson could reproduce on his computer.

frabcus commented 10 years ago

Intercom.io thread: https://www.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/messages/1041591/message_threads/1322920

IanHopkinson commented 10 years ago

This is the error/fix message I get from Excel 2013 on trying to open the file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error090440_01.xml</logFileName><summary>Errors were detected in file 'C:\Users\Ian\Desktop\all_tables.xlsx'</summary><removedRecords summary="Following is a list of removed records:"><removedRecord>Removed Records: Formula from /xl/worksheets/sheet1.xml part</removedRecord></removedRecords></recoveryLog>
frabcus commented 10 years ago

I think this is things like ^C in the file.

frabcus commented 10 years ago

We think this is a bug in PyExcelerate.

frabcus commented 10 years ago

Link to customer: https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/394619123 To dataset: https://scraperwiki.com/dataset/wkw8onr/view/9hlpcjb

President Clinton has ^C characters in his byline!!!

scraperdragon commented 10 years ago

We're using a custom version of PyExcelerate folded into the code. sheetStream - https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/Writer.py#L54 - is rendering the faulty XML. env and FileSystemLoader appear to be parts of Jinja2, a templating language. The offending template is probably https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/templates/xl/worksheets/sheet.xml which has a {{ cell }} which contains all the text of the cell: i.e.

<c r="A3"><v>6</v></c>

So we need to find this cell function.

scraperdragon commented 10 years ago

... seriously, WTF? - you can't encode non-tab-CR-LF characters in XML? At all? And Microsoft works around this with some hokey _x0003_ string substitution? http://stackoverflow.com/questions/4513672/python-escaping-non-ascii-characters-in-xml

(╯°□°)╯︵ ┻━┻

... on the plus side, this substitution kinda works, it makes wierd symbols that don't copy-paste properly that look like superscript L's... but we should probably use proper unicode "nope you don't get this symbol" symbols.

scraperdragon commented 10 years ago

So. Literal _x0003_ roundtrips okay in Excel (with an encoding of _x005F_x0003_: i.e. _ + x0003_), but LibreOffice fails to parse it, returning the encoded version verbatim. At least we're in good company with not supporting this properly!

Replacement with U+FFFD REPLACEMENT CHARACTER is probably the right move.

scraperdragon commented 10 years ago

The offending line: https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/Worksheet.py#L126 - escape only escapes &, <, and >. Writing replacement function.