Closed frabcus closed 10 years ago
This is the error/fix message I get from Excel 2013 on trying to open the file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error090440_01.xml</logFileName><summary>Errors were detected in file 'C:\Users\Ian\Desktop\all_tables.xlsx'</summary><removedRecords summary="Following is a list of removed records:"><removedRecord>Removed Records: Formula from /xl/worksheets/sheet1.xml part</removedRecord></removedRecords></recoveryLog>
I think this is things like ^C in the file.
We think this is a bug in PyExcelerate.
Link to customer: https://app.intercom.io/apps/63b0c6d4bb5f0867b6e93b0be9b569fb3a7ab1e3/conversations/394619123 To dataset: https://scraperwiki.com/dataset/wkw8onr/view/9hlpcjb
President Clinton has ^C characters in his byline!!!
We're using a custom version of PyExcelerate folded into the code.
sheetStream
- https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/Writer.py#L54 - is rendering the faulty XML.
env
and FileSystemLoader
appear to be parts of Jinja2, a templating language.
The offending template is probably https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/templates/xl/worksheets/sheet.xml which has a {{ cell }}
which contains all the text of the cell: i.e.
<c r="A3"><v>6</v></c>
So we need to find this cell function.
... seriously, WTF? - you can't encode non-tab-CR-LF characters in XML? At all?
And Microsoft works around this with some hokey _x0003_
string substitution?
http://stackoverflow.com/questions/4513672/python-escaping-non-ascii-characters-in-xml
(╯°□°)╯︵ ┻━┻
... on the plus side, this substitution kinda works, it makes wierd symbols that don't copy-paste properly that look like superscript L's... but we should probably use proper unicode "nope you don't get this symbol" symbols.
So. Literal _x0003_
roundtrips okay in Excel (with an encoding of _x005F_x0003_
: i.e. _
+ x0003_
), but LibreOffice fails to parse it, returning the encoded version verbatim.
At least we're in good company with not supporting this properly!
Replacement with U+FFFD REPLACEMENT CHARACTER is probably the right move.
The offending line: https://github.com/scraperwiki/spreadsheet-download-tool/blob/master/pyexcelerate/Worksheet.py#L126 - escape only escapes &
, <
, and >
. Writing replacement function.
A customer reported error "Excel found unreadable content" when opening in Excel 2010. This is for Twitter search data.
@IanHopkinson could reproduce on his computer.