Closed zweger closed 7 years ago
Thanks for the report.
Can those characters be entered in Excel? As far as I can see they can't. Which means I can't generate a test file and which in turn means that, more of less, I can't fix this.
If they can then attach a file generated by Excel and I'll replicate its handling of those characters.
Otherwise, this is probably a won't fix
.
Yes, these characters can be entered into Excel by e.g. copy-pasting them from your browser into Excel. U+FFFE: U+FFFF:
Excel encodes these character identically to the way xlsxwriter
does with the above mentioned fix, e.g. _xFFFE_
.
Regardless of Excel's support for these Unicode non characters, it is a violation of xlsxwriter
's write_string
API to generate an invalid Excel file if the string happens to contain U+FFFF or U+FFFE.
Fixed on master. Thanks for the report.
Hi,
I think it may be the same issue with u'\udda9', xlsxwriter allows it to be written, howerver Excel complains with the error message stated above.
@jbfuzier Same question as above. Can that character be entered manually in Excel?
@jbfuzier Also, Python3 complains about that character:
import xlsxwriter
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, u"\udda9")
workbook.close()
Error:
UnicodeEncodeError: 'utf-8' codec can't encode character
'\udda9' in position 10: surrogates not allowed
Python 2 doesn't complain but it doesn't write it correctly to the xml file either.
Thanks,
From what I have found, the issue is more on python side, it accept this character however il will never be able to produce a valid utf-8 sting with it.
Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.
Source : https://en.wikipedia.org/wiki/UTF-8
Have I misunderstood? I think that Python is being too strict about rejecting surrogate code points.
No, it is being too lax about allowing them at all.
I believe there is an issue on the tracker (maybe closed) about the doc for unicode escapes in string literals. Perhaps is should say more clearly that inserting surrogates is allowed but results in an invalid string that cannot be normally encoded.
-- Terry Jan Reedy From : https://mail.python.org/pipermail/python-list/2013-October/657270.html
Writing the Unicode noncharacters U+FFFE/U+FFFF creates an invalid Excel file. Excel prompts:
We found a problem with some content in 'file.xlsx'. Do you want to try to recover as much as we can? If you trust the source of this workbook, click Yes.
W3C's XML Recommendation excludes U+FFFE, U+FFFF from the valid character range. I believe the other excluded characters are already escaped by
xlsxwriter
, or not possible (as is the case for the Unicode surrogates U+D800-U+DFFF).I believe the fix is as simple as adding
\uFFFE\uFFFF
to the character classes insharedstrings.py:95
andworksheet.py:5219
.