jmcnamara / XlsxWriter

A Python module for creating Excel XLSX files.
https://xlsxwriter.readthedocs.io
BSD 2-Clause "Simplified" License
3.62k stars 631 forks source link

Issue writing Unicode noncharacters #428

Closed zweger closed 7 years ago

zweger commented 7 years ago

Writing the Unicode noncharacters U+FFFE/U+FFFF creates an invalid Excel file. Excel prompts: We found a problem with some content in 'file.xlsx'. Do you want to try to recover as much as we can? If you trust the source of this workbook, click Yes.

W3C's XML Recommendation excludes U+FFFE, U+FFFF from the valid character range. I believe the other excluded characters are already escaped by xlsxwriter, or not possible (as is the case for the Unicode surrogates U+D800-U+DFFF).

import xlsxwriter

workbook = xlsxwriter.Workbook('unicode_noncharacters.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write_rich_string(0, 0, "\ufffe")
worksheet.write_rich_string(1, 0, "\uffff")
workbook.close()

# Separate code path
workbook = xlsxwriter.Workbook('unicode_noncharacters_optimized.xlsx', 
                               {'constant_memory': True})
worksheet = workbook.add_worksheet()
worksheet.write_rich_string(0, 0, "\ufffe")
worksheet.write_rich_string(1, 0, "\uffff")
workbook.close()

I believe the fix is as simple as adding \uFFFE\uFFFF to the character classes in sharedstrings.py:95 and worksheet.py:5219.

jmcnamara commented 7 years ago

Thanks for the report.

Can those characters be entered in Excel? As far as I can see they can't. Which means I can't generate a test file and which in turn means that, more of less, I can't fix this.

If they can then attach a file generated by Excel and I'll replicate its handling of those characters.

Otherwise, this is probably a won't fix.

zweger commented 7 years ago

Yes, these characters can be entered into Excel by e.g. copy-pasting them from your browser into Excel. U+FFFE: ￾ U+FFFF: ￿

Excel encodes these character identically to the way xlsxwriter does with the above mentioned fix, e.g. _xFFFE_.

Regardless of Excel's support for these Unicode non characters, it is a violation of xlsxwriter's write_string API to generate an invalid Excel file if the string happens to contain U+FFFF or U+FFFE.

jmcnamara commented 7 years ago

Fixed on master. Thanks for the report.

jbfuzier commented 6 years ago

Hi,

I think it may be the same issue with u'\udda9', xlsxwriter allows it to be written, howerver Excel complains with the error message stated above.

jmcnamara commented 6 years ago

@jbfuzier Same question as above. Can that character be entered manually in Excel?

jmcnamara commented 6 years ago

@jbfuzier Also, Python3 complains about that character:

import xlsxwriter

workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()

worksheet.write(0, 0, u"\udda9")

workbook.close()

Error:

UnicodeEncodeError: 'utf-8' codec can't encode character
      '\udda9' in position 10: surrogates not allowed

Python 2 doesn't complain but it doesn't write it correctly to the xml file either.

jbfuzier commented 6 years ago

Thanks,

From what I have found, the issue is more on python side, it accept this character however il will never be able to produce a valid utf-8 sting with it.

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

Source : https://en.wikipedia.org/wiki/UTF-8

Have I misunderstood? I think that Python is being too strict about rejecting surrogate code points.

No, it is being too lax about allowing them at all.

I believe there is an issue on the tracker (maybe closed) about the doc for unicode escapes in string literals. Perhaps is should say more clearly that inserting surrogates is allowed but results in an invalid string that cannot be normally encoded.

-- Terry Jan Reedy From : https://mail.python.org/pipermail/python-list/2013-October/657270.html