pythonicrubyist / creek

Ruby library for parsing large Excel files.
http://rubygems.org/gems/creek
MIT License
386 stars 109 forks

Character sequence "_xhtml_" is replaced with NUL byte #88

Closed by sys-64738 4 years ago

sys-64738 commented 4 years ago

In #73, a change was made to recognize special hex escape sequences of the form `_xHHHH_` (four hex digits), e.g. `_x000D_` for carriage return characters.

However, the code that recognizes these character sequences uses the regular expression `HEX_ESCAPE_REGEXP = /_x[0-9A-Za-z]{4}_/`, which matches more than just hex sequences: `[0-9A-Za-z]` accepts any letter, not only A-F.

This was causing a problem for me with a spreadsheet that somewhere contained the (unescaped) string "_xhtml_". creek replaced it with a NUL byte (\0), which was causing errors in my application.
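The substitution can be reproduced outside creek. The following is a minimal sketch, not creek's actual code: `naive_unescape` is a hypothetical helper that assumes the four characters after `_x` are converted with `String#hex`, which returns 0 when the string has no leading hex digits.

```ruby
# Regex as quoted above: [0-9A-Za-z] accepts any letter, not just A-F.
HEX_ESCAPE_REGEXP = /_x[0-9A-Za-z]{4}_/

# Hypothetical helper (an assumption, not creek's implementation):
# replace each match with the character for its 4-character code.
def naive_unescape(text)
  text.gsub(HEX_ESCAPE_REGEXP) do |match|
    # match[2, 4] is the 4 characters between "_x" and the trailing "_".
    # String#hex returns 0 when it finds no leading hex digits,
    # so "html" turns into code point 0, i.e. a NUL character.
    match[2, 4].hex.chr(Encoding::UTF_8)
  end
end

puts naive_unescape("A_x000D_B").inspect
puts naive_unescape("A_xhtml_B").inspect
```

Under these assumptions, the first string comes back with a carriage return and the second with a NUL character in place of `_xhtml_`.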

An example sharedStrings.xml (created manually with LibreOffice) to reproduce the issue:

`<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`

`Test Case NameDescriptionStep InstructionsExpected ResultsAB CA_xhtml_BDA_x005F_x000D_B`

I entered both `_xhtml_` and `_x000D_` manually in LibreOffice. You can see that for the second value it escaped the first underscore (as `_x005F_`), but the `_xhtml_` string was not escaped (because it is not a hex value).

I guess you could fix the issue by matching only `/_x[0-9A-Fa-f]{4}_/`?

Thanks in advance!
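A minimal sketch of how the stricter pattern would behave on the values from the example, assuming the same `String#hex` conversion. `STRICT_HEX_ESCAPE_REGEXP` and `strict_unescape` are hypothetical names for illustration, not creek's API.

```ruby
# Stricter pattern suggested above: hexadecimal digits only.
STRICT_HEX_ESCAPE_REGEXP = /_x[0-9A-Fa-f]{4}_/

# Hypothetical helper (an assumption, not creek's implementation):
# decode each _xHHHH_ escape into the character with that code point.
def strict_unescape(text)
  text.gsub(STRICT_HEX_ESCAPE_REGEXP) do |match|
    # match[2, 4] is the 4 hex digits between "_x" and the trailing "_".
    match[2, 4].hex.chr(Encoding::UTF_8)
  end
end

puts strict_unescape("A_xhtml_B").inspect        # not a hex escape
puts strict_unescape("A_x000D_B").inspect        # real hex escape
puts strict_unescape("A_x005F_x000D_B").inspect  # escaped underscore
```

With this pattern `_xhtml_` is left alone, `_x000D_` still decodes to a carriage return, and `_x005F_x000D_` decodes to the literal text `_x000D_`: `gsub` consumes the `_x005F_` match first, so the remaining `x000D_` no longer has a leading underscore to match.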

pythonicrubyist commented 4 years ago

Fixed.