Open rossgibb opened 7 years ago
Hi Ross
Thanks for the detailed information. I was not aware of this so I am very happy that you have put in the work to investigate this issue.
Encoding and decoding the escape sequences seems to be something that should be part of Apache POI, not Docjure. However, as it isn’t yet, it is now a leaky abstraction manifesting in Docjure.
We should probably to the String encoding and decoding in Docjure (and add a unit test in Docjure verifying that Apache POI is not doing it, so we will know when to remove it again.)
If you are up for that, please submit a pull request.
All the best, Martin
This issue slipped my mind. I have been using the escape function with no trouble for a number of months, and I wanted to contribute it back.
I did some more digging.
Spreadsheets that are in the xls
format do not seem to be affected:
dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
sheet-data [["A1"]]
workbook (create-xls-workbook sheet-name sheet-data)
a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
(set-cell! a1 "foo_x220F_bar")
(.getStringCellValue a1))
"foo_x220F_bar"
The sequence is also case sensitive, this is demonstrates the replacement:
dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
sheet-data [["A1"]]
workbook (create-workbook sheet-name sheet-data)
a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
(set-cell! a1 "foo_x220F_bar")
(.getStringCellValue a1))
"foo∏bar"
The 16-bit hex sequence must use upper case letters or replacement doesn't happen. For example, if using _x220f
:
dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
sheet-data [["A1"]]
workbook (create-workbook sheet-name sheet-data)
a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
(set-cell! a1 "foo_x220f_bar")
(.getStringCellValue a1))
"foo_x220f_bar"
The leading _x
must be lowercase as well:
dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
sheet-data [["A1"]]
workbook (create-workbook sheet-name sheet-data)
a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
(set-cell! a1 "foo_X220F_bar")
(.getStringCellValue a1))
"foo_X220F_bar"
I created this PR: https://github.com/mjul/docjure/pull/65
It adds the escape-cell
function and some unit tests. I didn't see the need to alter any existing behaviour of the docjure code. A user running into this issue should be able to find escape-cell
and the docstring there explains the issue.
For posterity:
There is a known issue with Excel escape sequences of the form
_x[0-9a-fA-F]{4}_
https://bz.apache.org/bugzilla/show_bug.cgi?id=57008
Demonstrated with this:
The
_x1070_
gets replaced by something unprintable:To prevent this the escape sequence
_x005F
must be prepended:The function does the escaping, should be passed a sequence of sequences.