mjul / docjure

Read and write Office documents from Clojure
MIT License
619 stars 129 forks source link

Excel Escape Sequences #61

Open rossgibb opened 7 years ago

rossgibb commented 7 years ago

For posterity:

There is a known issue with Excel escape sequences of the form _x[0-9a-fA-F]{4}_

https://bz.apache.org/bugzilla/show_bug.cgi?id=57008

Demonstrated with this:

user> (use 'dk.ative.docjure.spreadsheet)

user> (let [wb (create-workbook "Test"
                                [["foo_x1070_bar"]])]
        (save-workbook! "/tmp/spreadsheet.xlsx" wb))

The _x1070_ gets replaced by something unprintable:

image

To prevent this the escape sequence _x005F must be prepended:

user> (let [wb (create-workbook "Test"
                                [["foo_x005F_x1070_bar"]])]
        (save-workbook! "/tmp/spreadsheet.xlsx" wb))
image

The function does the escaping, should be passed a sequence of sequences.

user> (defn spreadsheet-escape [data]
        (for [row data]
          (map #(if (string? %)
                  (clojure.string/replace
                   %
                   #"(?i)(_x[0-9a-f]{4}_)"
                   "_x005F$1")
                  %)
               row)))
#'user/spreadsheet-escape
user> (spreadsheet-escape [["foo_x1070_bar"]])
(("foo_x005F_x1070_bar"))
mjul commented 7 years ago

Hi Ross

Thanks for the detailed information. I was not aware of this so I am very happy that you have put in the work to investigate this issue.

Encoding and decoding the escape sequences seems to be something that should be part of Apache POI, not Docjure. However, as it isn’t yet, it is now a leaky abstraction manifesting in Docjure.

We should probably to the String encoding and decoding in Docjure (and add a unit test in Docjure verifying that Apache POI is not doing it, so we will know when to remove it again.)

If you are up for that, please submit a pull request.

All the best, Martin

rossgibb commented 7 years ago

This issue slipped my mind. I have been using the escape function with no trouble for a number of months, and I wanted to contribute it back.

I did some more digging.

Spreadsheets that are in the xls format do not seem to be affected:

dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
                                     sheet-data [["A1"]]
                                     workbook (create-xls-workbook sheet-name sheet-data)
                                         a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]

                                     (set-cell! a1 "foo_x220F_bar")
                                     (.getStringCellValue a1))

"foo_x220F_bar"

The sequence is also case sensitive, this is demonstrates the replacement:

dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
                                     sheet-data [["A1"]]
                                     workbook (create-workbook sheet-name sheet-data)
                                         a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
                                     (set-cell! a1 "foo_x220F_bar")
                                     (.getStringCellValue a1))
"foo∏bar"

The 16-bit hex sequence must use upper case letters or replacement doesn't happen. For example, if using _x220f:

dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
                                     sheet-data [["A1"]]
                                     workbook (create-workbook sheet-name sheet-data)
                                         a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
                                     (set-cell! a1 "foo_x220f_bar")
                                     (.getStringCellValue a1))
"foo_x220f_bar"

The leading _x must be lowercase as well:

dk.ative.docjure.spreadsheet-test> (let [sheet-name "Sheet 1"
                                     sheet-data [["A1"]]
                                     workbook (create-workbook sheet-name sheet-data)
                                         a1 (-> workbook (.getSheetAt 0) (.getRow 0) (.getCell 0))]
                                     (set-cell! a1 "foo_X220F_bar")
                                     (.getStringCellValue a1))
"foo_X220F_bar"

I created this PR: https://github.com/mjul/docjure/pull/65

It adds the escape-cell function and some unit tests. I didn't see the need to alter any existing behaviour of the docjure code. A user running into this issue should be able to find escape-cell and the docstring there explains the issue.