martijn / xsv

High performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't
https://storck.io/posts/announcing-xsv-1-0-0/
MIT License
194 stars 20 forks source link

Fix: Access sheets with UTF-8 characters in the name #51

Closed sebikeller closed 1 year ago

sebikeller commented 1 year ago

Calling Workbook::sheets_by_name with a name containing an UTF-8 string, will now return the correct sheet.

Ruby stores string with UTF-8 encoding per default. (Unless specified otherwise by using # encoding: <ENC> or String.new )

"ascii".encoding  # => #<Encoding:UTF-8>
"☼".encoding      # => #<Encoding:UTF-8>

The io.sysread(...) method call in the sax_parser.rb used to gather the data from the xml within the zip file returns a binary string.

"ascii".force_encoding(Encoding::BINARY).encoding        # => #<Encoding:ASCII-8BIT>
"ascii".force_encoding(Encoding::BINARY) == "ascii"      # => true
"utf-8 ☼".force_encoding(Encoding::BINARY) == "utf-8 ☼"  # => false

This is no problem for comparison as long as the bytes of the strings match. But introducing non ASCII-Characters, the string comparison is not matching anymore.

Assuming a sheet is named Geschäftsabschluss, then the following code would return an empty array without this change:

# before merging this pull request
workbook = Xsv.open("file.xlsx")                                                # => #<Xsv::Workbook sheets=1 trim_empty_rows=false>
workbook.sheets_by_name("Geschäftsabschluss")                                   # => []
workbook.sheets_by_name("Geschäftsabschluss".force_encoding(Encoding::BINARY))  # => [#<Xsv::Sheet mode=array>]

Forcing the sheet names to UTF-8 is a good enough fix from my perspective, since the XML is already stored in UTF-8.

It is how ever a breaking change. As the code above would return different results after merge:

# after merging this pull request
workbook = Xsv.open("file.xlsx")                                                # => #<Xsv::Workbook sheets=1 trim_empty_rows=false>
workbook.sheets_by_name("Geschäftsabschluss")                                   # => [#<Xsv::Sheet mode=array>]
workbook.sheets_by_name("Geschäftsabschluss".force_encoding(Encoding::BINARY))  # => []
sebikeller commented 1 year ago

To keep the change working with older code workbook.rb could be changed too:

  # Returns an array of sheets for the case of same name sheets.
  # @param [String] name
  # @return [Array<Xsv::Sheet>]
  def sheets_by_name(name)
+    name = name.dup.force_encoding(Encoding::UTF_8) if name.is_a?(String) and name.encoding == Encoding::ASCII_8BIT
     @sheets.select { |s| s.name == name }
  end

Maybe also just that place could be changed to convert to Encoding::BINARY if the string is not yet encoded in binary to fix this function call to work with unicode strings. But i find, that it is more natural that the sheet names are UTF-8 strings, not binary strings.

martijn commented 1 year ago

Leaking binary strings here was definitely an oversight. Thanks for your contribution and detailed pull request! I will release a new gem version with this change soon.

sebikeller commented 1 year ago

Glad to hear that 😄 Did you already decide on how to handle the possible breaking change?

martijn commented 1 year ago

Well, the previous behaviour (that required a workaround to access sheets with UTF-8 characters by name) was definitely incorrect and existing tests (that tested with 7-bit characters only) weren't broken so I think the breakage is acceptable here.

I'd rather not carry the suggested backward-compatible change in Xsv forever as this piece of the code is already pretty messy.

sebikeller commented 1 year ago

Thank you for your answer. I can follow and understand your reasoning.

I only commented about the code for a compatibility solution and did not include it in the PR directly, since the current way to access sheets with non ascii names seems incorrect to me too.