randym / axlsx

xlsx generation with charts, images, automated column width, customizable styles and full schema validation. Axlsx excels at helping you generate beautiful Office Open XML Spreadsheet documents without having to understand the entire ECMA specification. Check out the README for some examples of how easy it is. Best of all, you can validate your xlsx file before serialization so you know for sure that anything generated is going to load on your client's machine.
MIT License
2.62k stars 695 forks source link

Low-order non-printable ASCII characters in worksheet name should be escaped or removed #654

Open khiav223577 opened 4 years ago

khiav223577 commented 4 years ago

The worksheet name was escaped as HTML since https://github.com/randym/axlsx/issues/112. But It seems to be not working with some special characters, "\b" character for example. I think it should be escaped as XML.

See: https://support.microsoft.com/en-us/help/315580/prb-error-message-when-an-xml-document-contains-low-order-ascii-charac

The following are the character ranges for low-order non-printable ASCII characters that are rejected by MSXML versions 3.0 and later:

x0 - #x8 (ASCII 0 - 8)

xB - #xC (ASCII 11 - 12)

xE - #x1F (ASCII 14 - 31)

Steps to reproduce

  1. Export an excel file by aslsx.
    
    require 'axlsx'

Axlsx::Package.new do |p| p.workbook.add_worksheet(:name => "My sheet\b") do |sheet| sheet.add_row ["Data1", "Data2"] end p.serialize('simple1.xlsx') end


2. Double click `simple1.xlsx` to open it.

It will say that the excel is corrupted.
![2020-08-24_154115](https://user-images.githubusercontent.com/4011729/91017145-68370680-e620-11ea-97b9-3865bf30c9ab.png)

### Possible fixes

We can use `Builder::XChar.encode` to escape the worksheet name.

```rb
require 'axlsx'

Axlsx::Package.new do |p|
  p.workbook.add_worksheet(:name => Builder::XChar.encode("My sheet\b")) do |sheet|
    sheet.add_row ["Data1", "Data2"]
  end
  p.serialize('simple2.xlsx')
end

The produced excel file simple2.xlsx works fine. 2020-08-24_154436

noniq commented 4 years ago

Does it make sense to retain such (escaped) characters in sheet names? In cells for example they are removed, see Cell#clean_value which calls Axlsx.sanitize which removes control characters.

khiav223577 commented 4 years ago

@noniq I'm not sure which is better, removing or escaping non-printable ASCII characters in sheet names. If it is undocumented or is undefined behavior, it's best to be removed. if it is a valid name, it make sense to retain it.

noniq commented 4 years ago

My gut feeling is that non-printable characters should not be allowed to be part of a sheet name. Are there any references to what constitutes a valid sheet name in the Excel docs?

khiav223577 commented 4 years ago

I didn't find any references about what to do with non-printable characters :thinking:. Either removing or escaping is okay to me.

noniq commented 4 years ago

Would you like to prepare a pull request? axlsx is currently unmaintained though, so please create the pull request in the https://github.com/caxlsx/caxlsx repository (that's the actively maintained community fork).