tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License

_x000D_ kind of value in string cell should be unescaped #469

Open yorkz1994 opened 3 days ago

yorkz1994 commented 3 days ago

Take this Excel value as an example: the cell value spans multiple lines. Running the code below to print the cell value:

use calamine::{Reader, Xlsx};

fn main() {
    let mut wb: Xlsx<_> = calamine::open_workbook("Book1.xlsx").unwrap();
    let ws = wb.worksheet_range("Sheet1").unwrap();
    let data = ws.get_value((0, 0)).unwrap();
    dbg!(data);
}

Output:

[src/main.rs:7:5] data = String(
    "ABC_x000D_\r\nDEF",
)

Expected output:

[src/main.rs:7:5] data = String(
    "ABC\r\nDEF",
)

The Golang excelize library handles this correctly. Reference, Book1.xlsx

jmcnamara commented 2 days ago

If it helps here is how rust_xlsxwriter encodes these characters in the opposite direction:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/xmlwriter.rs#L204-L248

And here is a test file with each of the characters from 0..127:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx

However, as mentioned in the Reference link, you also need to handle escaped literal strings, which are prefixed by _x005F_. For example, a string stored as _x005F_x0000_ in /xl/sharedStrings.xml would be displayed in Excel as _x0000_.

There is a test file for strings like that here:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings02.xlsx
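
For illustration, the decoding direction described above could be sketched roughly as follows. This is not calamine's or rust_xlsxwriter's code; the function names `unescape_xhhhh` and `decode_escape_at` are hypothetical. The sketch assumes that `_x005F_` needs no special case in a single left-to-right pass, because it simply decodes to `_` and the text after it is then copied through unchanged. It also assumes that hex digits may be upper or lower case, and that an escape which is not a valid `char` (for example a lone surrogate) is left as-is.

```rust
/// Decode `_xHHHH_` escapes in a string taken from the worksheet XML.
/// Hypothetical helper, not part of any library.
fn unescape_xhhhh(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < bytes.len() {
        if let Some(ch) = decode_escape_at(bytes, i) {
            // A full 7-byte ASCII escape was consumed. The decoded char is
            // never re-scanned, so `_x005F_x0000_` yields `_x0000_`.
            out.push(ch);
            i += 7;
        } else {
            // Copy one whole UTF-8 character through unchanged.
            let ch = input[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}

/// Return the decoded char if `bytes[i..]` starts with `_xHHHH_`.
fn decode_escape_at(bytes: &[u8], i: usize) -> Option<char> {
    let esc = bytes.get(i..i + 7)?;
    if esc[0] != b'_' || esc[1] != b'x' || esc[6] != b'_' {
        return None;
    }
    // Check the digits explicitly: `from_str_radix` alone would accept a sign.
    if !esc[2..6].iter().all(u8::is_ascii_hexdigit) {
        return None;
    }
    let hex = std::str::from_utf8(&esc[2..6]).ok()?;
    let code = u32::from_str_radix(hex, 16).ok()?;
    // Fails for surrogate code points, in which case the text stays literal.
    char::from_u32(code)
}
```

With this sketch, `unescape_xhhhh("_x005F_x0000_")` gives `_x0000_` and `unescape_xhhhh("ABC_x000D_\nDEF")` gives `"ABC\r\nDEF"`. It does not normalize raw line endings, so the string from this issue, which keeps a raw `\r\n` after `_x000D_`, would come out as `"ABC\r\r\nDEF"` unless the `\r\n` is normalized to `\n` first.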

yorkz1994 commented 1 day ago

@jmcnamara

Thanks, this information is very useful. I checked the code, and it seems that only _x00HH_ literals are escaped. If other valid _xHHHH_ literals are skipped, then when doing read, excel will not treat them as literal anymore. For example, take *_x597D_*: if you don't escape it, then when the file is read back into Excel we get *好*, but we expect *_x597D_* back.
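
To make the point concrete, here is a minimal sketch of the write direction that guards every literal `_xHHHH_` pattern, not only `_x00HH_`. It is not the rust_xlsxwriter implementation; `escape_xhhhh` and `looks_like_escape` are hypothetical names. The set of control characters that get escaped (everything below 0x20 except tab and newline) is an assumption, and XML escaping of `&`, `<` and `>` is out of scope here.

```rust
/// Escape a string for storage so that a reader which decodes `_xHHHH_`
/// gets the original text back. Hypothetical helper, not a library API.
fn escape_xhhhh(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    for (i, ch) in input.char_indices() {
        if ch == '_' && looks_like_escape(&bytes[i..]) {
            // Replace the leading underscore of a literal `_xHHHH_` with
            // `_x005F_`; the rest (`xHHHH_`) is copied by later iterations.
            out.push_str("_x005F_");
        } else if (ch as u32) < 0x20 && !matches!(ch, '\t' | '\n') {
            // Assumption: control characters other than tab and newline
            // are written as `_x00HH_`.
            out.push_str(&format!("_x{:04X}_", ch as u32));
        } else {
            out.push(ch);
        }
    }
    out
}

/// True if `bytes` starts with `_x`, four hex digits, and `_`.
fn looks_like_escape(bytes: &[u8]) -> bool {
    bytes.len() >= 7
        && bytes[0] == b'_'
        && bytes[1] == b'x'
        && bytes[6] == b'_'
        && bytes[2..6].iter().all(u8::is_ascii_hexdigit)
}
```

Under these assumptions, `escape_xhhhh("_x597D_")` gives `_x005F_x597D_`, so the text should come back as the literal `_x597D_` rather than 好. Each underscore is checked on its own, which also covers overlapping patterns such as `_x0041_x0042_`, where the closing underscore of one pattern opens the next.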

jmcnamara commented 1 day ago

> If other valid _xHHHH_ literals are skipped, then when doing read, excel will not treat them as literal anymore.

You are correct. That is a bug in rust_xlsxwriter. :-| Update: fixed.