jmcnamara / rust_xlsxwriter

A Rust library for creating Excel XLSX files.
https://crates.io/crates/rust_xlsxwriter
Apache License 2.0

Fix escapes for XmlWriter #7

Closed: Fight-For-Food closed this 1 year ago

Fight-For-Food commented 1 year ago

Fix indexing by chars that used byte indexes. This led to a panic whenever a string contained a character larger than 1 byte.

jmcnamara commented 1 year ago

Thanks for that. Nice refactoring too. Merged.

jmcnamara commented 1 year ago

@Fight-For-Food If you don't mind I have a related question to this PR.

I extended the escaping of cell string data to cover some additional XML escaping that Excel does for low-value control characters. For example, '\x00' is escaped to "_x0000_":

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/xmlwriter.rs#L207

However, strings that match the escaped form also need to be escaped, so "_x0000_" in a user string would need to become "_x005F_x0000_" (where 0x5F is "_"). Clearly this would need to be done before the other escapes; otherwise the "_xHHHH_" sequences generated for control characters would themselves be escaped a second time.
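The ordering requirement described above can be illustrated with a dependency-free sketch. This is not the library's code: the function names are hypothetical, the pattern match is hand-written instead of using a regex, and the set of control characters (everything below 0x20 except tab and newline) is an assumption made for illustration.

```rust
use std::borrow::Cow;

/// True if `bytes` begins with a literal `_xHHHH_` sequence (H = hex digit).
fn starts_with_excel_escape(bytes: &[u8]) -> bool {
    bytes.len() >= 7
        && bytes[0] == b'_'
        && bytes[1] == b'x'
        && bytes[2..6].iter().all(|b| b.is_ascii_hexdigit())
        && bytes[6] == b'_'
}

/// Step 1 (hypothetical name): protect literal `_xHHHH_` text by prefixing it
/// with `_x005F`. Matches are non-overlapping, scanned left to right.
fn escape_literal_escapes(s: &str) -> Cow<'_, str> {
    let bytes = s.as_bytes();
    let mut out = String::new();
    let mut last = 0; // start of the text not yet copied into `out`
    let mut i = 0;
    while i < bytes.len() {
        if starts_with_excel_escape(&bytes[i..]) {
            // A match is pure ASCII, so `i` and `i + 7` are char boundaries.
            out.push_str(&s[last..i]);
            out.push_str("_x005F");
            out.push_str(&s[i..i + 7]);
            i += 7;
            last = i;
        } else {
            i += 1;
        }
    }
    if last == 0 {
        // No match was found, so nothing needs to be allocated.
        return Cow::Borrowed(s);
    }
    out.push_str(&s[last..]);
    Cow::Owned(out)
}

/// Step 2 (hypothetical name): encode control characters as `_xHHHH_`.
/// Assumption: tab and newline are left untouched.
fn escape_control_chars(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        if ch < '\x20' && ch != '\t' && ch != '\n' {
            out.push_str(&format!("_x{:04X}_", ch as u32));
        } else {
            out.push(ch);
        }
    }
    out
}

/// Literal escapes first, control characters second.
fn escape_cell_string(s: &str) -> String {
    escape_control_chars(&escape_literal_escapes(s))
}
```

With this order, `escape_cell_string("\0")` yields `_x0000_` and `escape_cell_string("_x0000_")` yields `_x005F_x0000_`. Running the two steps in the opposite order would turn `"\0"` into `_x005F_x0000_`, which is the double escaping the ordering is meant to avoid.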

I'd like to add an additional regex-based escape like this:

// Excel escapes control characters with _xHHHH_ and also escapes any literal
// strings of that type by encoding the leading underscore. So "\0" -> _x0000_
// and "_x0000_" -> _x005F_x0000_.
fn escape_xml_escapes(si_string: &str) -> Cow<str> {
    lazy_static! {
        static ref XML_ESCAPE: Regex = Regex::new(r"(_x[0-9a-fA-F]{4}_)").unwrap();
    }
    XML_ESCAPE.replace_all(si_string, "_x005F$1")
}

However, I'm having difficulty fitting it into the current string escape/conversion code due to the differences between Cow<str> and &str handling. Any suggestions on a workable way to handle this?

jmcnamara commented 1 year ago

You can ignore this question. The rubber duck debugging worked.

This is what I went with:

$ git diff
diff --git a/src/xmlwriter.rs b/src/xmlwriter.rs
index 8cee80c..a074205 100644
--- a/src/xmlwriter.rs
+++ b/src/xmlwriter.rs
@@ -9,6 +9,7 @@ use std::borrow::Cow;
 use std::fs::File;
 use std::io::{BufWriter, Read, Seek, Write};

+use regex::Regex;
 use tempfile::tempfile;

 pub struct XMLWriter {
@@ -145,14 +146,14 @@ impl XMLWriter {
             write!(
                 &mut self.xmlfile,
                 r#"<si><t xml:space="preserve">{}</t></si>"#,
-                escape_si_data(string)
+                escape_si_data(&escape_xml_escapes(string))
             )
             .expect("Couldn't write to file");
         } else {
             write!(
                 &mut self.xmlfile,
                 "<si><t>{}</t></si>",
-                escape_si_data(string)
+                escape_si_data(&escape_xml_escapes(string))
             )
             .expect("Couldn't write to file");
         }
@@ -272,6 +273,16 @@ where
     Cow::Borrowed(original)
 }

+// Excel escapes control characters with _xHHHH_ and also escapes any literal
+// strings of that type by encoding the leading underscore. So "\0" -> _x0000_
+// and "_x0000_" -> _x005F_x0000_.
+fn escape_xml_escapes(si_string: &str) -> Cow<str> {
+    lazy_static! {
+        static ref XML_ESCAPE: Regex = Regex::new(r"(_x[0-9a-fA-F]{4}_)").unwrap();
+    }
+    XML_ESCAPE.replace_all(si_string, "_x005F$1")
+}
+
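
The chaining used in the diff, `escape_si_data(&escape_xml_escapes(string))`, compiles because a `&Cow<str>` deref-coerces to `&str`. A minimal sketch of that pattern follows; the two pass functions are hypothetical stand-ins rather than the library's real escapes.

```rust
use std::borrow::Cow;

// Hypothetical first pass: doubles underscores, borrowing when nothing changes.
fn first_pass(s: &str) -> Cow<'_, str> {
    if s.contains('_') {
        Cow::Owned(s.replace('_', "__"))
    } else {
        Cow::Borrowed(s)
    }
}

// Hypothetical second pass: escapes ampersands, borrowing when nothing changes.
fn second_pass(s: &str) -> Cow<'_, str> {
    if s.contains('&') {
        Cow::Owned(s.replace('&', "&amp;"))
    } else {
        Cow::Borrowed(s)
    }
}

// Chains the passes. `&first` has type `&Cow<str>` and coerces to `&str`.
fn escape_both(s: &str) -> String {
    let first = first_pass(s);
    // The second result may borrow from `first`, so it is converted to an
    // owned String before `first` goes out of scope.
    second_pass(&first).into_owned()
}
```

In the diff the chained result is consumed inside a single `write!` statement, so the intermediate `Cow` lives long enough without an explicit `into_owned()`.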