projectfluent / fluent-rs

Rust implementation of Project Fluent
https://projectfluent.org
Apache License 2.0
1.09k stars 98 forks source link

Deduplicate unescape_unicode implementation #307

Closed JasperDeSutter closed 7 months ago

JasperDeSutter commented 1 year ago

Shares the implementation of unescape_unicode and unescape_unicode_to_string. I've added a check to the corresponding test to make sure values that don't need unescaping are returned as Cow::Borrowed.

zbraniecki commented 1 year ago

This PR causes significant perf (26%) regression on unescape_to_str as it removes a bunch of carefully planted optimizations:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.24s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [497.02 ns 498.05 ns 499.27 ns]
                        change: [-1.3500% -0.9283% -0.4818%] (p = 0.00 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [396.24 ns 397.15 ns 398.17 ns]
                        change: [+3.1358% +3.6066% +4.0831%] (p = 0.00 < 0.05)
                        Performance has regressed.

resolve_to_str/unescape time:   [501.87 ns 502.93 ns 503.89 ns]
                        change: [+25.846% +26.437% +27.027%] (p = 0.00 < 0.05)
                        Performance has regressed.
JasperDeSutter commented 1 year ago

Thanks for checking the perf impact, I do run the benchmarks but they don't seem too reliable on my laptop. I'm guessing the system load/cpu scaling has more impact than my changes 😄 I've added back the check on line 92, not sure what else could have caused regressions.

zbraniecki commented 1 year ago

Does't help. It's still a ~25% regression. I skimmed through the code and I don't see a clear reason, but can reproduce it very reliably with cargo-criterion.

JasperDeSutter commented 1 year ago

I've optimized the implementation a bit to work away a bounds check from char boundary checking. It does change the approach quite a bit, by changing the string slice instead of holding a start pointer. Sadly the following is not possible without bringing back the bounds check: while let Some(ptr) = input.as_bytes().iter().position(|b| *b == b'\\')

zbraniecki commented 1 year ago

It didn't help. I was able to accomplish what you're proposing here, without perf impact:

//! A set of helper functions for unescaping Fluent unicode escape sequences.
//!
//! # Unicode
//!
//! Fluent supports UTF-8 in all FTL resources, but it also allows
//! unicode sequences to be escaped in [`String
//! Literals`](super::ast::InlineExpression::StringLiteral).
//!
//! Four byte sequences are encoded with `\u` and six byte
//! sqeuences using `\U`.
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\u5bd2 Bar"),
//!     "Foo 寒 Bar"
//! );
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\U01F68A Bar"),
//!     "Foo 🚊 Bar"
//! );
//! ```
//!
//! # Other unescapes
//!
//! This also allows for a char `"` to be present inside an FTL string literal,
//! and for `\` itself to be escaped.
//!
//! ## Example
//!
//! ```
//! use fluent_syntax::unicode::unescape_unicode_to_string;
//!
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\" Bar"),
//!     "Foo \" Bar"
//! );
//! assert_eq!(
//!     unescape_unicode_to_string("Foo \\\\ Bar"),
//!     "Foo \\ Bar"
//! );
//! ```
use std::borrow::Cow;
use std::char;
use std::fmt;

const UNKNOWN_CHAR: char = '�';

fn encode_unicode(s: Option<&str>) -> char {
    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
        .unwrap_or(UNKNOWN_CHAR)
}

fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
where
    W: fmt::Write,
{
    let bytes = input.as_bytes();

    let mut start = 0;
    let mut ptr = 0;

    while let Some(b) = bytes.get(ptr) {
        if b != &b'\\' {
            ptr += 1;
            continue;
        }
        if start != ptr {
            w.write_str(&input[start..ptr])?;
        }

        ptr += 1;

        let new_char = match bytes.get(ptr) {
            Some(b'\\') => '\\',
            Some(b'"') => '"',
            Some(u @ b'u') | Some(u @ b'U') => {
                let seq_start = ptr + 1;
                let len = if u == &b'u' { 4 } else { 6 };
                ptr += len;
                encode_unicode(input.get(seq_start..seq_start + len))
            }
            _ => UNKNOWN_CHAR,
        };
        ptr += 1;
        w.write_char(new_char)?;
        start = ptr;
    }
    if start == 0 {
        return Ok(false);
    }

    if start != ptr {
        w.write_str(&input[start..ptr])?;
    }
    Ok(true)
}

/// Unescapes to a writer without allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode;
///
/// let mut s = String::new();
/// unescape_unicode(&mut s, "Foo \\U01F60A Bar");
/// assert_eq!(s, "Foo 😊 Bar");
/// ```
pub fn unescape_unicode<W>(w: &mut W, input: &str) -> fmt::Result
where
    W: fmt::Write,
{
    if unescape(w, input)? {
        return Ok(());
    }
    w.write_str(input)
}

/// Unescapes to a `Cow<str>` optionally allocating.
///
/// ## Example
///
/// ```
/// use fluent_syntax::unicode::unescape_unicode_to_string;
///
/// assert_eq!(
///     unescape_unicode_to_string("Foo \\U01F60A Bar"),
///     "Foo 😊 Bar"
/// );
/// ```
pub fn unescape_unicode_to_string(input: &str) -> Cow<str> {
    let mut result = String::new();

    let owned = unescape(&mut result, input).expect("String write methods don't Err");

    if owned {
        Cow::Owned(result)
    } else {
        Cow::Borrowed(input)
    }
}
JasperDeSutter commented 1 year ago

Last try on my method! Shouldn't be possible to optimize much further.. I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

zbraniecki commented 1 year ago

Your code vs my snippet:

/Users/zibi/projects/fluent-rs/fluent-bundle〉cargo criterion unescape --features all-benchmarks
   Compiling fluent-syntax v0.11.0 (/Users/zibi/projects/fluent-rs/fluent-syntax)
   Compiling fluent-bundle v0.15.2 (/Users/zibi/projects/fluent-rs/fluent-bundle)
    Finished bench [optimized] target(s) in 2.66s

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 2 filtered out; finished in 0.00s

Gnuplot not found, using plotters backend
construct/unescape      time:   [498.08 ns 499.65 ns 501.28 ns]
                        change: [+0.2029% +0.6591% +1.1052%] (p = 0.01 < 0.05)
                        Change within noise threshold.

resolve/unescape        time:   [384.18 ns 385.82 ns 387.91 ns]
                        change: [-0.3062% +0.1708% +0.6949%] (p = 0.51 > 0.05)
                        No change in performance detected.

resolve_to_str/unescape time:   [484.81 ns 486.70 ns 488.78 ns]
                        change: [+20.621% +21.180% +21.755%] (p = 0.00 < 0.05)
                        Performance has regressed.

Unexpected error while launching valgrind. Error: No such file or directory (os error 2)

What is your goal here? You filed a PR to "deduplicate unescape_unicode implementation" and I provided you with deduplicated implementations that preserve performance.

What else are you trying to accomplish in this PR and can you separate it out?

Here's the diff between your version and mine that causes 25% regression:

diff --git a/fluent-syntax/src/unicode.rs b/fluent-syntax/src/unicode.rs
index 137e17e..33f3c95 100644
--- a/fluent-syntax/src/unicode.rs
+++ b/fluent-syntax/src/unicode.rs
@@ -49,10 +49,8 @@ use std::fmt;

 const UNKNOWN_CHAR: char = '�';

-fn encode_unicode(s: &str) -> char {
-    u32::from_str_radix(s, 16)
-        .ok()
-        .and_then(char::from_u32)
+fn encode_unicode(s: Option<&str>) -> char {
+    s.and_then(|s| u32::from_str_radix(s, 16).ok().and_then(char::from_u32))
         .unwrap_or(UNKNOWN_CHAR)
 }

@@ -77,64 +75,47 @@ where
     w.write_str(input)
 }

-fn find_escape_sequence<'a, W>(
-    input: &'a str,
-    w: &mut W,
-) -> Result<Option<&'a str>, std::fmt::Error>
+fn unescape<W>(w: &mut W, input: &str) -> Result<bool, std::fmt::Error>
 where
     W: fmt::Write,
 {
-    if input.is_empty() {
-        return Ok(None);
-    }
+    let bytes = input.as_bytes();

+    let mut start = 0;
     let mut ptr = 0;
-    while input.as_bytes()[ptr] != b'\\' {
-        ptr += 1;
-        if ptr >= input.len() {
-            return Ok(None);
-        }
-    }

-    if ptr > 0 {
-        let (a, b) = input.split_at(ptr);
-        w.write_str(a)?;
-        return Ok(Some(b));
-    }
-    Ok(Some(input))
-}
-
-fn unescape<W>(w: &mut W, mut input: &str) -> Result<bool, std::fmt::Error>
-where
-    W: fmt::Write,
-{
-    let start_len = input.len();
+    while let Some(b) = bytes.get(ptr) {
+        if b != &b'\\' {
+            ptr += 1;
+            continue;
+        }
+        if start != ptr {
+            w.write_str(&input[start..ptr])?;
+        }

-    while let Some(sequence) = find_escape_sequence(input, w)? {
-        let mut offset = 2;
+        ptr += 1;

-        let new_char = match sequence.as_bytes().get(1) {
+        let new_char = match bytes.get(ptr) {
             Some(b'\\') => '\\',
             Some(b'"') => '"',
             Some(u @ b'u') | Some(u @ b'U') => {
-                let len = if *u == b'u' { 4 } else { 6 };
-                offset += len;
-                sequence.get(2..offset).map_or(UNKNOWN_CHAR, encode_unicode)
+                let seq_start = ptr + 1;
+                let len = if u == &b'u' { 4 } else { 6 };
+                ptr += len;
+                encode_unicode(input.get(seq_start..seq_start + len))
             }
             _ => UNKNOWN_CHAR,
         };
+        ptr += 1;
         w.write_char(new_char)?;
-
-        input = sequence.get(offset..).unwrap_or_default();
+        start = ptr;
     }
-
-    if input.len() == start_len {
-        // nothing was unescaped
+    if start == 0 {
         return Ok(false);
     }

-    if !input.is_empty() {
-        w.write_str(input)?;
+    if start != ptr {
+        w.write_str(&input[start..ptr])?;
     }
     Ok(true)
 }

I'm not able to spot the differences in the code you posted with my second commit, is it just moving the functions?

I don't think it is. I suspect some parts of your logic are falling off the cliff. We can keep debugging, but I'd like to understand what is the goal.

Also, I'm not too sure we're benchmarking correctly with unescape.ftl being very small and not being optimized for the more common case of having no escape sequences.

I strongly encourage you to not mix algo changes with benchmark evaluation. It's tempting to adjust the benchmark to show what you want, but that's precisely against what it's here for. I also happen to disagree with you - the benchmark is testing exactly what we want to test with heavier weight on unescaping because that's costly, while pass-through should be as close to no-op as possible.

To sum it up, I recommend: 1) Take my snippet, update PR, let's land this. 2) Start a new PR where you can fiddle with additional changes and justify them please. 3) If you believe the benchmark is wrong, file an issue/PR and discuss it there.

zbraniecki commented 1 year ago

I kept digging a bit more into it - it's actually a red-herring.

The changes between your code and mine are not impacting perf, but your branch is not up to date with main:

diff --git a/fluent-bundle/src/bundle.rs b/fluent-bundle/src/bundle.rs
index 63b00b2..0b74c22 100644
--- a/fluent-bundle/src/bundle.rs
+++ b/fluent-bundle/src/bundle.rs
@@ -494,7 +494,7 @@ impl<R, M> FluentBundle<R, M> {
     {
         let mut scope = Scope::new(self, args, Some(errors));
         let value = pattern.resolve(&mut scope);
-        value.as_string(&scope)
+        value.into_string(&scope)
     }
     /// Makes the provided rust function available to messages with the name `id`. See

This change is 100% responsible for the resolve_to_str/unescape regression.

Please, rebase on top of main!

I still think it's better to separate changes into smaller PRs so recommendation stands, but now you don't have to worry about your further changes affecting perf anymore!

JasperDeSutter commented 1 year ago

That explains why I wasn't seeing the same performance impact, I was comparing to main without that into_string commit included.. I've removed the optimization commit.

gregtatum commented 1 year ago

This is still on my list to look at, I only had time today to look at some smaller PRs.