projectfluent / fluent-rs

Rust implementation of Project Fluent
https://projectfluent.org
Apache License 2.0
1.07k stars 96 forks source link

Consider readding unsafe slicing in the parser #82

Closed zbraniecki closed 3 years ago

zbraniecki commented 5 years ago

Thanks for removing the unsafe block here, at least for now. Would you mind filing a follow-up for discussing re-adding it, please? I'd like to look into it a bit more, given that we're likely protected from invalid UTF8 input by the sheer fact of accepting String into FluentResource::try_new.

Originally posted by @stasm in https://github.com/projectfluent/fluent-rs/pull/76

zbraniecki commented 5 years ago

This change would give us ~20% performance win:

Benchmarking parse/"simple": Collecting 100 samples in estimated 5.0453 s (2                                                                            parse/"simple"          time:   [17.874 us 18.020 us 18.193 us]
                        change: [-20.946% -19.166% -17.259%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking parse/"menubar": Collecting 100 samples in estimated 5.6131 s (                                                                            parse/"menubar"         time:   [155.88 us 157.36 us 159.22 us]
                        change: [-21.585% -19.878% -18.093%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Gnuplot not found, disabling plotting
zbraniecki commented 5 years ago

Using latest Rust nightly and the PR that I just put against master, the perf impact is as following:

parse/"simple": -15% parse/"preferences": -12.8% parse/"menubar": -15.3%

zbraniecki commented 5 years ago

@cyplo - on twitter you indicated you might be able to help us with fuzzing [0]. I may file a separate issue for the parser fuzzing prior to 1.0, but this is a particular aspect of it that I'd like to tackle now.

We have a parser that in theory should only cut at known character bounds. In result, we should be able to slice for free using unchecked, and we know that it brings sizable perf wins, but I don't know to verify the risk here.

Is that something you may know how to help us decide on?

[0] https://twitter.com/zbraniecki/status/1096383281929560064

cyplo commented 5 years ago

Hi, sure thing @zbraniecki ! Would it be possible to jump on a call to discuss this ? - you can DM me on Twitter

bjorn3 commented 5 years ago

get_slice should be unsafe. The start or end might not be at the start or end of a character, which makes the str invalid. While the parser probably doesn't currently violate that invariant, it could in the future. Making get_slice unsafe makes it clearer to its callers that it has an invariant which must not be violated. This will reduce the likelihood for future changes to violate the invariant.

zbraniecki commented 5 years ago

I honestly struggle to see any possibility of the future development of Fluent Syntax which would lead to us allowing for AST nodes to start/end on anything but start/end of characters.

zbraniecki commented 5 years ago

Moving to 0.8, since it doesn't affect the API

zbraniecki commented 3 years ago

This has been fixed with Slice in AST.

cyplo commented 3 years ago

@zbraniecki let me know if you're still interested in looking into fuzzing this together :) github@cyplo.net is a good starting point or @cyplo on pretty much anywhere