Make invalid Unicode data raise when encoding through Oj::Rails::Encoder

KJTsanaktsidis commented 5 months ago

This is a potential fix for https://github.com/ohler55/oj/issues/911. Currently, whether or not Oj::Rails::Encoder raises on invalid unicode data depends on the value of ActiveSupport.escape_html_entities_in_json. In order to accurately mimic the behaviour of stock Rails with the stock json gem, it should in fact raise an exception regardless.

I've so far deliberately copied rather than shared functionality that's shared between RailsEsc and RailsXEsc mode, because I wasn't quite sure how to factor the similarities out. We can leave it like this, or I'm happy to take pointers on a way to factor this down better.

I added a testcase for invalid Unicode to the Rails 6 & 7 encoding tests, and also parameterised the existing unicode-related tests to make sure they work correctly with both settings of ActiveSupport.escape_html_entities_in_json.

ohler55 commented 5 months ago

I'll have to spend a little time looking at the changes. It make me uncomfortable that some of the existing tests were removed. Did you run any benchmarks to see what impact on performance the change had?

KJTsanaktsidis commented 5 months ago

I definitely shuffled some tests around but I didn’t mean to remove any. What did I delete? I’ll fix that for sure :)

good call on benchmarking - I’ll put something together today.

ohler55 commented 5 months ago

Maybe it was the moving around part that made it appear as if some tests were removed. I will look more carefully.

ohler55 commented 4 months ago

Were you able to put together a benchmark and fix the clang formatting issues?

KJTsanaktsidis commented 4 months ago

Sorry, I haven’t gotten to that - I do plan to in the next couple of days!

KJTsanaktsidis commented 4 months ago

OK, I've fixed the clang-format problems, and I put together a benchmark: https://gist.github.com/KJTsanaktsidis/f85be084d61aca54f8493ab63fe0707f

Without this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.133k (± 4.0%) i/s -     10.835k in   5.088119s
long_7bit_printable_string with RailsEsc mode
                          1.914k (± 3.3%) i/s -      9.750k in   5.099691s
long_ascii_string with RailsXEsc mode
                        163.294 (± 1.8%) i/s -    832.000 in   5.096756s
long_ascii_string with RailsEsc mode
                        175.841 (± 2.3%) i/s -    884.000 in   5.029591s
long_angle_bracket_string with RailsXEsc mode
                        275.981 (± 2.9%) i/s -      1.400k in   5.077551s
long_angle_bracket_string with RailsEsc mode
                          1.951k (± 3.3%) i/s -      9.945k in   5.103185s
long_utf8_multibyte_string with RailsXEsc mode
                        359.297 (± 2.2%) i/s -      1.800k in   5.012476s
long_utf8_multibyte_string with RailsEsc mode
                        654.515 (± 2.3%) i/s -      3.315k in   5.067395s

With this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.120k (± 3.0%) i/s -     10.761k in   5.080555s
long_7bit_printable_string with RailsEsc mode
                          2.130k (± 3.0%) i/s -     10.812k in   5.081807s
long_ascii_string with RailsXEsc mode
                        169.397 (± 2.4%) i/s -    848.000 in   5.009189s
long_ascii_string with RailsEsc mode
                        179.139 (± 2.2%) i/s -    901.000 in   5.032772s
long_angle_bracket_string with RailsXEsc mode
                        284.548 (± 2.5%) i/s -      1.430k in   5.028596s
long_angle_bracket_string with RailsEsc mode
                          2.119k (± 3.1%) i/s -     10.750k in   5.079096s
long_utf8_multibyte_string with RailsXEsc mode
                        425.998 (± 2.3%) i/s -      2.142k in   5.031145s
long_utf8_multibyte_string with RailsEsc mode
                        426.126 (± 2.3%) i/s -      2.142k in   5.029241s

The only substantial difference is that the "lots of multibyte characters" case with this patch now takes the same amount of time regardless of using RailsEsc mode or RailsXEsc mode, whereas before it was faster in RailsEsc mode (because it wasn't validating any of the characters). But in non-multibyte-heavy workloads it seems the same.

ohler55 commented 4 months ago

Benchmarks look good. I'll start a more detailed review to get this merged.

ohler55 commented 4 months ago

LGTM other than one open question.

ohler55 commented 4 months ago

Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

KJTsanaktsidis commented 4 months ago

Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

Not at all! Thanks for your attention on this.

My best idea is... Rename rails_xss_friendly_size to size_t required_buffer_size_for_escaped_string(const uint8_t str, size_t len, const char cmap, bool *has_hi_out). Make it accept the cmap to use as a parameter, essentially, and pass hi as an explicit out-param rather than smuggling it out via the sign bit Call required_buffer_size_for_escaped_string in case RailsXEsc and case RailsEsc with different cmaps (rails_xss_friendly_chars vs rails_friendly_chars)

Do you want me to open another PR with those changes?

ohler55 / oj

Make invalid Unicode data raise when encoding through Oj::Rails::Encoder #912