Use faster UTF8 encoding in `Content.write()`

lorban commented 1 week ago

Fixes #12469

lorban commented 1 week ago

@joakime is it possible to create a String object containing invalid UTF8? All I've seen in our tests is invalid UTF8 created in byte arrays and then passed to the String constructor, which applies some correction while building the string. Did I get that right?
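
For context on the question: a Java String is a sequence of UTF-16 code units, so it can hold an unpaired surrogate, which has no valid UTF-8 encoding. A minimal sketch of how the two encoding paths treat such a String (the class and method names are made up for illustration, this is not Jetty code):

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jetty.
final class LoneSurrogate
{
    // A String holding an unpaired high surrogate between two ASCII chars.
    static final String INVALID = "a\uD800b";

    // String.getBytes(Charset) always replaces malformed input
    // with the charset's replacement bytes ('?' for UTF-8).
    static byte[] lenient(String s)
    {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A fresh CharsetEncoder reports malformed input by default,
    // so encode(CharBuffer) throws on the unpaired surrogate.
    static boolean strictRejects(String s)
    {
        try
        {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        }
        catch (CharacterCodingException e)
        {
            return true;
        }
    }

    public static void main(String[] args)
    {
        System.out.println("getBytes length: " + lenient(INVALID).length);
        System.out.println("strict encoder rejects: " + strictRejects(INVALID));
    }
}
```

So such a String can exist, and `getBytes` silently substitutes `?` where a strict encoder would fail.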

lorban commented 1 week ago

I found one other place where we use encode that could be replaced with getBytes/wrap.
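
A rough sketch of the two variants being compared, assuming "encode" means `Charset.encode(String)` and "getBytes/wrap" means `ByteBuffer.wrap(String.getBytes(Charset))`. The class and method names are hypothetical and this is not the actual code from the PR:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jetty.
final class EncodeVsWrap
{
    // Variant 1: Charset.encode(), which runs a CharsetEncoder internally.
    static ByteBuffer viaEncode(String s)
    {
        return StandardCharsets.UTF_8.encode(s);
    }

    // Variant 2: String.getBytes(), then wrap the resulting array.
    static ByteBuffer viaWrapGetBytes(String s)
    {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args)
    {
        // ASCII, French and Japanese samples, written with escapes
        // so the source file encoding does not matter.
        String[] samples = {"hello", "d\u00e9j\u00e0 vu", "\u3053\u3093\u306b\u3061\u306f"};
        for (String s : samples)
        {
            // ByteBuffer.equals() compares the remaining bytes of both buffers.
            System.out.println(s.length() + " chars, same bytes: " + viaEncode(s).equals(viaWrapGetBytes(s)));
        }
    }
}
```

Both calls replace malformed input rather than throwing, so for this pair the swap should not change what ends up on the wire.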

lorban commented 1 week ago

For future reference, here is the benchmark's report:

Benchmark                                          (locale)   Mode  Cnt         Score         Error   Units
Utf8Benchmark.testEncode                              ASCII  thrpt   10   1885900.209 ±   12517.384   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                ASCII  thrpt   10      1121.696 ±       7.530  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm           ASCII  thrpt   10       624.007 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                     ASCII  thrpt   10        19.000                counts
Utf8Benchmark.testEncode:gc.time                      ASCII  thrpt   10        23.000                    ms
Utf8Benchmark.testEncode                                 FR  thrpt   10   1310399.805 ±   12798.866   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                   FR  thrpt   10       789.489 ±       7.739  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm              FR  thrpt   10       632.011 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                        FR  thrpt   10        14.000                counts
Utf8Benchmark.testEncode:gc.time                         FR  thrpt   10        18.000                    ms
Utf8Benchmark.testEncode                                 JA  thrpt   10    814449.918 ±   11152.653   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                   JA  thrpt   10      2925.414 ±      40.086  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm              JA  thrpt   10      3768.017 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                        JA  thrpt   10        33.000                counts
Utf8Benchmark.testEncode:gc.time                         JA  thrpt   10        47.000                    ms
Utf8Benchmark.testWrapGetBytes                        ASCII  thrpt   10  39417563.752 ± 1256275.047   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate          ASCII  thrpt   10     19538.322 ±     623.689  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm     ASCII  thrpt   10       520.000 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count               ASCII  thrpt   10        71.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                ASCII  thrpt   10       144.000                    ms
Utf8Benchmark.testWrapGetBytes                           FR  thrpt   10   3434889.274 ±   64716.469   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate             FR  thrpt   10      4819.736 ±      90.934  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm        FR  thrpt   10      1472.004 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count                  FR  thrpt   10        37.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                   FR  thrpt   10        58.000                    ms
Utf8Benchmark.testWrapGetBytes                           JA  thrpt   10   1399081.733 ±   47082.158   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate             JA  thrpt   10      3595.402 ±     121.188  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm        JA  thrpt   10      2696.010 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count                  JA  thrpt   10        32.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                   JA  thrpt   10        46.000                    ms

lorban commented 1 week ago

@gregw HttpOutput.print() goes to great lengths to pool the encoder and to detect encoding errors like overflows/underflows.

We could theoretically replace all that with a much simpler String.getBytes(Charset), which could improve perf but may not behave the same w.r.t. encoding errors. @joakime what's your opinion on that one?
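
To make the trade-off concrete, here is a simplified sketch of the kind of overflow/underflow loop a `CharsetEncoder` requires, next to the one-liner that would replace it. It is not the actual `HttpOutput.print()` code: the class name, method names and the `chunkSize` parameter are made up, and encoder pooling is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jetty.
final class EncoderLoop
{
    // Encodes through a fixed-size buffer, draining it on every overflow.
    // chunkSize must be at least 4 so any UTF-8 sequence fits.
    static byte[] encodeInChunks(String s, int chunkSize)
    {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(chunkSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CoderResult result;
        do
        {
            // OVERFLOW: out is full, drain and retry. UNDERFLOW: input consumed.
            result = encoder.encode(in, out, true);
            drain(out, sink);
        }
        while (result.isOverflow());
        do
        {
            result = encoder.flush(out);
            drain(out, sink);
        }
        while (result.isOverflow());
        return sink.toByteArray();
    }

    private static void drain(ByteBuffer out, ByteArrayOutputStream sink)
    {
        out.flip();
        sink.write(out.array(), 0, out.limit());
        out.clear();
    }

    // The simpler alternative under discussion.
    static byte[] encodeSimple(String s)
    {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With `REPLACE` configured on the encoder, both paths produce the same bytes for a complete String. The difference is that the loop can be adapted to report errors or to keep state between calls, which `getBytes` cannot do.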

joakime commented 1 week ago

> @joakime what's your opinion on that one?

If the behavior of the API to the users is maintained, then I'm in favor of the change. Is it possible to use HttpOutput.print() with partial code points? (meaning one print() call starts a code point and a subsequent print() call finishes it) If so, then String.getBytes(Charset) wouldn't work for us.
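
A small sketch of the concern, using plain Strings rather than `HttpOutput` (whether `print()` actually supports split code points is the open question here). The class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Jetty.
final class SplitCodePoint
{
    // Encodes each piece on its own, as two independent getBytes() calls would.
    static byte[] encodeSeparately(String first, String second)
    {
        byte[] a = first.getBytes(StandardCharsets.UTF_8);
        byte[] b = second.getBytes(StandardCharsets.UTF_8);
        byte[] result = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    // Encodes the two pieces as one String.
    static byte[] encodeTogether(String first, String second)
    {
        return (first + second).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // U+1F600 is the surrogate pair D83D DE00, split here across two Strings.
        String high = "\uD83D";
        String low = "\uDE00";
        // Each half alone is malformed, so each becomes a single '?' byte.
        System.out.println("separately: " + encodeSeparately(high, low).length + " bytes");
        // Joined, the pair encodes to the 4-byte sequence F0 9F 98 80.
        System.out.println("together: " + encodeTogether(high, low).length + " bytes");
    }
}
```

Because `getBytes` is stateless, the two halves can never be recombined once each call has replaced its half with `?`.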

lorban commented 1 week ago

I'm going to give HttpOutput.print() a try in another PR, as it isn't trivial to change but may work and be worth the effort.

jetty / jetty.project

Use faster UTF8 encoding in `Content.write()` #12475