Closed lorban closed 1 week ago
@joakime is it possible to create a String
object containing invalid UTF8? All I've seen in our tests is to create invalid UTF8 in byte arrays then the String
constructor is used to build the string, which does apply some correction. Did I get that right?
I found one other place where we use encode
that could be replaced with getBytes/wrap
.
For future reference, here is the benchmark's report:
Benchmark (locale) Mode Cnt Score Error Units
Utf8Benchmark.testEncode ASCII thrpt 10 1885900.209 ± 12517.384 ops/s
Utf8Benchmark.testEncode:gc.alloc.rate ASCII thrpt 10 1121.696 ± 7.530 MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm ASCII thrpt 10 624.007 ± 0.001 B/op
Utf8Benchmark.testEncode:gc.count ASCII thrpt 10 19.000 counts
Utf8Benchmark.testEncode:gc.time ASCII thrpt 10 23.000 ms
Utf8Benchmark.testEncode FR thrpt 10 1310399.805 ± 12798.866 ops/s
Utf8Benchmark.testEncode:gc.alloc.rate FR thrpt 10 789.489 ± 7.739 MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm FR thrpt 10 632.011 ± 0.001 B/op
Utf8Benchmark.testEncode:gc.count FR thrpt 10 14.000 counts
Utf8Benchmark.testEncode:gc.time FR thrpt 10 18.000 ms
Utf8Benchmark.testEncode JA thrpt 10 814449.918 ± 11152.653 ops/s
Utf8Benchmark.testEncode:gc.alloc.rate JA thrpt 10 2925.414 ± 40.086 MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm JA thrpt 10 3768.017 ± 0.001 B/op
Utf8Benchmark.testEncode:gc.count JA thrpt 10 33.000 counts
Utf8Benchmark.testEncode:gc.time JA thrpt 10 47.000 ms
Utf8Benchmark.testWrapGetBytes ASCII thrpt 10 39417563.752 ± 1256275.047 ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate ASCII thrpt 10 19538.322 ± 623.689 MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm ASCII thrpt 10 520.000 ± 0.001 B/op
Utf8Benchmark.testWrapGetBytes:gc.count ASCII thrpt 10 71.000 counts
Utf8Benchmark.testWrapGetBytes:gc.time ASCII thrpt 10 144.000 ms
Utf8Benchmark.testWrapGetBytes FR thrpt 10 3434889.274 ± 64716.469 ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate FR thrpt 10 4819.736 ± 90.934 MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm FR thrpt 10 1472.004 ± 0.001 B/op
Utf8Benchmark.testWrapGetBytes:gc.count FR thrpt 10 37.000 counts
Utf8Benchmark.testWrapGetBytes:gc.time FR thrpt 10 58.000 ms
Utf8Benchmark.testWrapGetBytes JA thrpt 10 1399081.733 ± 47082.158 ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate JA thrpt 10 3595.402 ± 121.188 MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm JA thrpt 10 2696.010 ± 0.001 B/op
Utf8Benchmark.testWrapGetBytes:gc.count JA thrpt 10 32.000 counts
Utf8Benchmark.testWrapGetBytes:gc.time JA thrpt 10 46.000 ms
@gregw HttpOutput.print()
goes into great length to pool the encoder and to to detect encoding errors like overflows/underflows.
We could theoretically replace all that with a much simpler String.getBytes(Charset)
, which could improve perf but may not work as expected w.r.t encoding. @joakime what's your opinion on that one?
@joakime what's your opinion on that one?
If the behavior of the API to the users is maintained, then I'm in favor of the change.
Is it possible to use HttpOutput.print() with partial code points? (meaning a print() is called which starts the code points, then a subsequent print() results in finishing the code point)
If so, then the String.getBytes(Charset)
wouldn't work for us.
I'm going to give HttpOutput.print()
a try in another PR, as it isn't trivial to change but may work and be worth the effort.
Fixes #12469