@LukeShu Thanks for your proposal. I agree to replace the current implementation with this.
What is the plan here? Is it safe to assume you are going to merge this, and that it should therefore be fine to use this patch in Fedora to avoid legal issues?
I didn't review this yet. Don't rush this.
Preliminary benchmarking (using your old `generator_benchmark.rb`) is showing that my version is faster across the board:
I'm not sure what was up with those large "before" numbers, but now, being more careful about CPU throttling, background processes, and cron jobs, I'm seeing much smaller deltas (generally 1-3µs/op, or a 1-7% improvement; the exception is that `generator2_benchmark.rb`'s ASCII test got a big win). I definitely didn't make things slower.
I am seeing some higher standard deviations though.
For distros still shipping Ruby 2.7 packages, I've backported this to ruby-json 2.3.0 (the version bundled with Ruby 2.7.8).
Parabola is currently shipping:
- `ruby2.7 2.7.7-1.parabola1`: https://repo.parabola.nu/other/ruby-libre/ruby-2.7.8-libre1.tar.gz (ruby 2.7.8 patched to use https://github.com/parabola-gnulinuxlibre/ruby-json/releases/tag/ruby-2.7.8-libre1)
- `ruby 3.0.6-1.parabola1`: https://repo.parabola.nu/other/ruby-libre/ruby-3.0.6-libre1.tar.gz (ruby 3.0.6 patched to use https://github.com/parabola-gnulinuxlibre/ruby-json/releases/tag/ruby-3.0.6-libre1, AKA https://github.com/parabola-gnulinuxlibre/ruby-json/releases/tag/v2.7.1-1.parabola1)
- `ruby-json 2.7.1-1.parabola1`: https://github.com/parabola-gnulinuxlibre/ruby-json/archive/6e75be64c896e093075ec99bf94a3f5fc576c283.tar.gz (https://github.com/parabola-gnulinuxlibre/ruby-json/releases/tag/v2.7.1-1.parabola1)

Feel free to grab any of those tarballs or Git tags.
The parser code seems unrelated to the replacement.
> The parser code seems unrelated to the replacement.
As can be seen if you look at it commit-by-commit, there was a small amount of CVTUTF code in `parser.h` that abf962a9f52f00391787de870c490efbd4adfd52 drops: a few `typedef`s and a few `#define`s. The subsequent 8720b460db15e1e144d73bb54d50b290badf87c2 adjusts `parser.h` and `parser.rl` to those being gone. These are fairly trivial adjustments (see the sketch below):

- `uint32_t` instead of the now-gone `UTF32` (the commit message discusses whether it is safe to rely on `uint32_t` having been defined)
- `parser.rl:unescape_unicode()` now defines its own `replacement_char` instead of the now-gone `UNI_REPLACEMENT_CHAR`
- `parser.rl:json_string_unescape()` now says `0xD800` instead of the now-gone `UNI_SUR_HIGH_START`, with a comment explaining where 0xD800 comes from.

Then of course `parser.c` is re-generated from `parser.rl`.
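To illustrate the flavor of those adjustments, here is a minimal sketch (not the actual `parser.rl` code; `is_high_surrogate` is a hypothetical helper) of plain C99 types and literals standing in for CVTUTF's names:

```c
#include <stdint.h>

/* What CVTUTF called UNI_REPLACEMENT_CHAR: U+FFFD REPLACEMENT CHARACTER,
 * emitted in place of invalid input. */
static const uint32_t replacement_char = 0xFFFD;

/* What CVTUTF called UNI_SUR_HIGH_START is just the literal 0xD800:
 * UTF-16 high (leading) surrogates occupy U+D800..U+DBFF. */
static int is_high_surrogate(uint32_t cu)
{
    return cu >= 0xD800 && cu <= 0xDBFF;
}
```

The point is that `uint32_t`, a local constant, and a bare literal with a comment carry the same information that the CVTUTF `typedef`s and `#define`s did.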
I see that there is now a merge conflict in `generator.c`. I will rebase to resolve that tomorrow. I also have an idea for how I can get the stdev of the benchmarks down; I will benchmark that tomorrow and hopefully include it in the new version.
OK, updated. Sorry that took so long.
I'm not sure what changed (gcc, glibc's allocator, or ruby's allocator), but I'm not seeing such increased variance in performance anymore. My idea for getting it down (which I've put on the `lukeshu/no-cvtutf-prealloc` branch) does indeed improve the variance, but at what is, IMO, an unacceptable hit to average performance.
This is using the benchmark summaries generated by https://github.com/flori/json/pull/599.
Thanks for the new implementation.
From 1998 to 2007, the Unicode Consortium maintained a library called CVTUTF. In 2009, CVTUTF was removed from unicode.org, and the Unicode Consortium said that every version of CVTUTF had bugs, and that folks should use the ICU library instead.
CVTUTF was under a custom license that was not Free under the FSF's definition, not Open Source under the OSI's definition, and not GPL-compatible.
`json/ext` uses code taken-from/based-on CVTUTF. This has caused much consternation among folks who care about any of those 3 things.

So, I removed the parts of `json/ext` that are based on CVTUTF, and wrote new replacement code for `json/ext`.

I hope that you'll find my version of `convert_UTF8_to_JSON` to be clearer and more maintainable.

I have not benchmarked it, but I do not expect a significant performance difference. If I had to guess, I'd suspect that my UTF-8 decoder is slightly slower (I use `val & const == const` in an if/else chain, where I think CVTUTF used a `char[256]` lookup table), while my JSON encoder is slightly faster (I suspect that by virtue of being simpler, the compiler is better able to optimize it).

Fixes #277