ruby-i18n / i18n

Internationalization (i18n) library for Ruby
MIT License
986 stars 411 forks source link

Improve TOKENIZER by 23% #668

Closed kbrock closed 1 year ago

kbrock commented 1 year ago

From what I can see, this is done in linear time: 4*O(n) This tokenizer change converts that to something a little quicker: 3*O(n)

Seems that not using a capture group and something other than split would be a big win. Other than that, the changes were meager.

I used https://regex101.com/ (and pcre2) to evaluate the cost of the TOKENIZER. I verified with cruby 3.0.6 (by eyeball - nothing too extensive)

I tried a few changes to the regular expression and the example in the issue. I was able to speed up by 23% with minimal changes to the codebase. Other savings were to be had but request feedback before going that route.

 /(%%\{[^\}]+\}|%\{[^\}]+\})/ =~ '%{{'*9999)+'}'

/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==> 129,990 steps
/(%?%\{[^\}]+\})/            ==> 129,990 steps
/(%%?\{[^\}]+\})/            ==>  99,992 steps (simple savings of 25%) <===
/(%%?\{[^%}{]+\})/           ==>  89,993 steps (limiting variable contents has minimal gains)

There really isn't much room for improvement overall. The null/simple cases seem to speak for themselves:

/x/ =~ '%{{'*9999)+'}'
/x/                          ==>  29,998 steps
/(x)/                        ==>  59,996 steps
/%{x/                        ==>  49,998 steps
/(%%?{x)/                    ==>  89,993 steps

And the plain string that doesn't fair too much worse than the specially crafted string. So this suggests that if there is a vulnerability in the regular expression, it is not expressed by this example. (especially since they all seem to be linear)

/x/ =~ 'abb'*9999+'c'

/x/                          ==>  29,999
/(%%?{x)/                    ==>  59,998
/(%%?\{[^\}]+\})/            ==>  59,998
/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==>  89,997

per https://github.com/ruby-i18n/i18n/issues/667

radar commented 1 year ago

Thank you very much :)