[RFC] StringHelper: check if string as utf8 chars before doing utf8 conversion?

andrepereiradasilva commented 7 years ago

Steps to reproduce the issue

Was testing the performance of StringHelper calls. For what i understand they are mostly a way to make native php functions utf8 compatible.

But this makes some performance overhead since even pure ascii strings will try to do the utf8 conversion.

So, made some tests with 1000 iterations in 3 scenarios:

Current: using StringHelper method
Utf8CheckBefore: just use utf8_decode($str) === $str (or even StringHelper::is_ascii($str) === true) to decide if it should use the StringHelper method or the php native method
Native: PHP native method (with utf8 issues)

Used to strings to test:

ASCII String: "Without utf-8 chars"
UTF-8 String: "With utf-8 chars 纹身馆简介你好"

Results on some methods

strpos tests

> ASCII String: [Current: 1.308 ms | Utf8CheckBefore:  0.740 ms | Native: 0.140 ms]
> UTF-8 String: [Current: 1.313 ms | Utf8CheckBefore:  2.268 ms | Native: 0.132 ms]

strtoupper tests

> ASCII String [Current: 6.142 ms | Utf8CheckBefore:  0.339 ms | Native: 0.080 ms]
> UTF-8 String: [Current: 6.977 ms | Utf8CheckBefore:  7.769 ms | Native: 0.103 ms]

strrev tests

> ASCII String [Current: 3.864 ms | Utf8CheckBefore:  0.378 ms  | Native: 0.095 ms]
> UTF-8 String: [Current: 4.524 ms | Utf8CheckBefore:  5.275 ms  | Native: 0.111 ms]

Remarks

It's clear that the Utf8CheckBefore method is much faster in plain ascii strings. And a little slower in strings with utf-8 chars. Since many of the StringHelper usage across the cms (and overall) should be plain ascii shouldn't be better to check if the strign as utf-8 chars for deciding if it should use utf8 methods or not?

System information (as much as possible)

String 2.0 PHP 7.0.19 with mbstring extension (without is even slower)

Additional comments

@frankmayer would aprreciate your contribution here too

mbabker commented 7 years ago

I can't say I've benchmarked any of Joomla's code in quite some time, so in places where we can make improvements, I'm all for it.

One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it. There's a part of me that wants to sit down and look at maintained alternatives to replace it, or one day bite the bullet and just do a hard fork (including pulling the tests out of http://phputf8.cvs.sourceforge.net/viewvc/phputf8/utf8/ for the library code).

frankmayer commented 7 years ago

@andrepereiradasilva I had done some of those performance tests for the current CMS and came to about the same conclusions as you. This is a good step forward in performance and if I am not mistaken, I removed some stringhelper calls where it was sure that no utf-8 was involved.

Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.

It is simple. we would gain a lot of performance by not having to call stringhelper methods which in turn have to find out if mbstring is loaded then call that or phputf8 etc, etc.

andrepereiradasilva commented 7 years ago

One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it.

if everything ok, don't need to change phputf8 library to make a change like this.

just something like this at the start in the StringHelper methods would do it

if (self::is_ascii($str) === true)
{
    return php_native_method($str, [...]);
}

mbabker commented 7 years ago

Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.

If we can get away with it, cool. https://github.com/symfony/polyfill-mbstring exists so even if we go for "require mbstring" to some extent we can avoid making it a "hard" requirement.

frankmayer commented 7 years ago

Where to propose such "drastic" change?

frankmayer commented 7 years ago

@andrepereiradasilva

if everything ok, don't need to change phputf8 library to make a change like this.

just something like this at the start in the StringHelper methods would do it

if (self::is_ascii($str) === true) { return php_native_method($str, [...]); }

Yes, I think that would be ok.

mbabker commented 7 years ago

Where to propose such "drastic" change?

Prepare a patch, do some testing to ensure it is suitable for our needs, and open a PR. George and I can handle getting attention from the right people as required.

andrepereiradasilva commented 7 years ago

ok so i solved the issue (some test code i left behind ...) so some results where not correct. Updated them.

andrepereiradasilva commented 7 years ago

@frankmayer please note that for some tests even using native mb_* seems to be better with checking utf8 before

strtolower 1000 tests

ASCII String

[11.345 ms] StringHelper::strtolower($str)

[0.739 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);

[0.726 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);

[11.317 ms] mb_strtolower($str)

[0.157 ms] strtolower($str)

UTF-8 String

[13.100 ms] StringHelper::strtolower($str)

[13.813 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);

[13.383 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);

[12.788 ms] mb_strtolower($str)

[0.214 ms] strtolower($str)

frankmayer commented 7 years ago

@andrepereiradasilva yes, as long as the overall stringhelper overhead is not too expensive that should be a lot faster. Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?

andrepereiradasilva commented 7 years ago

also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases

$currentLocale = setlocale(LC_CTYPE, 0);

setlocale(LC_CTYPE, 'en_GB');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;

setlocale(LC_CTYPE, 'en_GB.UTF-8');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;

setlocale(LC_CTYPE, $currentLocale);

Results in:

������� ������ ����� ������ ��, ����������� ���� ������ �����
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός

frankmayer commented 7 years ago

@mbabker thanks. Will take a shot at that, then.

andrepereiradasilva commented 7 years ago

Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?

Both. the code used in there

frankmayer commented 7 years ago

@andrepereiradasilva

also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases

Not sure if this works in all cases (languages and also methods) correctly, though. To be sure, I have always used mbstrings for international chars.

andrepereiradasilva commented 7 years ago

ok, but it's a lot faster not even using mb_* in those cases for UTF-8 strings. that i guess are most linux installations.

strtolower 1000 tests

ASCII String

[11.345 ms] StringHelper::strtolower($str)

[0.739 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);

[0.726 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);

[0.794 ms] utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);

[11.317 ms] mb_strtolower($str)

[0.157 ms] strtolower($str)

UTF-8 String

[13.100 ms] StringHelper::strtolower($str)

[13.813 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);

[13.383 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);

[1.506 ms] utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);

[12.788 ms] mb_strtolower($str)

[0.214 ms] strtolower($str)

frankmayer commented 7 years ago

If this is doable and cross platform it would be great! But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) :smile_cat:

But my point was that it might work with some methods and not work with others. Same for languages. This would be needing a lot of testing to make sure, nothing breaks. But, yes, it would be a great performance improvement.

andrepereiradasilva commented 7 years ago

But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) 😸

right, even if it's not possible in Mac/windows, if linux, as it seems, does this conversion when locale is utf-8 if should be no need to make these mb_* overhead in those systems.

But my point was that it might work with some methods and not work with others. Same for languages. This would be needing a lot of testing to make sure, nothing breaks.

of course

andrepereiradasilva commented 7 years ago

i see what you mean now. the setlocale is not trusty. just noticed in the test above of strtolower the first letter is still an uppercase with the utf-8 locale. so forget it.

nibra commented 3 years ago

We're preparing for replacing phputf8 (see #37). After that replacement, new benchmarks might show, if this optimisation still makes sense.

joomla-framework / string