Open andrepereiradasilva opened 7 years ago
I can't say I've benchmarked any of Joomla's code in quite some time, so in places where we can make improvements, I'm all for it.
One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it. There's a part of me that wants to sit down and look at maintained alternatives to replace it, or one day bite the bullet and just do a hard fork (including pulling the tests out of http://phputf8.cvs.sourceforge.net/viewvc/phputf8/utf8/ for the library code).
@andrepereiradasilva I had done some of those performance tests for the current CMS and came to about the same conclusions as you. This is a good step forward in performance and if I am not mistaken, I removed some stringhelper calls where it was sure that no utf-8 was involved.
Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.
It is simple. we would gain a lot of performance by not having to call stringhelper methods which in turn have to find out if mbstring is loaded then call that or phputf8 etc, etc.
One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it.
if everything ok, don't need to change phputf8 library to make a change like this.
just something like this at the start in the StringHelper methods would do it
if (self::is_ascii($str) === true)
{
return php_native_method($str, [...]);
}
Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.
If we can get away with it, cool. https://github.com/symfony/polyfill-mbstring exists so even if we go for "require mbstring" to some extent we can avoid making it a "hard" requirement.
Where to propose such "drastic" change?
@andrepereiradasilva
if everything ok, don't need to change phputf8 library to make a change like this.
just something like this at the start in the StringHelper methods would do it
if (self::is_ascii($str) === true) { return php_native_method($str, [...]); }
Yes, I think that would be ok.
Where to propose such "drastic" change?
Prepare a patch, do some testing to ensure it is suitable for our needs, and open a PR. George and I can handle getting attention from the right people as required.
ok so i solved the issue (some test code i left behind ...) so some results where not correct. Updated them.
@frankmayer please note that for some tests even using native mb_* seems to be better with checking utf8 before
ASCII String
- [11.345 ms]
StringHelper::strtolower($str)
- [0.739 ms]
utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
- [0.726 ms]
utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
- [11.317 ms]
mb_strtolower($str)
- [0.157 ms]
strtolower($str)
UTF-8 String
- [13.100 ms]
StringHelper::strtolower($str)
- [13.813 ms]
utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
- [13.383 ms]
utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
- [12.788 ms]
mb_strtolower($str)
- [0.214 ms]
strtolower($str)
@andrepereiradasilva yes, as long as the overall stringhelper overhead is not too expensive that should be a lot faster. Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?
also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases
$currentLocale = setlocale(LC_CTYPE, 0);
setlocale(LC_CTYPE, 'en_GB');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
setlocale(LC_CTYPE, 'en_GB.UTF-8');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
setlocale(LC_CTYPE, $currentLocale);
Results in:
������� ������ ����� ������ ��, ����������� ���� ������ �����
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
@mbabker thanks. Will take a shot at that, then.
Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?
Both. the code used in there
@andrepereiradasilva
also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases
Not sure if this works in all cases (languages and also methods) correctly, though. To be sure, I have always used mbstrings for international chars.
ok, but it's a lot faster not even using mb_* in those cases for UTF-8 strings. that i guess are most linux installations.
ASCII String
- [11.345 ms]
StringHelper::strtolower($str)
- [0.739 ms]
utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
- [0.726 ms]
utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
- [0.794 ms]
utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);
- [11.317 ms]
mb_strtolower($str)
- [0.157 ms]
strtolower($str)
UTF-8 String
- [13.100 ms]
StringHelper::strtolower($str)
- [13.813 ms]
utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
- [13.383 ms]
utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
- [1.506 ms]
utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);
- [12.788 ms]
mb_strtolower($str)
- [0.214 ms]
strtolower($str)
If this is doable and cross platform it would be great! But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) :smile_cat:
But my point was that it might work with some methods and not work with others. Same for languages. This would be needing a lot of testing to make sure, nothing breaks. But, yes, it would be a great performance improvement.
But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) 😸
right, even if it's not possible in Mac/windows, if linux, as it seems, does this conversion when locale is utf-8 if should be no need to make these mb_* overhead in those systems.
But my point was that it might work with some methods and not work with others. Same for languages. This would be needing a lot of testing to make sure, nothing breaks.
of course
i see what you mean now. the setlocale is not trusty. just noticed in the test above of strtolower the first letter is still an uppercase with the utf-8 locale. so forget it.
We're preparing for replacing phputf8 (see #37). After that replacement, new benchmarks might show, if this optimisation still makes sense.
Steps to reproduce the issue
Was testing the performance of StringHelper calls. For what i understand they are mostly a way to make native php functions utf8 compatible.
But this makes some performance overhead since even pure ascii strings will try to do the utf8 conversion.
So, made some tests with 1000 iterations in 3 scenarios:
utf8_decode($str) === $str
(or evenStringHelper::is_ascii($str) === true
) to decide if it should use the StringHelper method or the php native methodUsed to strings to test:
Results on some methods
strpos tests
strtoupper tests
strrev tests
Remarks
It's clear that the
Utf8CheckBefore
method is much faster in plain ascii strings. And a little slower in strings with utf-8 chars. Since many of the StringHelper usage across the cms (and overall) should be plain ascii shouldn't be better to check if the strign as utf-8 chars for deciding if it should use utf8 methods or not?System information (as much as possible)
String 2.0 PHP 7.0.19 with mbstring extension (without is even slower)
Additional comments
@frankmayer would aprreciate your contribution here too