tchwork / utf8

Portable and performant UTF-8, Unicode and Grapheme Clusters for PHP
Apache License 2.0
627 stars 50 forks source link

Fixed ucwords to be functionally the same as in-built PHP version #39

Closed ghost closed 9 years ago

nicolas-grekas commented 9 years ago

As you can see, this breaks the tests with TurkishUtf8. ucfirst and lcfirst need no patch at all, they both have the behavior you expected them to have. For ucwords, "space" is not the only character that should trigger capitalization. You need a regexp to catch all first chars of "words". I suggest /\b(.)/ and applying mb_convert_case on the match

ghost commented 9 years ago

My bad, I didn't run the TurkishUtf8Test locally. Good point on the word boundaries. I've updated ucwords to use the regex you provided. The unit tests all pass locally now.

The reason I changed ucfirst and lcfirst is that previously they used ucwords and I wanted to ucwords to use ucfirst. My latest commits doesn't require that anymore, so I've changed them back. However, ucfirst and lcfirst both used non multibyte functions (substr and strlen). Is that correct?

One other tangential thing, in the ucfirst and lcfirst functions, you use iconv_substr. This function is about 60% slower than mb_substr. Can they be changed over?

nicolas-grekas commented 9 years ago

Yes, using strlen+substr is on purpose, because in this case, we don't need multi-byte handlig but for the first utf-8 char. Are you sure about iconv_strlen being faster than mb_strlen? I bet this is not true on every platform (Windows, Linux+libc, other). Do you have any facts?

nicolas-grekas commented 9 years ago

Thanks you @martinkiva