php / php-src

The PHP Interpreter
https://www.php.net
Other
37.99k stars 7.73k forks source link

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

Closed ThanasisMpalatsoukas closed 1 year ago

ThanasisMpalatsoukas commented 2 years ago

Description

The following code:

<?php
echo mb_convert_case("ΚΑΛΗΣΠΕΡΑ ΣΑΣ", MB_CASE_TITLE, "UTF-8"); 

Resulted in this output:

Καλησπερα Σασ

But I expected this output instead:

Καλησπερα Σας

PHP Version

7.4.3

Operating System

ubuntu 20.4

cmb69 commented 2 years ago

Confirmed: https://3v4l.org/eYEWE. Looks like MBString doesn't cater to trailing Σs.

Note that PHP 7.4 is no longer actively maintained, so a fix can target PHP-8.0 at most.

cmb69 commented 2 years ago

Well, it seems that none of the conditional mappings are implemented in MBString. As such, this is more like a feature request. However, I'm not sure whether MBString should implement these conditional mappings at all; the language-sensitive mappings would require some way to specify the language, and while mb_language() could be used for that, it is currently only used for email messages, and as such could introduce subtle BC breaks. The language-insensitive mappings would require some kind of lookahead, and as such might not fit well to mbfl. @alexdowad, thoughts?

nikic commented 2 years ago

Ideally mbstring would implement final sigma handling, I didn't do that at the time for implementation complexity reasons.

alexdowad commented 2 years ago

I'd like to apologize that I have taken a long time to investigate this issue. Just looking at it now...

alexdowad commented 2 years ago

Hmm, so according to the Unicode 'special casing' table, the only languages which have unique casing requirements are Lithuanian, Turkish, and Azeri. For U+03A3, there is no language condition mentioned... the condition for special casing is just Final_Sigma.

I wonder how many Lithuanian, Turkish, and Azeri PHP users there are? PHP is a popular programming language, so there are almost certainly a few.

Because the "final sigma" rule is not indicated as being language-dependent in the special casing table, should we make it apply regardless of language or locale?

Also, it looks like the "final sigma" rule only impacts the lowercase conversion code. (U+03A3 maps to itself for title and upper case; it's only lowercase where it changes.)

We are currently moving towards doing all operations "a buffer at a time" rather than "a codepoint at a time". Working in this way saves an enormous amount of overhead and makes everything several times faster. After we convert the upper/lower casing code to follow this pattern, I think lookahead would not be such a big problem. We just need to be careful about the case where sigma appears as the last codepoint in a buffer.

Sounds good?

alexdowad commented 2 years ago

If we have a significant number of e.g. Lithuanian users, I wonder if we might introduce a variant of mb_strtolower called mb_strtolower_lithuanian or something like that.

I think auto-detecting the user's locale and changing behavior based on it is fraught with problems. First, the locale on the computer where software is tested may not be the same as the locale where it runs. Second, for web-based software, the locale on the web server may not be the same as the locale of the users. When you change behavior based on the machine's locale, you are introducing an extra piece of implicit global state, with all the problems which implicit global state presents in general.

Better to let the programmer be explicit about what they want. If we did have mb_strtolower_lithuanian, I don't expect that most mbstring users would ever take notice of it, but Lithuanians who were writing web apps intended for use by other Lithuanians might.

Of course, said Lithuanians could also implement their language's casing rules in user code, if they wanted to.

Maybe mb_strtolower_lithuanian is a terrible idea. But if we are going to implement those language-specific rules, definitely the choice of rules needs to be made explicit. mb_language is explicit, but I don't know if it is the right choice for this purpose. As @cmb69 said, it is already used for another purpose.

I trust @nikic and @cmb69 on whether implementing these features is useful or not, and what the interface should be like. I'm happy to do the actual implementation.

By the way, @nikic and @cmb69, since you are both more experienced C programmers than I, can I ask whether you agree that including "locales" in POSIX/ANSI C was probably a bad idea? It seems to me that localization should probably have been handled at a higher level of the stack than libc.

It would also be appreciated if you guys can review #8257, which will help to keep mbstring moving forward.

nikic commented 2 years ago

Yes, we certainly do not depend on locales for this purpose. In fact, we have made good progress on eliminating locale-dependent functionality, with the most recent step being https://wiki.php.net/rfc/strtolower-ascii.

I think basing this off mbstring.language is viable, though we'd want to make sure that the language can also be passed explicitly to the relevant case-conversion functions, so as to avoid dependence on global state. I don't think that adding per-language functions makes sense.

Supporting Turkish case folding rules at least would probably be of some value, as the way dotted i is handled under default case folding is often perceived as undesirable (creates I with an additional combining mark). I believe we already special-case one Turkish legacy encoding in this regard.

By the way, @nikic and @cmb69, since you are both more experienced C programmers than I, can I ask whether you agree that including "locales" in POSIX/ANSI C was probably a bad idea? It seems to me that localization should probably have been handled at a higher level of the stack than libc.

Yes, locales are easily one of the contenders for "worst design mistake in C", though of course there is a lot of competition in this space ;)

alexdowad commented 2 years ago

I think basing this off mbstring.language is viable, though we'd want to make sure that the language can also be passed explicitly to the relevant case-conversion functions, so as to avoid dependence on global state. I don't think that adding per-language functions makes sense.

OK, cool. Would adding an optional 3rd argument to mb_strtolower et al require an RFC or something?

The languages recognized by mbstring are English, German, Armenian, Japanese, Korean, Russian, Turkish, Ukrainian, and Chinese. Then we have languages called "neutral" and "universal". 🤷‍♂️

So we would need to add Lithuanian and Azeri there?

Supporting Turkish case folding rules at least would probably be of some value, as the way dotted i is handled under default case folding is often perceived as undesirable (creates I with an additional combining mark). I believe we already special-case one Turkish legacy encoding in this regard.

OK. More homework for me to look into that. I guess we would keep the same behavior to avoid a BC break, even if mbstring.language is not Turkish?

Yes, locales are easily one of the contenders for "worst design mistake in C", though of course there is a lot of competition in this space ;)

OK, thanks for confirming my feelings on that issue.

Side note: Have you ever seen the implementation of the POSIX/C locale APIs in musl libc? It's not to be missed.

cmb69 commented 2 years ago

If we are not sure about the language-sensitive mappings, maybe we should just go for the language-insensitive mappings for now.

alexdowad commented 2 years ago

If we are not sure about the language-sensitive mappings, maybe we should just go for the language-insensitive mappings for now.

I'm also cool with that.

Again, so as to move forward with this one, it will be much appreciated if my currently open PR for mbstring can be reviewed. I think it's in the critical path.

Kos-M commented 2 years ago

This affected me also. Quick workarround:

<?php

    /**
     * Splits given text to words seperated by space ,
     * then checks if there is σ instead of ς in last character of each word.
     * Returns corrected text.
     */
    function fixTrailingSigmas($textToCorrect){
        $replacement = "ς";    // What you want to replace the last character with
        $words  = explode(  ' ', $textToCorrect );
        $corrected = $textToCorrect;
        $ans = [];
        foreach ($words as $w ){
            $fixed = $w;
            if ( mb_strpos( mb_substr($w, -1)  , 'σ') === 0 ){
                $fixed =  mb_substr($w, 0, -1).$replacement;
            } 
            array_push( $ans , $fixed);
        }
        $corrected = implode( ' ', $ans);
        return $corrected;
    }
noraj commented 1 year ago

I'm not sure whether MBString should implement these conditional mappings at all; the language-sensitive mappings would require some way to specify the language

For full case mapping, you have:

I wonder how many Lithuanian, Turkish, and Azeri PHP users there are? PHP is a popular programming language, so there are almost certainly a few.

Even if there isn't any PHP dev in those countries (and I believe there are), international companies like Facebook have users all around the worlds including in those countries.

If we have a significant number of e.g. Lithuanian users, I wonder if we might introduce a variant of mb_strtolower called mb_strtolower_lithuanian or something like that.

In other languages you have either a language independent and a language dependent version of the function or you have only one function that is language dependent. In both case you most always have a locale autodetection and an optional argument where you can specify the locale to override the autodetection. Actually in PHP none of the mb_ string manipulation functions take into consideration the locale so you would have to call locale_get_default before and after each mb_ call that wouldn't be viable, so most mb_ functions would need to be updated to have a locale optional argument added. I don't think adding a variant per language like mb_strtolower_lithuanian is reasonable.