mb_trim() inaccurate $characters default value

ranvis commented 7 months ago

Description

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}...\u{180E}", ?string $encoding = null): string {}

The default values for the parameter $characters of the new mb_trim functions are not accurate. When the very same value as the default is implied to $characters like the code below, mb_trim() does not necessarily work the same way, because $characters also depends on $encoding. The parameter should be typed as ?string $characters = null instead.

<?php
$strUtf8 = "\u{3042}\u{3000}"; // U+3000: fullwidth space
var_dump(mb_strlen(mb_trim($strUtf8))); // 1
var_dump(mb_strlen(mb_trim($strUtf8, encoding: 'UTF-8'))); // 1

mb_internal_encoding('Shift_JIS');
$strSjis = mb_convert_encoding($strUtf8, 'Shift_JIS', 'UTF-8');
var_dump(mb_strlen(mb_trim($strSjis))); // 1
var_dump(mb_strlen(mb_trim($strSjis, encoding: 'Shift_JIS'))); // 2

PHP Version

PHP dev-master

Operating System

No response

damianwadley commented 7 months ago

Looks like a bug, but it's not with the default character list - it's because giving characters and not giving characters works differently, and that should not be the case.

Relevant functions are all near each other: https://github.com/php/php-src/blob/4d51bfa2702c0a6d3375f6358b7c8d9611fd72e9/ext/mbstring/mbstring.c#L3058-L3146

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified. Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

These two pairs of outputs should be the same.

$input_utf8 = "\u{3000}abc\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, encoding: "UTF-8"))); // 3

$trimable_utf8 = "\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, $trimable_utf8, "UTF-8"))); // 3

//

$input_sjis = mb_convert_encoding($input_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, encoding: "Shift_JIS"))); // 7

$trimable_sjis = mb_convert_encoding($trimable_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, $trimable_sjis, "Shift_JIS"))); // 3

ranvis commented 7 months ago

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.

Yes.

Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

Yes. The literal array in .c is already in wide (Unicode) form. Not sure if I read correctly. Let me describe some more to make my intention clearer.

If mb_trim is called with one argument from PHP, the C function receives just one parameter. The remaining two are not updated on ZPP; what is null.
If encoding: is specified in PHP but $characters is not, the default value of $characters defined in the stub (mbstring_arginfo.h) is copied to call_args before the internal function invocation, as internal functions cannot support named arguments directly. As a result, the function receives three parameters; what is what is defined as a default value in the stub. This causes the problem. While $encoding is an arbitrary value specified in userland, $characters is taken from the hard-coded stub, which is currently a sequence of UTF-8 bytes.

By changing the signature to ?string $characters = null, what is null on the second case too; the documentation might be able to say like:

When $characters is null, trimmed characters are as follows: U+0020, U+000C, U+000A, U+000D, U+0009, U+000B, U+0000, U+00A0, U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F, U+3000, U+0085 and U+180E.

And it looks like what the code intended. The implementation works fine. However, the stub's default value declaration " \f\n\r\t\v\x00\u{00A0}... causes the problem.

damianwadley commented 7 months ago

Whether the parameter's default value is a string or is null is not the problem. The problem is that this

mb_trim($input, /* $characters is "\x20\x0C...", */ encoding: "Shift_JIS")

and this

mb_trim($input, "\x20\x0C...", "Shift_JIS")

behave differently. That is what should to be fixed. And there are multiple ways it could be fixed. Yes, that could also be a time to decide whether to make the parameter use null as the default.

CC @youkidearitai @alexdowad @nielsdos who were part of #12459, and where I see some brief conversation that touched on the subject of omitting the $characters list.

youkidearitai commented 7 months ago

If change to default parameter to $characters = null, I think need an RFC.

@ranvis What should we do solve this issue? I can't understand "inaccurate" that collect behavior (because my English is poor).

nielsdos commented 7 months ago

The problem is that the stub file is utf8, and so the unicode characters in the default value are encoded as utf8 too. That means when using a different character encoding, we have a mismatch in encoding.

Changing the argument type would indeed fix this. We need to bring it up on the mailing list and go from there.

ranvis commented 7 months ago

Thank you @.nielsdos for describing concisely :)

@youkidearitai Depending on how the function is called, the different internal function is called. See below:

--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3139,8 +3139,10 @@ static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
        }

        if (what) {
+               puts("mb_trim_what_chars()");
                RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
        } else {
+               puts("mb_trim_default_chars()");
                RETURN_STR(mb_trim_default_chars(str, mode, enc));
        }
 }

<?php
echo "single argument: ";
mb_strlen(mb_trim("\u{3000}"));
echo "named argument: ";
mb_strlen(mb_trim("\u{3000}", encoding: 'UTF-8'));

This will print:

single argument: mb_trim_default_chars()
named argument: mb_trim_what_chars()

So, if user call the function using "named argument" without the $characters parameter specified, what is set to "\x20\x0c\x0a\x0d...\xe3\x80\x80\xc2\x85\xe1\xa0\x8e", the UTF-8 encoded string. But mb_trim expects what to be in encoding encoding. If encoding is not UTF-8, the above two calls works differently. I called it inaccuracy for the function doesn't actually use the advertised default value for "single argument" case. Or should I have called it discrepancy? Well, sorry if my subject confused you. I hope this clarifies your question.

nielsdos commented 7 months ago

I've made a PoC PR to fix this with the proposed solution (i.e. making the argument null by default): https://github.com/php/php-src/pull/13820. Maybe there are other, nicer, solutions possible. Note: this is a proof-of-concept, not a commitment to this change. It must be discussed on the mailing list. This is just to show that the proposed solution would work.

youkidearitai commented 7 months ago

Ah, I got it. $character is dependent to UTF-8, then encoding: SJIS is not compatible $character between $encoding, right? In a sense, according to specifications, but it's not intuitive. (And, not compatible mapping UTF-8 between SJIS, for example, \u{00A0} convert to ? (means can not convert))

There is also the problem of #13789, I think posted #13820 by @nielsdos seems to make sense. Thank you post the issue @ranvis , I will post and discussion PHP internals.

ranvis commented 7 months ago

OK. I'll try to keep an eye on it.

Probably I could have focused on making examples.

mb_internal_encoding('Shift_JIS');
$str = mb_convert_encoding('俄には信じ難い?', 'Shift_JIS', 'UTF-8');
var_dump(mb_convert_encoding(mb_trim($str, encoding: 'Shift_JIS'), 'UTF-8'));
// string(19) "には信じ難い?"

youkidearitai commented 6 months ago

Thanks all.

php / php-src