Closed: ranvis closed this 6 months ago
Looks like a bug, but it's not with the default character list. It's because giving characters and not giving characters work differently, and that should not be the case.
Relevant functions are all near each other: https://github.com/php/php-src/blob/4d51bfa2702c0a6d3375f6358b7c8d9611fd72e9/ext/mbstring/mbstring.c#L3058-L3146
Calls mb_trim_what_chars when there is a character list; said characters are enc->to_wchar-ified.
Calls mb_trim_default_chars when there is no character list; default characters are not enc->to_wchar-ified.
These two pairs of outputs should be the same.
$input_utf8 = "\u{3000}abc\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, encoding: "UTF-8"))); // 3
$trimable_utf8 = "\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, $trimable_utf8, "UTF-8"))); // 3
//
$input_sjis = mb_convert_encoding($input_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, encoding: "Shift_JIS"))); // 7
$trimable_sjis = mb_convert_encoding($trimable_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, $trimable_sjis, "Shift_JIS"))); // 3
Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.
Yes.
Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.
Yes. The literal array in .c is already in wide (Unicode) form. Not sure if I read correctly. Let me describe some more to make my intention clearer.
When only the string is passed, what is null.
When encoding: is specified in PHP but $characters is not, the default value of $characters defined in the stub (mbstring_arginfo.h) is copied to call_args before the internal function invocation, as internal functions cannot support named arguments directly. As a result, the function receives three parameters; what is whatever the stub defines as the default value.
This causes the problem. While $encoding is an arbitrary value specified in userland, $characters is taken from the hard-coded stub, which is currently a sequence of UTF-8 bytes.
By changing the signature to ?string $characters = null, what is null in the second case too; the documentation could then say something like:
When $characters is null, trimmed characters are as follows: U+0020, U+000C, U+000A, U+000D, U+0009, U+000B, U+0000, U+00A0, U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F, U+3000, U+0085 and U+180E.
And that looks like what the code intended.
The implementation works fine. However, the stub's default value declaration " \f\n\r\t\v\x00\u{00A0}... causes the problem.
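As a rough userland illustration of what "the default list expressed in $encoding" would mean, the sketch below transcodes a list of Unicode code points into a target encoding. The helper name trim_chars_in_encoding and the three-element subset are hypothetical and chosen only for this example; this is not how mbstring implements its defaults. Code points that do not survive a round trip through the target encoding are skipped, which is an assumption made here so that a substitute character never ends up in the list.

```php
<?php
// Hypothetical helper, not part of mbstring: returns the given Unicode
// code points as one string encoded in $encoding. A code point is kept
// only if converting it to $encoding and back yields the same character.
function trim_chars_in_encoding(array $codePoints, string $encoding): string
{
    $chars = "";
    foreach ($codePoints as $cp) {
        $utf8 = mb_chr($cp, "UTF-8");
        $converted = mb_convert_encoding($utf8, $encoding, "UTF-8");
        $back = mb_convert_encoding($converted, "UTF-8", $encoding);
        if ($back === $utf8) {
            $chars .= $converted;
        }
    }
    return $chars;
}

// A small subset of the list above: space, tab, ideographic space.
$subset = [0x0020, 0x0009, 0x3000];
$sjisChars = trim_chars_in_encoding($subset, "Shift_JIS");
$sjisHex = bin2hex($sjisChars);
echo $sjisHex, "\n"; // 20098140
```

The resulting string could then be passed explicitly as $characters alongside the matching $encoding, so that both arguments agree on the encoding.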
Whether the parameter's default value is a string or is null is not the problem. The problem is that this
mb_trim($input, /* $characters is "\x20\x0C...", */ encoding: "Shift_JIS")
and this
mb_trim($input, "\x20\x0C...", "Shift_JIS")
behave differently. That is what should be fixed. And there are multiple ways it could be fixed.
Yes, that could also be a time to decide whether to make the parameter use null as the default.
CC @youkidearitai @alexdowad @nielsdos who were part of #12459, and where I see some brief conversation that touched on the subject of omitting the $characters list.
If we change the default parameter to $characters = null, I think we need an RFC.
@ranvis What should we do to solve this issue? I can't understand what "inaccurate" means here, or what the correct behavior is (my English is poor).
The problem is that the stub file is utf8, and so the unicode characters in the default value are encoded as utf8 too. That means when using a different character encoding, we have a mismatch in encoding.
Changing the argument type would indeed fix this. We need to bring it up on the mailing list and go from there.
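To make the mismatch concrete, here is a minimal sketch that compares the bytes of one default character, U+3000, as they appear in a UTF-8 source file and after transcoding to Shift_JIS. The variable names are illustrative only and it uses nothing beyond mb_convert_encoding and bin2hex.

```php
<?php
// U+3000 (IDEOGRAPHIC SPACE) as written in a UTF-8 encoded file.
$utf8 = "\u{3000}";
// The same character transcoded to Shift_JIS.
$sjis = mb_convert_encoding($utf8, "Shift_JIS", "UTF-8");

$utf8Hex = bin2hex($utf8);
$sjisHex = bin2hex($sjis);
echo $utf8Hex, "\n"; // e38080
echo $sjisHex, "\n"; // 8140
```

A character list made of the UTF-8 bytes therefore does not contain the Shift_JIS form of the same character, so it cannot match it when the input string is Shift_JIS.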
Thank you @nielsdos for describing it concisely :)
@youkidearitai Depending on how the function is called, a different internal function is called. See below:
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3139,8 +3139,10 @@ static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
}
if (what) {
+ puts("mb_trim_what_chars()");
RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
} else {
+ puts("mb_trim_default_chars()");
RETURN_STR(mb_trim_default_chars(str, mode, enc));
}
}
<?php
echo "single argument: ";
mb_strlen(mb_trim("\u{3000}"));
echo "named argument: ";
mb_strlen(mb_trim("\u{3000}", encoding: 'UTF-8'));
This will print:
single argument: mb_trim_default_chars()
named argument: mb_trim_what_chars()
So, if the user calls the function using a "named argument" without specifying the $characters parameter, what is set to "\x20\x0c\x0a\x0d...\xe3\x80\x80\xc2\x85\xe1\xa0\x8e", the UTF-8 encoded string. But mb_trim expects what to be in the $encoding encoding.
If the encoding is not UTF-8, the above two calls work differently.
I called it an inaccuracy because the function doesn't actually use the advertised default value in the "single argument" case. Or should I have called it a discrepancy?
Well, sorry if my subject confused you. I hope this answers your question.
I've made a PoC PR to fix this with the proposed solution (i.e. making the argument null by default): https://github.com/php/php-src/pull/13820. Maybe there are other, nicer, solutions possible. Note: this is a proof-of-concept, not a commitment to this change. It must be discussed on the mailing list. This is just to show that the proposed solution would work.
Ah, I got it.
The default $characters depends on UTF-8, so with encoding: SJIS the $characters value is not compatible with $encoding, right?
In a sense this is according to the specification, but it's not intuitive. (Also, the mapping between UTF-8 and SJIS is not fully compatible; for example, \u{00A0} converts to ?, meaning it cannot be converted.)
There is also the problem of #13789. I think #13820, posted by @nielsdos, makes sense. Thank you for posting the issue, @ranvis. I will post it for discussion on PHP internals.
OK. I'll try to keep an eye on it.
Probably I could have focused on making examples.
mb_internal_encoding('Shift_JIS');
$str = mb_convert_encoding('俄には信じ難い?', 'Shift_JIS', 'UTF-8');
var_dump(mb_convert_encoding(mb_trim($str, encoding: 'Shift_JIS'), 'UTF-8'));
// string(19) "には信じ難い?"
Thanks all.
Description
The default values for the parameter $characters of the new mb_trim functions are not accurate. When the very same value as the default is implied to $characters like the code below, mb_trim() does not necessarily work the same way, because $characters also depends on $encoding. The parameter should be typed as ?string $characters = null instead.
PHP Version
PHP dev-master
Operating System
No response