mb_detect_encoding() et al. need sufficient input

RV7PR commented 2 years ago

Description

The following code:

<?php
$strings = ['bl', 'Bl', 'Blog'];
foreach ($strings as &$str) {
    $str = mb_convert_encoding($str, 'UTF-8', mb_list_encodings());
}
unset($str); // break the reference left over from the by-reference foreach
echo '<pre>' . print_r($strings, true) . '</pre>';

Resulted in this output:

Array
(
    [0] => 扬
    [1] => 求
    [2] => 求杯
)

But I expected this output instead:

Array
(
    [0] => bl
    [1] => Bl
    [2] => Blog
)

PHP Version

PHP 8.1.8

Operating System

No response

iluuu1994 commented 2 years ago

This seems to have changed in PHP 8.1: https://3v4l.org/1YAWt. However, it's worth pointing out that mb_detect_encoding() (or passing an array to $from_encoding in mb_convert_encoding()) is inherently unreliable. Character encodings overlap heavily in their byte ranges, so a string might be perfectly valid in many of them. The algorithm becomes more accurate the more information you give it, e.g. https://3v4l.org/1M37o. So mb_detect_encoding() is really just a best guess.

/cc @alexdowad

cmb69 commented 2 years ago

See also https://3v4l.org/m89fH, which explains what's happening. The three given strings are valid UCS-2 (roughly UTF-16) strings, and are treated as such. If any of these strings consisted of an odd number of bytes, it could not be a UCS-2 string and would not be detected as one.
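
As a rough illustration of the point above (a sketch, not the detection logic mbstring actually uses): the helper couldBeUcs2() is hypothetical and only checks the necessary condition that the byte count is even. The extra string 'Blo' is added here purely to show the odd-length case.

```php
<?php
// Hypothetical helper: an even, non-zero byte count is a necessary
// (not sufficient) condition for a string to be UCS-2.
function couldBeUcs2(string $bytes): bool
{
    return strlen($bytes) > 0 && strlen($bytes) % 2 === 0;
}

$report = [];
foreach (['bl', 'Bl', 'Blog', 'Blo'] as $s) {
    // Reinterpret the raw bytes as big-endian UCS-2 when that is possible at all.
    $report[$s] = couldBeUcs2($s)
        ? mb_convert_encoding($s, 'UTF-8', 'UCS-2BE')
        : null;
}

foreach ($report as $input => $decoded) {
    echo $input, ' (', bin2hex($input), ') => ', $decoded ?? 'not UCS-2: odd length', "\n";
}
```

Every even-length ASCII string passes this check, which is why short inputs such as 'bl' are ambiguous.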

Anyhow, like @iluuu1994 already mentioned, you should really give more (i.e. mostly larger) input to mb_convert_encoding()/mb_detect_encoding() so that these functions have a chance to properly detect the encoding. A few bytes are usually insufficient.

I think this is merely a documentation issue, and likely already tracked somewhere else.

alexdowad commented 2 years ago

@RV7PR Thanks for reporting your observation. And thanks @iluuu1994 for the (accurate) comments which you shared.

As @iluuu1994 said, mbstring's automatic encoding detection has been amended to make it more accurate in the majority of cases. In some cases, though, the heuristics it uses may not return the answer which was desired.

In this case, we are taking the two bytes 0x62 0x6C. In UTF-16BE or UCS-2BE, those are U+626C, the Chinese character 扬. (See https://unicode-table.com/en/626C/) This is a common word in the Chinese language.
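
The byte arithmetic can be checked by hand. In the sketch below, codeUnit() is a hypothetical helper, not part of mbstring. The ASCII bytes of "bl" are 0x62 0x6C, which read big-endian give U+626C. Reading "Bl" (0x42 0x6C) little-endian gives U+6C42, which matches the 求 in the reported output; that the detector picked a little-endian encoding there is an inference from the output, not something stated in this thread.

```php
<?php
// Hypothetical helper: combine two bytes into one 16-bit code unit.
function codeUnit(string $twoBytes, bool $bigEndian): int
{
    $first  = ord($twoBytes[0]);
    $second = ord($twoBytes[1]);

    return $bigEndian
        ? (($first << 8) | $second)
        : (($second << 8) | $first);
}

$lowerBlBigEndian    = codeUnit('bl', true);   // 0x62, 0x6C read as 0x626C
$upperBlLittleEndian = codeUnit('Bl', false);  // 0x42, 0x6C read as 0x6C42

printf("bl as big-endian:    U+%04X %s\n", $lowerBlBigEndian, mb_chr($lowerBlBigEndian, 'UTF-8'));
printf("Bl as little-endian: U+%04X %s\n", $upperBlLittleEndian, mb_chr($upperBlLittleEndian, 'UTF-8'));
```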

Given just those two bytes, decoding them as 扬 is actually a very good guess. (Note that "bl" is not a word in English or any other language which I'm aware of.) If you are not Chinese and your users are not Chinese, it might seem silly that PHP is guessing that this text is Chinese... but remember, PHP is an international project and has users of all nationalities. With that in mind, "扬" is a very reasonable guess.

Basically what this comes down to, is that guessing the encoding of a string with only a few bytes is almost completely hopeless. You need to provide more input text, and even then, guessing text encodings is still dangerous. It's much better to provide a specific encoding to mb_convert_encoding whenever possible. Even when that is not possible, it's much better to provide a list of just a few possible encodings rather than using mb_list_encodings().
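
A minimal sketch of those two recommendations. The sample bytes and the candidate list are assumptions chosen for illustration; pick the encodings your own data can realistically be in.

```php
<?php
// Best case: the source encoding is known, so no guessing happens at all.
$latin1Bytes = "caf\xE9";   // "café" encoded as ISO-8859-1 (assumed sample input)
$knownSource = mb_convert_encoding($latin1Bytes, 'UTF-8', 'ISO-8859-1');

// Fallback: restrict the guess to a few plausible candidates
// instead of passing everything from mb_list_encodings().
$candidates  = ['ASCII', 'UTF-8', 'ISO-8859-1'];   // assumed candidate list
$shortListed = mb_convert_encoding('bl', 'UTF-8', $candidates);

// The same short list works for detection; strict mode is enabled here.
$detected = mb_detect_encoding('bl', $candidates, true);

var_dump($knownSource, $shortListed, $detected);
```

With no 16-bit encodings among the candidates, 'bl' can no longer be read as a single CJK character, whichever candidate is chosen.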

Hope that is clear, and thanks again.

alexdowad commented 2 years ago

@cmb69 I wonder if we should include a warning in the docs for mb_detect_encoding et al, telling people that automatic detection of text encoding is inherently unreliable and should be avoided whenever possible?

RV7PR commented 2 years ago

@alexdowad And thank you for your detailed explanation!

cmb69 commented 2 years ago

@alexdowad, yeah, we should document that. ICU's CharsetDetector class has similar docs:

For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed.

iluuu1994 commented 2 years ago

Thanks @alexdowad! Closing this with the given explanation.

alexdowad commented 2 years ago

For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed.

This is interesting. For what it's worth, "primarily in a single language" is not relevant for the current implementation of mb_detect_encoding. We could potentially get more accurate detection by taking into account the fact that natural text doesn't usually include (say) Greek, Chinese, and Cyrillic characters all in the same sentence. But we don't consider that at all right now.

"Minimum of a few hundred bytes" is very good.
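
One way to apply that guideline in user code is to refuse to guess when the input is too short. This is only a sketch: detectEncodingOrNull() and the 200-byte threshold are assumptions made for illustration, not anything mbstring or ICU provides.

```php
<?php
// Assumed threshold, loosely based on "a few hundred bytes".
const MIN_DETECT_BYTES = 200;

// Hypothetical wrapper: returns null instead of a guess for short input.
function detectEncodingOrNull(string $bytes, array $candidates): ?string
{
    if (strlen($bytes) < MIN_DETECT_BYTES) {
        return null; // too little data for a meaningful guess
    }

    $encoding = mb_detect_encoding($bytes, $candidates, true);

    return $encoding === false ? null : $encoding;
}

$candidates = ['UTF-8', 'ISO-8859-1'];
$tooShort   = detectEncodingOrNull('bl', $candidates);
$longEnough = detectEncodingOrNull(str_repeat('The quick brown fox. ', 20), $candidates);

var_dump($tooShort, $longEnough);
```

Callers then have to handle null explicitly, for example by falling back to a configured default encoding.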

cmb69 commented 2 years ago

I just noticed that this is very closely related to #1708; still leaving both tickets open for now, so we can address all these concerns.
