php / php-src

The PHP Interpreter
https://www.php.net

PHP8.1: mb_convert_encoding not working with ASCII chars above 127 #8744

Closed felixwatzka closed 2 years ago

felixwatzka commented 2 years ago

Description

The following code:

<?php

echo mb_convert_encoding ( chr(252), 'UTF-8', 'ASCII' );

Resulted in this output:

?

But I expected this output instead:

ü

In fact, since PHP 8.1, no "ASCII" character above 127 is converted correctly. You can also try the above example here: https://3v4l.org/7ZASZ

PHP Version

PHP 8.1.6

Operating System

Ubuntu 22.04

cmb69 commented 2 years ago

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

felixwatzka commented 2 years ago

Oh yeah, you're right. Didn't even think about this. So this was actually a bug in all older PHP versions?

We fixed our application by using ISO-8859-1 instead of ASCII as $from_encoding.
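A minimal sketch of that workaround, assuming the input bytes really are ISO-8859-1 (Latin-1); that is an assumption about the application's data, not something mbstring can detect on its own:

```php
<?php
$byte = chr(252); // 0xFC: "ü" in ISO-8859-1, but not a valid ASCII byte

// ASCII only defines 0x00..0x7F, so this byte fails validation.
$isAscii = mb_check_encoding($byte, 'ASCII');
var_dump($isAscii);

// Naming the actual source encoding yields the two-byte UTF-8 form of "ü".
$utf8 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL;
```

If the data could also contain bytes in 0x80..0x9F (curly quotes, the euro sign), Windows-1252 may be the better source encoding to name.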

alexdowad commented 2 years ago

Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?

Yep, that is right.

So this was actually a bug in all older PHP versions?

Yes, it was.

There were a lot of bugs in mbstring in older PHP versions. Most of them were so obscure that probably no user of PHP ever experienced them.

cmb69 commented 2 years ago

Okay, closing as invalid then.

alexdowad commented 2 years ago

...But thanks to @felixwatzka for reporting. Please keep letting us know if you notice anything else which seems unusual.

rhulka commented 1 year ago

I have a case where we load images from an NTEXT field in a SQL Server database. The images are saved to NTEXT from VBScript (I know that storing binary in NTEXT is deprecated and stupid, but the database is 20 years old and can't be easily changed).

In PHP 8.0.11 we used the following conversion:

mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

This returned the correct binary representation of the image, but in PHP 8.1.8+ it no longer works: some bytes are replaced with 3F.

Any suggestion on how to emulate old behavior? I am also struggling to find exactly which change caused this and in which version.

UPD:

This still works

return @\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8');

alexdowad commented 1 year ago

@rhulka Thank you very much for the report. I will be happy to investigate and report exactly what has caused the difference in behavior and when it changed.

If the change is unintentional, we will revert it; if it is intentional, we will advise how you can work around it.

Hope to check into this later today if possible. Thanks again for the report.

rhulka commented 1 year ago

@alexdowad, thanks a lot for the quick reply. We will go with UConverter for the moment. I understand that in our case it's more a matter of exploiting "old bugs" than normal usage, so there is no rush.

To sum up

PHP 8

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8') === mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

PHP 8.1

@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8') === \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8') !== mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')

alexdowad commented 1 year ago

@rhulka Please also post the value of bin2hex($this->getRawData()) which demonstrates the problem.

alexdowad commented 1 year ago

Just investigating but can't get very far without knowing what $this->getRawData() actually is.

rhulka commented 1 year ago

@alexdowad

ntext_image.txt - raw data from the database (NTEXT field SQLServer 2016), it's not a "text". I can't upload with the .bin extension.

d5fdc38d6d139b4e186810c19725ffa8 - the resulting image after conversion in PHP 8.0.11 (macOS and Debian in a Docker container)

image_after_encoding_in_php81 - after encoding in PHP 8.1.12

ntext_image_bin2hex.txt - bin2hex output

$image = mb_convert_encoding($rawDataFromDbNtextField, 'Windows-1252', 'UTF-8');
header('Content-Type: image/jpeg');
echo $image;
exit;

Thank you

alexdowad commented 1 year ago

OK, it looks like the error markers (?) are being added not when decoding the UTF-8, but when converting to Windows-1252.

Please see this information on Windows-1252: https://en.wikipedia.org/wiki/Windows-1252

Looking at the conversion table there, you can see that some codepoints like U+0081 and U+008F have no mapping in Windows-1252. The old implementation of mbstring would convert U+0081 to 0x81 for Windows-1252 nonetheless.

The Wikipedia page does have an interesting note here:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.

So if we want to improve compatibility with the Win32 API, we could restore mappings for U+0081, U+008D, U+008F, U+0090, and U+009D.
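To check whether a given input is affected, one could scan the UTF-8 data for those code points before converting. A rough sketch; the helper name `find_unassigned_cp1252` is hypothetical and not part of mbstring, and the list of five positions is taken from the Wikipedia note quoted above:

```php
<?php
// Hypothetical helper: returns the code points in a UTF-8 string that fall on
// the five positions Windows-1252 leaves unassigned (81, 8D, 8F, 90, 9D).
function find_unassigned_cp1252(string $utf8): array
{
    // With the /u modifier, \x{81} matches the code point U+0081, not the raw byte.
    $matches = [];
    if (preg_match_all('/[\x{81}\x{8D}\x{8F}\x{90}\x{9D}]/u', $utf8, $matches) === false) {
        return []; // input was not valid UTF-8
    }

    return array_map(
        fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        $matches[0] ?? []
    );
}

print_r(find_unassigned_cp1252("A\xC2\x81B\xC2\x90\xC3\xBC"));
```

Any code point this reports is one where the strict and the backward-compatible conversions to Windows-1252 can disagree.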

I'm not sure if this would be enough to make Windows-1252 work for @rhulka's use case or not.

Any comments?

alexdowad commented 1 year ago

Calling @cmb69

alexdowad commented 1 year ago

Thinking about this a bit more. Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

rhulka commented 1 year ago

@alexdowad

I think it would be enough. The encoder in VBScript (a legacy app that we are rewriting in PHP) does just that, which is why we had the idea to use this approach in PHP in the first place. I am not an expert in encoding conversion; I just noticed that the behavior differs between versions, and that U+0081, U+008D, U+0090... were replaced with 3F.

rhulka commented 1 year ago

Was it changed in this commit: https://github.com/php/php-src/commit/b5ff87ca71375bbc5cb6eee93f15aff1cb756bb9 ? If so, I can revert it and recompile PHP from source for testing.

alexdowad commented 1 year ago

@rhulka Yes, that should be the one.

cmb69 commented 1 year ago

Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.

I agree.

alexdowad commented 1 year ago

I should have a patch within a day or so.

themao commented 1 year ago

Thank you for bringing up this issue. I figured out that our app was exploiting this "bug" by converting from us-ascii to UTF-8, which worked up to version 8.0 and stopped working in 8.1. The app accepted special characters such as the German ä, ö, ü, ß and tried to convert them to proper UTF-8 to send to a third-party app. This broke after our update to PHP 8.2: the characters were replaced with ?? instead. Now the app uses ISO-8859-1 as the source encoding and everything works properly.
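For inputs that are labelled us-ascii but may carry 8-bit bytes, a small guard can make the fallback explicit. This is only a sketch: the function name `legacy_to_utf8` is hypothetical, and treating non-ASCII input as ISO-8859-1 is an assumption that has to match where the data actually comes from.

```php
<?php
// Hypothetical helper: the name and the ISO-8859-1 fallback are assumptions.
function legacy_to_utf8(string $input): string
{
    // Pure 7-bit ASCII is already valid UTF-8, so there is nothing to convert.
    if (mb_check_encoding($input, 'ASCII')) {
        return $input;
    }

    // Otherwise treat the bytes as ISO-8859-1, where every byte maps to a code point.
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(legacy_to_utf8("Gr\xFC\xDFe")), PHP_EOL;
```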

I apologize for commenting on a closed issue. I just thought it would be useful to add my case here so that someone might find this page via Google; I couldn't, so I had to browse the issues list.

alexdowad commented 1 year ago

Thanks for posting, @themao.

divinity76 commented 5 months ago

@rhulka

Any suggestion on how to emulate old behavior?

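// Maps each Windows-1252 byte in 0x80..0xFF to its UTF-8 sequence; ASCII bytes pass through.
// The unused positions 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the matching C1 controls.
function windows1252_to_utf8(string $str): string
{
    return strtr($str, array(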
        "\x80" => "\xE2\x82\xAC",
        "\x82" => "\xE2\x80\x9A",
        "\x84" => "\xE2\x80\x9E",
        "\x85" => "\xE2\x80\xA6",
        "\x86" => "\xE2\x80\xA0",
        "\x87" => "\xE2\x80\xA1",
        "\x89" => "\xE2\x80\xB0",
        "\x8B" => "\xE2\x80\xB9",
        "\x91" => "\xE2\x80\x98",
        "\x92" => "\xE2\x80\x99",
        "\x93" => "\xE2\x80\x9C",
        "\x94" => "\xE2\x80\x9D",
        "\x95" => "\xE2\x80\xA2",
        "\x96" => "\xE2\x80\x93",
        "\x97" => "\xE2\x80\x94",
        "\x99" => "\xE2\x84\xA2",
        "\x9B" => "\xE2\x80\xBA",
        "\x81" => "\xC2\x81",
        "\x83" => "\xC6\x92",
        "\x88" => "\xCB\x86",
        "\x8A" => "\xC5\xA0",
        "\x8C" => "\xC5\x92",
        "\x8D" => "\xC2\x8D",
        "\x8E" => "\xC5\xBD",
        "\x8F" => "\xC2\x8F",
        "\x90" => "\xC2\x90",
        "\x98" => "\xCB\x9C",
        "\x9A" => "\xC5\xA1",
        "\x9C" => "\xC5\x93",
        "\x9D" => "\xC2\x9D",
        "\x9E" => "\xC5\xBE",
        "\x9F" => "\xC5\xB8",
        "\xA0" => "\xC2\xA0",
        "\xA1" => "\xC2\xA1",
        "\xA2" => "\xC2\xA2",
        "\xA3" => "\xC2\xA3",
        "\xA4" => "\xC2\xA4",
        "\xA5" => "\xC2\xA5",
        "\xA6" => "\xC2\xA6",
        "\xA7" => "\xC2\xA7",
        "\xA8" => "\xC2\xA8",
        "\xA9" => "\xC2\xA9",
        "\xAA" => "\xC2\xAA",
        "\xAB" => "\xC2\xAB",
        "\xAC" => "\xC2\xAC",
        "\xAD" => "\xC2\xAD",
        "\xAE" => "\xC2\xAE",
        "\xAF" => "\xC2\xAF",
        "\xB0" => "\xC2\xB0",
        "\xB1" => "\xC2\xB1",
        "\xB2" => "\xC2\xB2",
        "\xB3" => "\xC2\xB3",
        "\xB4" => "\xC2\xB4",
        "\xB5" => "\xC2\xB5",
        "\xB6" => "\xC2\xB6",
        "\xB7" => "\xC2\xB7",
        "\xB8" => "\xC2\xB8",
        "\xB9" => "\xC2\xB9",
        "\xBA" => "\xC2\xBA",
        "\xBB" => "\xC2\xBB",
        "\xBC" => "\xC2\xBC",
        "\xBD" => "\xC2\xBD",
        "\xBE" => "\xC2\xBE",
        "\xBF" => "\xC2\xBF",
        "\xC0" => "\xC3\x80",
        "\xC1" => "\xC3\x81",
        "\xC2" => "\xC3\x82",
        "\xC3" => "\xC3\x83",
        "\xC4" => "\xC3\x84",
        "\xC5" => "\xC3\x85",
        "\xC6" => "\xC3\x86",
        "\xC7" => "\xC3\x87",
        "\xC8" => "\xC3\x88",
        "\xC9" => "\xC3\x89",
        "\xCA" => "\xC3\x8A",
        "\xCB" => "\xC3\x8B",
        "\xCC" => "\xC3\x8C",
        "\xCD" => "\xC3\x8D",
        "\xCE" => "\xC3\x8E",
        "\xCF" => "\xC3\x8F",
        "\xD0" => "\xC3\x90",
        "\xD1" => "\xC3\x91",
        "\xD2" => "\xC3\x92",
        "\xD3" => "\xC3\x93",
        "\xD4" => "\xC3\x94",
        "\xD5" => "\xC3\x95",
        "\xD6" => "\xC3\x96",
        "\xD7" => "\xC3\x97",
        "\xD8" => "\xC3\x98",
        "\xD9" => "\xC3\x99",
        "\xDA" => "\xC3\x9A",
        "\xDB" => "\xC3\x9B",
        "\xDC" => "\xC3\x9C",
        "\xDD" => "\xC3\x9D",
        "\xDE" => "\xC3\x9E",
        "\xDF" => "\xC3\x9F",
        "\xE0" => "\xC3\xA0",
        "\xE1" => "\xC3\xA1",
        "\xE2" => "\xC3\xA2",
        "\xE3" => "\xC3\xA3",
        "\xE4" => "\xC3\xA4",
        "\xE5" => "\xC3\xA5",
        "\xE6" => "\xC3\xA6",
        "\xE7" => "\xC3\xA7",
        "\xE8" => "\xC3\xA8",
        "\xE9" => "\xC3\xA9",
        "\xEA" => "\xC3\xAA",
        "\xEB" => "\xC3\xAB",
        "\xEC" => "\xC3\xAC",
        "\xED" => "\xC3\xAD",
        "\xEE" => "\xC3\xAE",
        "\xEF" => "\xC3\xAF",
        "\xF0" => "\xC3\xB0",
        "\xF1" => "\xC3\xB1",
        "\xF2" => "\xC3\xB2",
        "\xF3" => "\xC3\xB3",
        "\xF4" => "\xC3\xB4",
        "\xF5" => "\xC3\xB5",
        "\xF6" => "\xC3\xB6",
        "\xF7" => "\xC3\xB7",
        "\xF8" => "\xC3\xB8",
        "\xF9" => "\xC3\xB9",
        "\xFA" => "\xC3\xBA",
        "\xFB" => "\xC3\xBB",
        "\xFC" => "\xC3\xBC",
        "\xFD" => "\xC3\xBD",
        "\xFE" => "\xC3\xBE",
        "\xFF" => "\xC3\xBF",
    ));
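}

// Inverse of windows1252_to_utf8(): maps the UTF-8 sequences back to single Windows-1252 bytes.
// Byte sequences with no entry in the table are passed through unchanged.
function utf8_to_windows1252(string $str): string
{
    return strtr($str, array(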
        "\xE2\x82\xAC" => "\x80",
        "\xE2\x80\x9A" => "\x82",
        "\xE2\x80\x9E" => "\x84",
        "\xE2\x80\xA6" => "\x85",
        "\xE2\x80\xA0" => "\x86",
        "\xE2\x80\xA1" => "\x87",
        "\xE2\x80\xB0" => "\x89",
        "\xE2\x80\xB9" => "\x8B",
        "\xE2\x80\x98" => "\x91",
        "\xE2\x80\x99" => "\x92",
        "\xE2\x80\x9C" => "\x93",
        "\xE2\x80\x9D" => "\x94",
        "\xE2\x80\xA2" => "\x95",
        "\xE2\x80\x93" => "\x96",
        "\xE2\x80\x94" => "\x97",
        "\xE2\x84\xA2" => "\x99",
        "\xE2\x80\xBA" => "\x9B",
        "\xC2\x81" => "\x81",
        "\xC6\x92" => "\x83",
        "\xCB\x86" => "\x88",
        "\xC5\xA0" => "\x8A",
        "\xC5\x92" => "\x8C",
        "\xC2\x8D" => "\x8D",
        "\xC5\xBD" => "\x8E",
        "\xC2\x8F" => "\x8F",
        "\xC2\x90" => "\x90",
        "\xCB\x9C" => "\x98",
        "\xC5\xA1" => "\x9A",
        "\xC5\x93" => "\x9C",
        "\xC2\x9D" => "\x9D",
        "\xC5\xBE" => "\x9E",
        "\xC5\xB8" => "\x9F",
        "\xC2\xA0" => "\xA0",
        "\xC2\xA1" => "\xA1",
        "\xC2\xA2" => "\xA2",
        "\xC2\xA3" => "\xA3",
        "\xC2\xA4" => "\xA4",
        "\xC2\xA5" => "\xA5",
        "\xC2\xA6" => "\xA6",
        "\xC2\xA7" => "\xA7",
        "\xC2\xA8" => "\xA8",
        "\xC2\xA9" => "\xA9",
        "\xC2\xAA" => "\xAA",
        "\xC2\xAB" => "\xAB",
        "\xC2\xAC" => "\xAC",
        "\xC2\xAD" => "\xAD",
        "\xC2\xAE" => "\xAE",
        "\xC2\xAF" => "\xAF",
        "\xC2\xB0" => "\xB0",
        "\xC2\xB1" => "\xB1",
        "\xC2\xB2" => "\xB2",
        "\xC2\xB3" => "\xB3",
        "\xC2\xB4" => "\xB4",
        "\xC2\xB5" => "\xB5",
        "\xC2\xB6" => "\xB6",
        "\xC2\xB7" => "\xB7",
        "\xC2\xB8" => "\xB8",
        "\xC2\xB9" => "\xB9",
        "\xC2\xBA" => "\xBA",
        "\xC2\xBB" => "\xBB",
        "\xC2\xBC" => "\xBC",
        "\xC2\xBD" => "\xBD",
        "\xC2\xBE" => "\xBE",
        "\xC2\xBF" => "\xBF",
        "\xC3\x80" => "\xC0",
        "\xC3\x81" => "\xC1",
        "\xC3\x82" => "\xC2",
        "\xC3\x83" => "\xC3",
        "\xC3\x84" => "\xC4",
        "\xC3\x85" => "\xC5",
        "\xC3\x86" => "\xC6",
        "\xC3\x87" => "\xC7",
        "\xC3\x88" => "\xC8",
        "\xC3\x89" => "\xC9",
        "\xC3\x8A" => "\xCA",
        "\xC3\x8B" => "\xCB",
        "\xC3\x8C" => "\xCC",
        "\xC3\x8D" => "\xCD",
        "\xC3\x8E" => "\xCE",
        "\xC3\x8F" => "\xCF",
        "\xC3\x90" => "\xD0",
        "\xC3\x91" => "\xD1",
        "\xC3\x92" => "\xD2",
        "\xC3\x93" => "\xD3",
        "\xC3\x94" => "\xD4",
        "\xC3\x95" => "\xD5",
        "\xC3\x96" => "\xD6",
        "\xC3\x97" => "\xD7",
        "\xC3\x98" => "\xD8",
        "\xC3\x99" => "\xD9",
        "\xC3\x9A" => "\xDA",
        "\xC3\x9B" => "\xDB",
        "\xC3\x9C" => "\xDC",
        "\xC3\x9D" => "\xDD",
        "\xC3\x9E" => "\xDE",
        "\xC3\x9F" => "\xDF",
        "\xC3\xA0" => "\xE0",
        "\xC3\xA1" => "\xE1",
        "\xC3\xA2" => "\xE2",
        "\xC3\xA3" => "\xE3",
        "\xC3\xA4" => "\xE4",
        "\xC3\xA5" => "\xE5",
        "\xC3\xA6" => "\xE6",
        "\xC3\xA7" => "\xE7",
        "\xC3\xA8" => "\xE8",
        "\xC3\xA9" => "\xE9",
        "\xC3\xAA" => "\xEA",
        "\xC3\xAB" => "\xEB",
        "\xC3\xAC" => "\xEC",
        "\xC3\xAD" => "\xED",
        "\xC3\xAE" => "\xEE",
        "\xC3\xAF" => "\xEF",
        "\xC3\xB0" => "\xF0",
        "\xC3\xB1" => "\xF1",
        "\xC3\xB2" => "\xF2",
        "\xC3\xB3" => "\xF3",
        "\xC3\xB4" => "\xF4",
        "\xC3\xB5" => "\xF5",
        "\xC3\xB6" => "\xF6",
        "\xC3\xB7" => "\xF7",
        "\xC3\xB8" => "\xF8",
        "\xC3\xB9" => "\xF9",
        "\xC3\xBA" => "\xFA",
        "\xC3\xBB" => "\xFB",
        "\xC3\xBC" => "\xFC",
        "\xC3\xBD" => "\xFD",
        "\xC3\xBE" => "\xFE",
        "\xC3\xBF" => "\xFF",
    ));
}

should work.
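For comparison, the same byte mapping can be generated rather than written out. The following is a sketch of an equivalent approach, not a replacement endorsed in this thread; the names `cp1252_table`, `cp1252_to_utf8` and `utf8_to_cp1252` are hypothetical, and it assumes mbstring is available for `mb_chr`.

```php
<?php
// Hypothetical helper: builds the byte => UTF-8 table instead of listing it.
function cp1252_table(): array
{
    // Code points for bytes 0x80..0x9F. The five positions Windows-1252 leaves
    // unused (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the matching C1 controls.
    $high = [
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
    ];

    $table = [];
    foreach ($high as $i => $codepoint) {
        $table[chr(0x80 + $i)] = mb_chr($codepoint, 'UTF-8');
    }
    // Bytes 0xA0..0xFF coincide with the code points U+00A0..U+00FF.
    for ($byte = 0xA0; $byte <= 0xFF; $byte++) {
        $table[chr($byte)] = mb_chr($byte, 'UTF-8');
    }

    return $table;
}

function cp1252_to_utf8(string $bytes): string
{
    return strtr($bytes, cp1252_table());
}

function utf8_to_cp1252(string $utf8): string
{
    // strtr() tries the longest keys first, so three-byte sequences win over two-byte ones.
    return strtr($utf8, array_flip(cp1252_table()));
}

// Round trip over every possible byte value.
$allBytes = implode('', array_map('chr', range(0, 255)));
var_dump(utf8_to_cp1252(cp1252_to_utf8($allBytes)) === $allBytes);
```

As with the explicit tables, UTF-8 sequences that have no Windows-1252 counterpart are passed through unchanged rather than replaced with ?, so callers may want to validate the result.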