Closed felixwatzka closed 2 years ago
Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?
Oh yeah, you're right. Didn't even think about this. So this was actually a bug in all older PHP versions?
We fixed our application by using ISO-8859-1 instead of ASCII as $from_encoding.
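A minimal sketch of the fix described above, assuming the input bytes really are ISO-8859-1 (the sample string is made up for illustration). It checks the input against both encodings and then converts with the correct source encoding.

```php
<?php
// "Grüße" encoded as ISO-8859-1: 0xFC is "ü" and 0xDF is "ß".
// Neither byte is valid ASCII, because ASCII ends at 0x7F.
$latin1 = "Gr\xFC\xDFe";

var_dump(mb_check_encoding($latin1, 'ASCII'));      // bool(false)
var_dump(mb_check_encoding($latin1, 'ISO-8859-1')); // bool(true)

// Declaring the real source encoding yields valid UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // 4772c3bcc39f65
```

Checking with mb_check_encoding() first makes a wrongly declared source encoding visible before any bytes are replaced.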
> Well, there are no ASCII characters above 127. The stricter behavior is likely deliberate. @alexdowad, could you please clarify?
Yep, that is right.
> So this was actually a bug in all older PHP versions?
Yes, it was.
There were a lot of bugs in mbstring in older PHP versions. Most of them were so obscure that probably no user of PHP ever experienced them.
Okay, closing as invalid then.
...But thanks to @felixwatzka for reporting. Please keep letting us know if you notice anything else which seems unusual.
I have a case where we load images from an NTEXT field in a SQL Server database. The images are written to the NTEXT field from VBScript (I know that storing binary data in NTEXT is deprecated and stupid, but the database is 20 years old and can't be easily changed).
In PHP 8.0.11 we used the following conversion:
mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')
This returned the correct binary representation of the image, but in PHP 8.1.8+ it no longer works: some bytes are replaced with 3F.
Any suggestions on how to emulate the old behavior? I am also struggling to find out exactly which change caused this, and in which version.
UPD:
This still works
return @\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8');
@rhulka Thank you very much for the report. I will be happy to investigate and report exactly what has caused the difference in behavior and when it changed.
If the change is unintentional, we will revert it; if it is intentional, we will advise how you can work around it.
Hope to check into this later today if possible. Thanks again for the report.
@alexdowad, thanks a lot for the quick reply. We will go with UConverter for the moment. I understand that our case is more a matter of exploiting "old bugs" than normal usage, so there is no rush.
To sum up
PHP 8
@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8')
=== \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8');
=== mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')
PHP 8.1
@\UConverter::transcode($this->getRawData(), 'Windows-1252', 'UTF-8')
=== \UConverter::transcode($this->getRawData(), 'ibm-5348_P100-1997', 'UTF-8');
!== mb_convert_encoding($this->getRawData(), 'Windows-1252', 'UTF-8')
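The workaround above could be wrapped roughly as follows. This is only a sketch: the function name `to_windows1252` is hypothetical, and it assumes the intl extension may or may not be installed, falling back to mbstring when `UConverter` is unavailable.

```php
<?php
/**
 * Hypothetical helper: convert UTF-8 to Windows-1252, preferring
 * ICU's UConverter (intl extension) and falling back to mbstring.
 */
function to_windows1252(string $utf8): string
{
    if (class_exists('UConverter')) {
        // Argument order is: string, target encoding, source encoding.
        $result = UConverter::transcode($utf8, 'Windows-1252', 'UTF-8');
        if ($result !== false) {
            return $result;
        }
    }

    // Fallback: on PHP 8.1 this may substitute "?" for some code points.
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

// The euro sign (U+20AC) is byte 0x80 in Windows-1252.
echo bin2hex(to_windows1252("\xE2\x82\xAC")), "\n";
```

The two converters can disagree on unmapped code points, as the comparison above shows, so the fallback path is not a drop-in replacement for binary data.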
@rhulka Please also post the value of bin2hex($this->getRawData()) which demonstrates the problem.
Just investigating but can't get very far without knowing what $this->getRawData() actually is.
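When preparing such a report, a round-trip check can help narrow down whether a conversion is dropping data. The following is a rough sketch; the helper name `converts_losslessly` is made up for illustration and is not part of mbstring.

```php
<?php
/**
 * Hypothetical helper: returns true when $utf8 survives a round trip
 * through $target unchanged, i.e. no character was substituted.
 */
function converts_losslessly(string $utf8, string $target): bool
{
    $encoded = mb_convert_encoding($utf8, $target, 'UTF-8');
    $decoded = mb_convert_encoding($encoded, 'UTF-8', $target);

    return $decoded === $utf8;
}

// "café" fits into ISO-8859-1, so the round trip is lossless.
var_dump(converts_losslessly("caf\xC3\xA9", 'ISO-8859-1'));  // bool(true)

// The euro sign has no ISO-8859-1 mapping and is substituted.
var_dump(converts_losslessly("\xE2\x82\xAC", 'ISO-8859-1')); // bool(false)

// Hex dump of the input, suitable for pasting into a bug report.
echo bin2hex("caf\xC3\xA9"), "\n"; // 636166c3a9
```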
@alexdowad
ntext_image.txt - raw data from the database (NTEXT field SQLServer 2016), it's not a "text". I can't upload with the .bin extension.
- here is result image after conversion in PHP 8.0.11 (MacOS and Debian in docker container)
- after encoding in PHP 8.1.12
ntext_image_bin2hex.txt - bin2hex output
$image = mb_convert_encoding($rawDataFromDbNtextField, 'Windows-1252', 'UTF-8');
header('Content-Type: image/jpeg');
echo $image;
exit;
Thank you
OK, it looks like the error markers (?) are being added not when decoding the UTF-8, but when converting to Windows-1252.
Please see this information on Windows-1252: https://en.wikipedia.org/wiki/Windows-1252
Looking at the conversion table there, you can see that some codepoints like U+0081 and U+008F have no mapping in Windows-1252. The old implementation of mbstring would convert U+0081 to 0x81 for Windows-1252 nonetheless.
The Wikipedia page does have an interesting note here:
According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.
So if we want to improve compatibility with the Win32 API, we could restore mappings for U+0081, U+008D, U+008F, U+0090, and U+009D.
I'm not sure if this would be enough to make Windows-1252 work for @rhulka's use case or not.
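Until such mappings are restored, userland code could emulate them. The sketch below is an assumption about how that might look, not the mbstring implementation: the function name `utf8_to_cp1252_bestfit` is hypothetical, and it handles only the five positions named in the Wikipedia note, passing everything else to mb_convert_encoding().

```php
<?php
/**
 * Hypothetical helper: UTF-8 to Windows-1252, mapping the five C1
 * control code points without an official Windows-1252 assignment
 * (U+0081, U+008D, U+008F, U+0090, U+009D) to the same byte value.
 */
function utf8_to_cp1252_bestfit(string $utf8): string
{
    // Split around the five code points, keeping them as delimiters.
    $parts = preg_split(
        '/([\x{81}\x{8D}\x{8F}\x{90}\x{9D}])/u',
        $utf8,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Delimiter: encoded as 0xC2 0x8X in UTF-8, so the second
            // byte equals the code point value.
            $out .= $part[1];
        } else {
            $out .= mb_convert_encoding($part, 'Windows-1252', 'UTF-8');
        }
    }

    return $out;
}

// U+0081, "A", and the euro sign (U+20AC).
echo bin2hex(utf8_to_cp1252_bestfit("\xC2\x81A\xE2\x82\xAC")), "\n"; // 814180
```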
Any comments??
Calling @cmb69
Thinking about this a bit more. Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.
@alexdowad
I think it would be enough. The encoder in VBScript (a legacy app that we are rewriting in PHP) does just that, which is why we got the idea to use this approach in PHP in the first place. I am not an expert in encoding conversion; I just noticed that the behavior differs between versions, and that U+0081, U+008D, U+0090... got replaced with 3F.
Was it changed in this commit: https://github.com/php/php-src/commit/b5ff87ca71375bbc5cb6eee93f15aff1cb756bb9 ? If so, I can revert it and recompile PHP from source for testing.
@rhulka Yes, that should be the one.
> Given that mbstring is not "greenfield" but is a library with a long history, I think restoring the BC behavior is the right thing to do here.
I agree.
I should have a patch within a day or so.
Thank you for bringing up this issue. I figured out that our app was exploiting this "bug" by converting from us-ascii to UTF-8, which worked up to version 8.0 and stopped working in 8.1. The app accepted special characters such as the German ä, ö, ü, ß and tried to convert them to proper UTF-8 to send to a third-party app. This broke after an update to PHP 8.2: these characters were replaced with ?? instead. Now the app uses ISO-8859-1 as the source encoding and everything works properly.
I apologize for commenting on a closed issue. I just thought it would be useful to add my case here so that someone might find this page via Google; I couldn't, so I had to browse the issues list.
Thanks for posting, @themao.
@rhulka
> Any suggestion on how to emulate old behavior?
function windows1252_to_utf8(string $str): string
{
return strtr($str, array(
"\x80" => "\xE2\x82\xAC",
"\x82" => "\xE2\x80\x9A",
"\x84" => "\xE2\x80\x9E",
"\x85" => "\xE2\x80\xA6",
"\x86" => "\xE2\x80\xA0",
"\x87" => "\xE2\x80\xA1",
"\x89" => "\xE2\x80\xB0",
"\x8B" => "\xE2\x80\xB9",
"\x91" => "\xE2\x80\x98",
"\x92" => "\xE2\x80\x99",
"\x93" => "\xE2\x80\x9C",
"\x94" => "\xE2\x80\x9D",
"\x95" => "\xE2\x80\xA2",
"\x96" => "\xE2\x80\x93",
"\x97" => "\xE2\x80\x94",
"\x99" => "\xE2\x84\xA2",
"\x9B" => "\xE2\x80\xBA",
"\x81" => "\xC2\x81",
"\x83" => "\xC6\x92",
"\x88" => "\xCB\x86",
"\x8A" => "\xC5\xA0",
"\x8C" => "\xC5\x92",
"\x8D" => "\xC2\x8D",
"\x8E" => "\xC5\xBD",
"\x8F" => "\xC2\x8F",
"\x90" => "\xC2\x90",
"\x98" => "\xCB\x9C",
"\x9A" => "\xC5\xA1",
"\x9C" => "\xC5\x93",
"\x9D" => "\xC2\x9D",
"\x9E" => "\xC5\xBE",
"\x9F" => "\xC5\xB8",
"\xA0" => "\xC2\xA0",
"\xA1" => "\xC2\xA1",
"\xA2" => "\xC2\xA2",
"\xA3" => "\xC2\xA3",
"\xA4" => "\xC2\xA4",
"\xA5" => "\xC2\xA5",
"\xA6" => "\xC2\xA6",
"\xA7" => "\xC2\xA7",
"\xA8" => "\xC2\xA8",
"\xA9" => "\xC2\xA9",
"\xAA" => "\xC2\xAA",
"\xAB" => "\xC2\xAB",
"\xAC" => "\xC2\xAC",
"\xAD" => "\xC2\xAD",
"\xAE" => "\xC2\xAE",
"\xAF" => "\xC2\xAF",
"\xB0" => "\xC2\xB0",
"\xB1" => "\xC2\xB1",
"\xB2" => "\xC2\xB2",
"\xB3" => "\xC2\xB3",
"\xB4" => "\xC2\xB4",
"\xB5" => "\xC2\xB5",
"\xB6" => "\xC2\xB6",
"\xB7" => "\xC2\xB7",
"\xB8" => "\xC2\xB8",
"\xB9" => "\xC2\xB9",
"\xBA" => "\xC2\xBA",
"\xBB" => "\xC2\xBB",
"\xBC" => "\xC2\xBC",
"\xBD" => "\xC2\xBD",
"\xBE" => "\xC2\xBE",
"\xBF" => "\xC2\xBF",
"\xC0" => "\xC3\x80",
"\xC1" => "\xC3\x81",
"\xC2" => "\xC3\x82",
"\xC3" => "\xC3\x83",
"\xC4" => "\xC3\x84",
"\xC5" => "\xC3\x85",
"\xC6" => "\xC3\x86",
"\xC7" => "\xC3\x87",
"\xC8" => "\xC3\x88",
"\xC9" => "\xC3\x89",
"\xCA" => "\xC3\x8A",
"\xCB" => "\xC3\x8B",
"\xCC" => "\xC3\x8C",
"\xCD" => "\xC3\x8D",
"\xCE" => "\xC3\x8E",
"\xCF" => "\xC3\x8F",
"\xD0" => "\xC3\x90",
"\xD1" => "\xC3\x91",
"\xD2" => "\xC3\x92",
"\xD3" => "\xC3\x93",
"\xD4" => "\xC3\x94",
"\xD5" => "\xC3\x95",
"\xD6" => "\xC3\x96",
"\xD7" => "\xC3\x97",
"\xD8" => "\xC3\x98",
"\xD9" => "\xC3\x99",
"\xDA" => "\xC3\x9A",
"\xDB" => "\xC3\x9B",
"\xDC" => "\xC3\x9C",
"\xDD" => "\xC3\x9D",
"\xDE" => "\xC3\x9E",
"\xDF" => "\xC3\x9F",
"\xE0" => "\xC3\xA0",
"\xE1" => "\xC3\xA1",
"\xE2" => "\xC3\xA2",
"\xE3" => "\xC3\xA3",
"\xE4" => "\xC3\xA4",
"\xE5" => "\xC3\xA5",
"\xE6" => "\xC3\xA6",
"\xE7" => "\xC3\xA7",
"\xE8" => "\xC3\xA8",
"\xE9" => "\xC3\xA9",
"\xEA" => "\xC3\xAA",
"\xEB" => "\xC3\xAB",
"\xEC" => "\xC3\xAC",
"\xED" => "\xC3\xAD",
"\xEE" => "\xC3\xAE",
"\xEF" => "\xC3\xAF",
"\xF0" => "\xC3\xB0",
"\xF1" => "\xC3\xB1",
"\xF2" => "\xC3\xB2",
"\xF3" => "\xC3\xB3",
"\xF4" => "\xC3\xB4",
"\xF5" => "\xC3\xB5",
"\xF6" => "\xC3\xB6",
"\xF7" => "\xC3\xB7",
"\xF8" => "\xC3\xB8",
"\xF9" => "\xC3\xB9",
"\xFA" => "\xC3\xBA",
"\xFB" => "\xC3\xBB",
"\xFC" => "\xC3\xBC",
"\xFD" => "\xC3\xBD",
"\xFE" => "\xC3\xBE",
"\xFF" => "\xC3\xBF",
));
}
function utf8_to_windows1252(string $str): string
{
return strtr($str, array(
"\xE2\x82\xAC" => "\x80",
"\xE2\x80\x9A" => "\x82",
"\xE2\x80\x9E" => "\x84",
"\xE2\x80\xA6" => "\x85",
"\xE2\x80\xA0" => "\x86",
"\xE2\x80\xA1" => "\x87",
"\xE2\x80\xB0" => "\x89",
"\xE2\x80\xB9" => "\x8B",
"\xE2\x80\x98" => "\x91",
"\xE2\x80\x99" => "\x92",
"\xE2\x80\x9C" => "\x93",
"\xE2\x80\x9D" => "\x94",
"\xE2\x80\xA2" => "\x95",
"\xE2\x80\x93" => "\x96",
"\xE2\x80\x94" => "\x97",
"\xE2\x84\xA2" => "\x99",
"\xE2\x80\xBA" => "\x9B",
"\xC2\x81" => "\x81",
"\xC6\x92" => "\x83",
"\xCB\x86" => "\x88",
"\xC5\xA0" => "\x8A",
"\xC5\x92" => "\x8C",
"\xC2\x8D" => "\x8D",
"\xC5\xBD" => "\x8E",
"\xC2\x8F" => "\x8F",
"\xC2\x90" => "\x90",
"\xCB\x9C" => "\x98",
"\xC5\xA1" => "\x9A",
"\xC5\x93" => "\x9C",
"\xC2\x9D" => "\x9D",
"\xC5\xBE" => "\x9E",
"\xC5\xB8" => "\x9F",
"\xC2\xA0" => "\xA0",
"\xC2\xA1" => "\xA1",
"\xC2\xA2" => "\xA2",
"\xC2\xA3" => "\xA3",
"\xC2\xA4" => "\xA4",
"\xC2\xA5" => "\xA5",
"\xC2\xA6" => "\xA6",
"\xC2\xA7" => "\xA7",
"\xC2\xA8" => "\xA8",
"\xC2\xA9" => "\xA9",
"\xC2\xAA" => "\xAA",
"\xC2\xAB" => "\xAB",
"\xC2\xAC" => "\xAC",
"\xC2\xAD" => "\xAD",
"\xC2\xAE" => "\xAE",
"\xC2\xAF" => "\xAF",
"\xC2\xB0" => "\xB0",
"\xC2\xB1" => "\xB1",
"\xC2\xB2" => "\xB2",
"\xC2\xB3" => "\xB3",
"\xC2\xB4" => "\xB4",
"\xC2\xB5" => "\xB5",
"\xC2\xB6" => "\xB6",
"\xC2\xB7" => "\xB7",
"\xC2\xB8" => "\xB8",
"\xC2\xB9" => "\xB9",
"\xC2\xBA" => "\xBA",
"\xC2\xBB" => "\xBB",
"\xC2\xBC" => "\xBC",
"\xC2\xBD" => "\xBD",
"\xC2\xBE" => "\xBE",
"\xC2\xBF" => "\xBF",
"\xC3\x80" => "\xC0",
"\xC3\x81" => "\xC1",
"\xC3\x82" => "\xC2",
"\xC3\x83" => "\xC3",
"\xC3\x84" => "\xC4",
"\xC3\x85" => "\xC5",
"\xC3\x86" => "\xC6",
"\xC3\x87" => "\xC7",
"\xC3\x88" => "\xC8",
"\xC3\x89" => "\xC9",
"\xC3\x8A" => "\xCA",
"\xC3\x8B" => "\xCB",
"\xC3\x8C" => "\xCC",
"\xC3\x8D" => "\xCD",
"\xC3\x8E" => "\xCE",
"\xC3\x8F" => "\xCF",
"\xC3\x90" => "\xD0",
"\xC3\x91" => "\xD1",
"\xC3\x92" => "\xD2",
"\xC3\x93" => "\xD3",
"\xC3\x94" => "\xD4",
"\xC3\x95" => "\xD5",
"\xC3\x96" => "\xD6",
"\xC3\x97" => "\xD7",
"\xC3\x98" => "\xD8",
"\xC3\x99" => "\xD9",
"\xC3\x9A" => "\xDA",
"\xC3\x9B" => "\xDB",
"\xC3\x9C" => "\xDC",
"\xC3\x9D" => "\xDD",
"\xC3\x9E" => "\xDE",
"\xC3\x9F" => "\xDF",
"\xC3\xA0" => "\xE0",
"\xC3\xA1" => "\xE1",
"\xC3\xA2" => "\xE2",
"\xC3\xA3" => "\xE3",
"\xC3\xA4" => "\xE4",
"\xC3\xA5" => "\xE5",
"\xC3\xA6" => "\xE6",
"\xC3\xA7" => "\xE7",
"\xC3\xA8" => "\xE8",
"\xC3\xA9" => "\xE9",
"\xC3\xAA" => "\xEA",
"\xC3\xAB" => "\xEB",
"\xC3\xAC" => "\xEC",
"\xC3\xAD" => "\xED",
"\xC3\xAE" => "\xEE",
"\xC3\xAF" => "\xEF",
"\xC3\xB0" => "\xF0",
"\xC3\xB1" => "\xF1",
"\xC3\xB2" => "\xF2",
"\xC3\xB3" => "\xF3",
"\xC3\xB4" => "\xF4",
"\xC3\xB5" => "\xF5",
"\xC3\xB6" => "\xF6",
"\xC3\xB7" => "\xF7",
"\xC3\xB8" => "\xF8",
"\xC3\xB9" => "\xF9",
"\xC3\xBA" => "\xFA",
"\xC3\xBB" => "\xFB",
"\xC3\xBC" => "\xFC",
"\xC3\xBD" => "\xFD",
"\xC3\xBE" => "\xFE",
"\xC3\xBF" => "\xFF",
));
}
should work.
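As an illustration, the same lookup table can also be generated instead of written out by hand. This is a sketch only: the function name `cp1252_table` is hypothetical, and it relies on mb_chr() from mbstring. Bytes 0xA0 to 0xFF map to the code point with the same value, so only the 0x80 to 0x9F block needs an explicit list.

```php
<?php
/**
 * Hypothetical helper: builds the Windows-1252 byte => UTF-8 map
 * for bytes 0x80..0xFF, equivalent to the hand-written table.
 */
function cp1252_table(): array
{
    // Code points for bytes 0x80..0x9F, in order. The five unassigned
    // positions map to the C1 control with the same value.
    $high = [
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
    ];

    $table = [];
    foreach ($high as $offset => $codepoint) {
        $table[chr(0x80 + $offset)] = mb_chr($codepoint, 'UTF-8');
    }
    // 0xA0..0xFF are identical to ISO-8859-1 (same code point value).
    for ($byte = 0xA0; $byte <= 0xFF; $byte++) {
        $table[chr($byte)] = mb_chr($byte, 'UTF-8');
    }

    return $table;
}

// Round trip: bytes => UTF-8 => bytes.
$bytes = "\x80\x81\xE4";
$utf8  = strtr($bytes, cp1252_table());
$back  = strtr($utf8, array_flip(cp1252_table()));

echo bin2hex($utf8), "\n"; // e282acc281c3a4
echo bin2hex($back), "\n"; // 8081e4
```

Because strtr() leaves unmatched input untouched, bytes below 0x80 pass through unchanged in both directions.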
Description
The following code:
Resulted in this output:
But I expected this output instead:
In fact, since PHP 8.1, no ASCII character above 127 will be converted correctly. You can try the above example also here: https://3v4l.org/7ZASZ
PHP Version
PHP 8.1.6
Operating System
Ubuntu 22.04