paquettg / php-html-parser

An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.
MIT License

mb_convert_encoding returns unwanted characters #265

Open ENT8R opened 3 years ago

ENT8R commented 3 years ago

The HTML cleaner converts the input to a specified encoding before any other operations: https://github.com/paquettg/php-html-parser/blob/4e01a438ad5961cc2d7427eb9798d213c8a12629/src/PHPHtmlParser/Dom/Cleaner.php#L128 The problem I see here is that the call omits the third parameter (from_encoding), and according to the documentation mb_convert_encoding() then falls back to the internal default encoding:

If from_encoding is not specified, the internal encoding will be used.

I actually ran into this issue because the encoding on my server is set to ISO 8859-1 (Latin-1) while I tried to load a document encoded in UTF-8:

Input (encoded in UTF-8): <p>große Wohnung in der Nähe der Innenstadt</p>

After running mb_convert_encoding(): <p>groÃe Wohnung in der NÃ¤he der Innenstadt</p>

What happened is that mb_convert_encoding() assumed the input was encoded in ISO 8859-1 (Latin-1), the internal encoding configured on the server, rather than in UTF-8, which it actually is. Converting on that false assumption produced some ugly "mojibake".
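The effect described above can be reproduced in isolation. The following is a minimal sketch, not the library's code: it assumes the mbstring extension is loaded and uses mb_internal_encoding() to mimic a server whose internal encoding is ISO 8859-1. The variable names are made up for illustration.

```php
<?php
// UTF-8 bytes written as escapes so the result does not depend on this file's own encoding.
$input = "<p>gro\xC3\x9Fe Wohnung in der N\xC3\xA4he der Innenstadt</p>";

$previous = mb_internal_encoding();
mb_internal_encoding('ISO-8859-1'); // assumption: mimics the server configuration from the report

// from_encoding omitted: the internal encoding (Latin-1) is assumed for the input.
$implicit = mb_convert_encoding($input, 'UTF-8');

// from_encoding given: the input is treated as the UTF-8 it really is.
$explicit = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

mb_internal_encoding($previous); // restore whatever was configured before

var_dump($implicit === $input); // bool(false)
var_dump($explicit === $input); // bool(true)

// "ä" is C3 A4 in UTF-8; read as two Latin-1 characters it is re-encoded as "Ã¤".
echo bin2hex(mb_convert_encoding("\xC3\xA4", 'UTF-8', 'ISO-8859-1')), "\n"; // c383c2a4
```

Each two-byte UTF-8 character is read as two separate Latin-1 characters and re-encoded, which is why "ä" turns into "Ã¤".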

I'm not sure if it should be the responsibility of this library to ensure that the right encoding of the input is used (maybe by using an additional option) or whether it's the developer's obligation to specify it...

honsberg commented 3 years ago

Have you tried $dom->setOptions((new PHPHtmlParser\Options())->setEnforceEncoding('UTF-8')); ?

ENT8R commented 3 years ago

Yes, I'm aware of this option, but the problem is actually a different one: the default internal encoding of my server is set to ISO 8859-1 (Latin-1), and I want to load a document which is encoded in UTF-8. As this library does not pass the third parameter of mb_convert_encoding(), the input is assumed to be in the server's default encoding (ISO 8859-1 (Latin-1) in this case), while it actually is UTF-8.

But, as I said above,

I'm not sure if it should be the responsibility of this library to ensure that the right encoding of the input is used (maybe by using an additional option) or whether it's the developer's obligation to specify it...
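Until that question is settled, one possible caller-side workaround is to switch the internal encoding temporarily to the document's real encoding, so that a call which omits from_encoding picks up the right value. This is only a sketch under assumptions: withInternalEncoding() is a hypothetical helper, not part of php-html-parser, and the closure is a stand-in for the library's own mb_convert_encoding() call. Whether this covers every code path in the library has not been verified.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run $fn with mbstring's internal encoding set to
 * $encoding, then restore the previous value even if $fn throws.
 */
function withInternalEncoding(string $encoding, callable $fn)
{
    $previous = mb_internal_encoding();
    mb_internal_encoding($encoding);
    try {
        return $fn();
    } finally {
        mb_internal_encoding($previous);
    }
}

// Assumption: mimic the server from the report, where Latin-1 is the internal encoding.
mb_internal_encoding('ISO-8859-1');

$html = "<p>N\xC3\xA4he</p>"; // UTF-8 bytes for "Nähe"

$converted = withInternalEncoding('UTF-8', function () use ($html) {
    // Stand-in for a conversion that does not pass from_encoding.
    return mb_convert_encoding($html, 'UTF-8');
});

var_dump($converted === $html);  // bool(true)
var_dump(mb_internal_encoding()); // string(10) "ISO-8859-1"
```

In real use, the closure would wrap the call that loads the document into the parser. The try/finally keeps the change local, so other code that relies on the server's default is not affected.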