Open ENT8R opened 3 years ago
Have you tried the following?
$dom->setOptions((new PHPHtmlParser\Options())->setEnforceEncoding('UTF-8'));
Yes, I'm aware of this option, but the problem is a different one.
The default internal encoding of my server is ISO 8859-1 (Latin-1), and I want to load a document that is encoded in UTF-8. Because this library does not pass the third parameter of mb_convert_encoding(), the input encoding falls back to the server's default (ISO 8859-1 in this case), when it should be UTF-8.
But, as I said above, I'm not sure whether it should be the responsibility of this library to ensure that the correct input encoding is used (perhaps via an additional option) or whether it is the developer's obligation to specify it...
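If it does turn out to be the developer's job, one possible workaround is to switch mbstring's internal encoding to the document's encoding only for the duration of the call. This is just a rough sketch: withInternalEncoding() is a hypothetical helper (not part of php-html-parser), and the mb_convert_encoding() call inside the closure merely stands in for whatever the library does internally.

```php
<?php
/**
 * Hypothetical helper: run $callback with mbstring's internal encoding
 * temporarily set to $encoding, then restore the previous value.
 */
function withInternalEncoding(string $encoding, callable $callback)
{
    $previous = mb_internal_encoding();   // remember the current setting
    mb_internal_encoding($encoding);
    try {
        return $callback();
    } finally {
        mb_internal_encoding($previous);  // always restore, even on exceptions
    }
}

// Simulate a server whose internal encoding is ISO-8859-1.
mb_internal_encoding('ISO-8859-1');

// UTF-8 input; the \u{...} escapes keep this file independent of its own encoding.
$html = "<p>gro\u{00DF}e Wohnung in der N\u{00E4}he der Innenstadt</p>";

$result = withInternalEncoding('UTF-8', function () use ($html) {
    // Stand-in for a library call that omits the third parameter.
    return mb_convert_encoding($html, 'UTF-8');
});

$restored = mb_internal_encoding();
```

With this, the conversion without a third parameter should leave the UTF-8 input untouched, and the server's default is back in place afterwards. Whether this is preferable to an option in the library itself is exactly the open question.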
The HTML cleaner converts the input to a specified encoding before any other operation: https://github.com/paquettg/php-html-parser/blob/4e01a438ad5961cc2d7427eb9798d213c8a12629/src/PHPHtmlParser/Dom/Cleaner.php#L128
The problem I see here is that, according to the documentation, mb_convert_encoding() uses the internal default encoding if the third parameter is omitted.
I actually ran into this issue because the encoding on my server is set to ISO 8859-1 (Latin-1) while I tried to load a document encoded in UTF-8:
Input (encoded in UTF-8):
<p>große Wohnung in der Nähe der Innenstadt</p>
After running mb_convert_encoding():
<p>groÃe Wohnung in der Nähe der Innenstadt</p>
What happened is that mb_convert_encoding() assumed the input was encoded not in UTF-8 (as it actually is) but in ISO 8859-1 (Latin-1), since that is the server's default, which produced some ugly "mojibake".
I'm not sure whether it should be the responsibility of this library to ensure that the correct input encoding is used (perhaps via an additional option) or whether it is the developer's obligation to specify it...
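For anyone who wants to reproduce this without touching php.ini, here is a minimal sketch. It assumes the mbstring extension is loaded and uses mb_internal_encoding() to simulate my server's default; the variable names are only for illustration.

```php
<?php
// UTF-8 input; the \u{...} escapes keep this file independent of its own encoding.
$input = "<p>gro\u{00DF}e Wohnung in der N\u{00E4}he der Innenstadt</p>";

// Simulate a server whose internal encoding is ISO-8859-1.
mb_internal_encoding('ISO-8859-1');

// No third parameter: the source encoding falls back to the internal
// encoding, so the UTF-8 bytes are misread as Latin-1 and re-encoded.
$broken = mb_convert_encoding($input, 'UTF-8');

// Explicit third parameter: the input is treated as UTF-8 and stays intact.
$intact = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

var_dump($broken === $input); // bool(false)
var_dump($intact === $input); // bool(true)
```

The only difference between the two calls is the third parameter, which is what the cleaner currently leaves out.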