Not encoding UTF-8 correctly

shtse8 commented 7 years ago

Code to reproduce:

use \Wa72\HtmlPageDom\HtmlPageCrawler;
$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
<body>
网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
</body>
</html>
EOF;
$document = new HtmlPageCrawler($html);
echo $document->saveHTML();

Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8"><title>&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;</title></head><body>
&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;
</body></html>

Expected Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
<body>
网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
</body></html>

This is a known issue with PHP's DOMDocument. Reference: http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly

shtse8 commented 7 years ago

I tried DomCrawler, and it handles UTF-8 without any problem.

Code:

use Symfony\Component\DomCrawler\Crawler;
$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
<body>
网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
</body>
</html>
EOF;
$crawler = new Crawler($html);
echo $crawler->html();

Result:

<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
</head>
<body>
网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
</body>

It looks like they have already fixed this bug: https://github.com/symfony/dom-crawler/pull/4

shtse8 commented 7 years ago

I am trying to fix this problem. https://github.com/wasinger/htmlpagedom/pull/19

kukungkung commented 6 years ago

I think this may help: curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');

yellow1912 commented 5 years ago

Any update on this issue? Do we still have a problem with UTF-8? If so, it is a huge problem, since most sites use UTF-8 anyway.

havran commented 5 years ago

Same problem here, in v1.3.

glensc commented 5 years ago

I just decode entities after the save:

        $html = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

glensc commented 5 years ago

It seems the underlying problem is that symfony/dom-crawler switches to entities to avoid some other bugs:

https://github.com/symfony/dom-crawler/blob/v4.2.4/Crawler.php#L195
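To illustrate what "switching to entities" means, here is a minimal sketch that converts non-ASCII characters to numeric entities before handing the markup to DOMDocument. The helper name toNumericEntities is hypothetical, and mb_encode_numericentity is used here as a stand-in; it is not necessarily the exact call symfony/dom-crawler makes at the linked line. It assumes the mbstring and dom extensions are available.

```php
<?php
// Hypothetical helper, not part of symfony/dom-crawler or HtmlPageDom.
function toNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    // Everything from U+0080 upward becomes a decimal entity; ASCII is untouched.
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$html  = '<!DOCTYPE html><html><head><title>网友</title></head><body>网友</body></html>';
$ascii = toNumericEntities($html);

// The parser now only sees ASCII, so it cannot misread the input encoding.
$doc = new DOMDocument();
$doc->loadHTML($ascii);

// Inside the DOM the text is real characters again; whether entities
// show up in the output depends on how the document is serialized.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;
```

This sidesteps encoding detection on load, which is presumably why the entity form can surface again when the whole document is saved.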

havran commented 5 years ago

I just decode entities after the save:
        $html = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

I don't think this is a good idea, because it decodes all entities in the HTML (for example, I have a larger document in which &gt; or &nbsp; may be used).
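One possible middle ground, sketched below, is to decode only numeric entities for non-ASCII code points and leave named entities alone. The helper decodeNonAsciiEntities is hypothetical and not part of HtmlPageDom; it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: decode only numeric entities for code points >= U+0080.
function decodeNonAsciiEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_decode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$saved = '<p>1 &gt; 0 &amp;&amp; &#32593;&#21451;</p>';

// Decodes everything except quotes, including &gt; and &amp;.
$all = html_entity_decode($saved, ENT_NOQUOTES, 'UTF-8');

// Decodes only the numeric entities for the Chinese characters.
$selective = decodeNonAsciiEntities($saved);

echo $all, PHP_EOL;       // <p>1 > 0 && 网友</p>
echo $selective, PHP_EOL; // <p>1 &gt; 0 &amp;&amp; 网友</p>
```

Named entities are left untouched by this approach, so any named entity the serializer emits for a non-ASCII character would stay encoded.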

I found a solution that works for me. On this line:

https://github.com/wasinger/htmlpagedom/blob/563bc7a399b473631b644cc35e63202fd61987ac/src/HtmlPageCrawler.php#L887

I changed the code from:

return $this->getDOMDocument()->saveHTML();

to:

return $this->getDOMDocument()->saveHTML($this->getDOMDocument()->documentElement);

Based on https://stackoverflow.com/a/20675396

havran commented 5 years ago

But for some reason my solution removes the DOCTYPE in this test script:

<?php

$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#">
  <head>
    <meta charset="UTF-8">
    <title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
  </head>
  <body>
    网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
  </body>
</html>
EOF;

use \Wa72\HtmlPageDom\HtmlPageCrawler;
$document = new HtmlPageCrawler($html);
echo "--- HtmlPageDom -----------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $document->saveHTML();
echo PHP_EOL;

use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
echo "--- DomCrawler ------------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $crawler->html();

This is the output:

vagrant@d8:/data/uniweb/uniweb-cms/cms3[feature/UCMS-313-content-migration-blogs *]$ drush @lv scr test.php
--- HtmlPageDom -----------------------------------------------------------

<html prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
</head>
<body>
    网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
  </body>
</html>
--- DomCrawler ------------------------------------------------------------

<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！</title>
</head>
<body>
    网友终于肉搜出「范冰冰」家族照片，没想到看见她奶奶才发现「范冰冰是全家最难看的」！
  </body>
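
The missing DOCTYPE is probably explained by the fact that saveHTML() with a node argument serializes only that node, and the doctype is a sibling of the root element rather than a child of it. A minimal sketch of one way to keep it is to serialize every top-level node in turn. The helper saveHtmlWithDoctype is hypothetical and not part of HtmlPageDom, and whether non-ASCII characters come out raw may depend on the PHP and libxml2 versions in use.

```php
<?php
// Hypothetical helper, not part of HtmlPageDom.
function saveHtmlWithDoctype(DOMDocument $doc): string
{
    $out = '';
    // Top-level children normally include the doctype node and the <html> element.
    foreach ($doc->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

$doc = new DOMDocument();
// The input uses numeric entities so that loading does not depend on encoding detection.
$doc->loadHTML('<!DOCTYPE html><html><head><title>&#32593;&#21451;</title></head><body></body></html>');

$rootOnly = $doc->saveHTML($doc->documentElement); // no doctype
$result   = saveHtmlWithDoctype($doc);             // doctype followed by <html>...

echo $result;
```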

glensc commented 5 years ago

html_entity_decode is perfectly valid. If you want the output to be &, it should be in the source document as &amp;.
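
As a small illustration of that point, html_entity_decode removes exactly one level of encoding per call. The sample strings below are made up for the example.

```php
<?php
// One level of encoding: the entity becomes a bare ampersand.
$once = html_entity_decode('Tom &amp; Jerry', ENT_NOQUOTES, 'UTF-8');

// Two levels of encoding: one level survives the decode.
$twice = html_entity_decode('Tom &amp;amp; Jerry', ENT_NOQUOTES, 'UTF-8');

echo $once, PHP_EOL;  // Tom & Jerry
echo $twice, PHP_EOL; // Tom &amp; Jerry
```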

havran commented 5 years ago

I made a small test script to compare various DOM parsers: https://github.com/havran/php-html-parsers-test

The old simplehtmldom still seems to be the best :-).

wasinger / htmlpagedom

Not encoding UTF-8 correctly #18