Open sylus opened 12 years ago
I found a temporary fix by using the following function and wrapping an HTML stub around $taxonomy... Curious why this makes things work?
<?php
protected function wrapHTML() {
  // We add surrounding <html> and <head> tags.
  $html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
  $html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
  $html .= $this->html;
  $html .= '</body></html>';
  $this->html = $html;
}
When you use htmlqp(), QueryPath (libxml, actually) tries to repair anything that looks broken in an HTML document. Since you are passing a fragment of HTML, it tries to repair it by creating the <html><head/><body/></html> parts. My guess is that when libxml2 does this, it changes the character encoding to ISO-8859-1, which is its preferred character set.
When you wrap it, you keep that fixing stuff from firing.
There are several ways of working around this, but the method you have discovered works just fine.
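To make the explanation above concrete, here is a minimal sketch that uses plain DOMDocument (which QueryPath builds on) rather than QueryPath itself. The helpers wrap_fragment() and first_div_text() are hypothetical names introduced only for this illustration, and the file is assumed to be saved as UTF-8. What the bare fragment comes back as may depend on the libxml2 version, so no output is shown for it.

```php
<?php
// A UTF-8 encoded fragment with no encoding declaration of its own.
$fragment = '<div>Водка</div>';

// Hypothetical helper mirroring the wrapHTML() idea: surround the fragment
// with a document skeleton that declares UTF-8 in a <meta> tag.
function wrap_fragment(string $fragment): string
{
    return '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '</head><body>' . $fragment . '</body></html>';
}

// Hypothetical helper: parse the markup and return the text of the first <div>.
function first_div_text(string $html): string
{
    $doc = new DOMDocument();
    // Suppress libxml markup warnings; only the decoded text matters here.
    @$doc->loadHTML($html);
    return $doc->getElementsByTagName('div')->item(0)->textContent;
}

$bare = first_div_text($fragment);
$wrapped = first_div_text(wrap_fragment($fragment));

echo 'bare:    ', $bare, "\n";
echo 'wrapped: ', $wrapped, "\n";
```

With the wrapper in place the parser finds the charset declaration before it reaches any non-ASCII bytes, so it has no reason to guess.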
I found a way that works fine with UTF-8:
mb_convert_encoding(htmlqp($path,"body")->find("h2")->text(),"ISO-8859-1","UTF-8");
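Why converting "to ISO-8859-1 from UTF-8" helps can be shown without QueryPath at all. The sketch below assumes the damage is the usual one, where each UTF-8 byte was read as an ISO-8859-1 character and the result was re-encoded as UTF-8. The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```php
<?php
$original = 'Водка'; // UTF-8 source string

// Simulate the damage: treat each UTF-8 byte as an ISO-8859-1 character
// and re-encode the result as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The workaround: convert "back" to ISO-8859-1, which restores the original bytes.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($garbled === $original);  // bool(false)
var_dump($repaired === $original); // bool(true)
```

The round trip is lossless because every character in the garbled string lies in the range U+0000 to U+00FF, which ISO-8859-1 maps one-to-one onto single bytes.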
I found a workaround at http://php.net/manual/en/domdocument.loadhtml.php#95251. It works well for me.
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
foreach ($doc->childNodes as $item) {
  if ($item->nodeType == XML_PI_NODE) {
    $doc->removeChild($item);
  }
}
$doc->encoding = 'UTF-8';
$qp = qp($doc);
Fascinating. That's something I should add to QueryPath. Do you need both the encoding declaration in loadHTML and the $doc->encoding at the end? Or does just the last one do the trick?
This should definitely be something that's put in QueryPath. I've got a lot of UTF-8 encoded data that I'm working with that I have to perform the workaround on.
To answer your question: The encoding declaration is the only requirement here. I'm able to reproduce this against a lot of datapoints. $doc->encoding is not required whenever I use this.
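A reduced version of the workaround, keeping only the encoding declaration as described above, might look like the sketch below. The helper name load_utf8_fragment() is hypothetical, the file is assumed to be saved as UTF-8, and whether the prefix is honoured may depend on the libxml2 version PHP is built against, so no expected output is shown. The child list is copied before removal so that deleting a node does not disturb the iteration.

```php
<?php
// Hypothetical helper: load a UTF-8 fragment using only the encoding prefix.
function load_utf8_fragment(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // The prefix hints the encoding to libxml; $doc->encoding is not set afterwards.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Copy the child list first so removal does not disturb the iteration.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    return $doc;
}

$doc = load_utf8_fragment('<div>Водка</div>');
$text = $doc->getElementsByTagName('div')->item(0)->textContent;
echo $text, "\n";
```

The resulting $doc can then be handed to qp($doc) as in the original workaround.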
This really helps! It finally works with Chinese in UTF-8. Thanks for your work.
Are people experiencing this problem using html5() instead of html()?
@technosophos html5() seems to work:
header("Content-Type: text/plain;");
$kw = "Водка";
$html = "<div>Водка</div>";
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$doc->encoding = 'UTF-8';
$qp = htmlqp($html);
echo "html(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");
$qp = htmlqp($doc);
echo "html() on DOMDocument: ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");
$qp = html5qp($html);
echo "html5(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");
Hey @technosophos,
Before going into my issue just wanted to say I love your work on QueryPath!
As for the issue, I was wondering if you would have any advice on what I could be doing wrong and why QueryPath seems to be ignoring the fact that a string is valid UTF-8.
Where the content of $dbRow->BreadCrumbHTML is:
and the string I get returned for $taxonomy is:
If I don't use QueryPath and just get the whole text, the UTF-8 is maintained. I also checked in xdebug (PHP 5.3.9) that mb_convert_encoding is being called, and it does work and maintains the UTF-8 encoding at that point. Would you have any sage advice on particular routes to debug this further?