UTF-8 Encoding not being respected

sylus commented 12 years ago

Hey @technosophos,

Before going into my issue, I just wanted to say I love your work on QueryPath!

As for the issue: do you have any advice on what I could be doing wrong, and why QueryPath seems to ignore the fact that a string is valid UTF-8?

<?php
// Parse the HTML using QueryPath
$qp_options = array(
  'convert_from_encoding' => 'UTF-8',
  'convert_to_encoding' => 'UTF-8',
  'strip_low_ascii' => FALSE,
);

// Taxonomy
$this->qp = htmlqp($dbRow->BreadCrumbHTML, NULL, $qp_options);
$taxonomy = $this->qp->top()->find('ul li:last')->text();

Where the content of $dbRow->BreadCrumbHTML is:

<ul><li style="display:inline;"><a href="/fr/index.html">Accueil</a></li> &gt; <li><a href="/fr/roads_trans/index.html">Routes et transports</a></li>  &gt; <li>Vélo</li></ul>

and the string I get returned for $taxonomy is:

"VÃ©lo"

If I don't use QueryPath and just get the whole text, the UTF-8 is maintained. I also checked in Xdebug (PHP 5.3.9) that mb_convert_encoding is being called; it works, and the string is still valid UTF-8 at that point. Do you have any sage advice on particular routes to debug this further?

sylus commented 12 years ago

I found a temporary fix: wrapping an HTML stub around $taxonomy using the following function. I'm curious, though: why does this make things work?

<?php
  protected function wrapHTML() {
    // We add surrounding <html> and <head> tags.
    $html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
    $html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
    $html .= $this->html;
    $html .= '</body></html>';
    $this->html = $html;
  }

technosophos commented 12 years ago

When you use htmlqp(), QueryPath (libxml2, actually) tries to repair anything that looks broken in an HTML document. Since you are passing a fragment of HTML, it repairs it by creating the <html><head/><body/></html> parts. My guess is that when libxml2 does this, it switches the character encoding to ISO-8859-1, which is its preferred character set.

When you wrap the fragment yourself, you keep that repair step from firing.

There are several ways of working around this, but the method you have discovered works just fine.
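
The effect described here can be observed without QueryPath, using PHP's DOMDocument directly (it is backed by libxml2). The following is a minimal sketch, not QueryPath's own code: it parses the same fragment twice, once bare and once wrapped in a document that declares its charset, and prints the bytes of the extracted text. The helper name `firstLiText` is made up for illustration, and what the bare parse yields depends on the libxml2 version installed, so no output is claimed for it.

```php
<?php
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

// Hypothetical helper: parse $html and return the text of the first <li>.
function firstLiText(string $html): string {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc->getElementsByTagName('li')->item(0)->textContent;
}

// "Vélo" written as explicit UTF-8 bytes so this file's own encoding does not matter.
$fragment = "<ul><li>V\xC3\xA9lo</li></ul>";

// Bare fragment: the parser gets no charset information.
$bare = firstLiText($fragment);

// Same fragment inside a document that declares UTF-8 in a <meta> tag.
$wrapped = firstLiText(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>'
    . $fragment
    . '</body></html>'
);

echo 'bare:    ', bin2hex($bare), "\n";
echo 'wrapped: ', bin2hex($wrapped), "\n";
```

The correct UTF-8 byte sequence for "Vélo" is 56c3a96c6f. If the two hex strings differ, the bare parse re-encoded the text on the way in.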

wyqbailey commented 12 years ago

I found a way that works fine with UTF-8:

mb_convert_encoding(htmlqp($path,"body")->find("h2")->text(),"ISO-8859-1","UTF-8");
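
Why this conversion helps can be shown with the string functions alone. The sketch below assumes the damage is exactly one extra ISO-8859-1 to UTF-8 pass, as guessed earlier in the thread; it first reproduces that damage and then reverses it with the same kind of call as above. The variable names are illustrative only.

```php
<?php
// "Vélo" as explicit UTF-8 bytes.
$original = "V\xC3\xA9lo";

// Simulate the damage: treat the UTF-8 bytes as ISO-8859-1 and encode them as UTF-8 again.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reverse it: convert from UTF-8 back to ISO-8859-1, which restores the original bytes.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($garbled), "\n";   // bytes of the double-encoded string
echo bin2hex($repaired), "\n";  // bytes after the repair
var_dump($repaired === $original);
```

Note that the repair is only safe on text that really was double-encoded. Applied to a string that came through correctly, the same call would corrupt it.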

Leksat commented 10 years ago

I found a workaround at http://php.net/manual/en/domdocument.loadhtml.php#95251 that works well for me:

  $doc = new DOMDocument();
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
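  // Copy the node list first: removing nodes while iterating the live list can skip items.
  foreach (iterator_to_array($doc->childNodes) as $item) {
    if ($item->nodeType == XML_PI_NODE) {
      $doc->removeChild($item);
    }
  }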
  $doc->encoding = 'UTF-8';

  $qp = qp($doc);

technosophos commented 10 years ago

Fascinating. That's something I should add to QueryPath. Do you need both the encoding declaration in loadHTML and the $doc->encoding at the end? Or does just the last one do the trick?

kjenney commented 9 years ago

This should definitely be something that's put in QueryPath. I've got a lot of UTF-8 encoded data that I'm working with that I have to perform the workaround on.

To answer your question: the encoding declaration is the only requirement here. I'm able to reproduce this against a lot of data points. $doc->encoding is not required whenever I use this.
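
A declaration-only version of the workaround could be packaged as below. This is a sketch under assumptions rather than anything QueryPath ships: the function name `loadUtf8Html` is hypothetical, it relies on the same `<?xml encoding="UTF-8">` prefix trick from the php.net comment, and whether the prefix is honored (and what kind of node it leaves behind) may vary with the libxml2 version. The child nodes are copied into an array before any removal, so the live list is not modified while it is being iterated.

```php
<?php
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

// Hypothetical helper: parse an HTML fragment that is known to be UTF-8.
function loadUtf8Html(string $html): DOMDocument {
    $doc = new DOMDocument();
    // The prefix is the only encoding hint given; $doc->encoding is not set.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Snapshot the top-level nodes, then drop any processing instruction
    // that the prefix left in the tree.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    return $doc;
}

$doc = loadUtf8Html("<ul><li>V\xC3\xA9lo</li></ul>");
$text = $doc->getElementsByTagName('li')->item(0)->textContent;

// Print the bytes so the result can be compared with 56c3a96c6f (UTF-8 "Vélo").
echo bin2hex($text), "\n";
```

The returned document can then be passed to qp($doc), as in Leksat's snippet.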

xiaotianhu commented 8 years ago

This really helps! It finally works with Chinese in UTF-8. Thanks for your work.

technosophos commented 8 years ago

Are people experiencing this problem using html5() instead of html()?

ghost commented 8 years ago

@technosophos html5() seems to work

header("Content-Type: text/plain;");
$kw = "Водка";
$html = "<div>Водка</div>";
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$doc->encoding = 'UTF-8';

$qp = htmlqp($html);
echo "html(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

$qp = htmlqp($doc);
echo "html() on DOMDocument: ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

$qp = html5qp($html);
echo "html5(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

technosophos / querypath

UTF-8 Encoding not being respected #94