technosophos / querypath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources.
http://querypath.org

htmlqp() converts euro signs and hyphens to question marks. #163

codecowboy closed this issue 8 years ago

codecowboy commented 9 years ago

How do I preserve encoding? The supported options seem to suggest that no encoding conversion is done unless one is explicitly specified in the options array:

$the_content = get_the_content();
$qp = htmlqp($the_content, array());
$output = $qp->find('p')->addclass('testing');
$output->writeHTML();

<p>The exact price ...projects usually start at €15,000 + VAT@19%**</p>

becomes

<p class="testing">The exact price ...projects usually start at ?15,000 + VAT@19%**</p>

technosophos commented 9 years ago

The libxml library that PHP uses for representing HTML DOMs uses ISO-8859-1 as its default encoding. Take a look at these tests, and see if that helps: https://github.com/technosophos/querypath/blob/master/test/Tests/QueryPath/DOMQueryTest.php#L155

You should be able to pass in the desired encoding (UTF-8?) and have that work.
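To illustrate the symptom: a minimal sketch, assuming the input string is UTF-8 and that the conversion to ISO-8859-1 goes through PHP's mbstring extension (how QueryPath performs the conversion internally is not confirmed here). Neither the euro sign nor an en dash has a code point in ISO-8859-1, so with default settings mbstring substitutes a question mark for each.

```php
<?php
// Requires the mbstring extension and PHP 7+ (for the \u{...} escape).
// The sample text is made up for illustration.
$utf8 = "€15,000 \u{2013} VAT";   // euro sign (U+20AC) and en dash (U+2013)

// ISO-8859-1 cannot represent either character, so mbstring falls back
// to its substitute character, which is "?" unless reconfigured.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo $latin1, PHP_EOL;
```

If the output shows question marks where the original had a euro sign or dash, the text has most likely been squeezed through a single-byte encoding somewhere between input and output.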

It may also be worth your while to try out html5-php for the parser. (https://github.com/Masterminds/html5-php). Newer versions of QueryPath are moving to that parser instead of the built-in one, and it should do UTF-8 out of the box.

codecowboy commented 9 years ago

Thanks for the speedy reply. I've tried -

$qp = htmlqp($the_content, null, array('convert_to_encoding' => 'UTF-8')); (also tried lowercase)

or $qp = htmlqp($the_content, null, array('encoding' => 'UTF-8')); (also tried lowercase)

My use case is just to split an HTML string on a particular element so that I can render it in two separate chunks. Can this library do that? I want to just strip a couple of paragraphs off the end, store them in a variable and then output them later.
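For illustration, the splitting described above can be sketched with PHP's bundled DOM extension rather than QueryPath (a swapped-in technique, not the library's own API). The sample markup, the wrapper `<div>`, and the variable names are assumptions; the text is kept ASCII-only so the encoding question does not interfere.

```php
<?php
// Hypothetical input: a fragment with four paragraphs.
$html = '<p>One</p><p>Two</p><p>Three</p><p>Four</p>';

$doc = new DOMDocument();
// Wrap the fragment in a single root element; the flags ask libxml not to
// add an implied <html>/<body> wrapper or a default doctype.
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Snapshot the live node list, then take the last two paragraphs.
$paragraphs = iterator_to_array($doc->getElementsByTagName('p'));
$tail = array_slice($paragraphs, -2);

$tailHtml = '';
foreach ($tail as $p) {
    // trim() drops any newline the serializer may append after a block element.
    $tailHtml .= trim($doc->saveHTML($p));
    $p->parentNode->removeChild($p);
}

// Whatever is left inside the wrapper is the first chunk.
$headHtml = '';
foreach ($doc->documentElement->childNodes as $child) {
    $headHtml .= trim($doc->saveHTML($child));
}

echo $headHtml, PHP_EOL, $tailHtml, PHP_EOL;
```

Serializing individual nodes instead of the whole document also keeps the doctype and wrapper elements out of both chunks.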

technosophos commented 9 years ago

You may want 'convert_from_encoding', which tells the parser what the encoding of the source document is. https://github.com/technosophos/querypath/blob/master/src/QueryPath/DOMQuery.php#L3717

But still, something does not seem right.

codecowboy commented 9 years ago

Having viewed the source of the page, I think it might be because writeHTML() is actually writing a whole new html document. What I'm inputting is just an HTML string (a WordPress post) so I would want to omit any <html> tags or doctype declaration in the output.
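The wrapping behaviour can be reproduced with PHP's plain DOMDocument class, shown here as a sketch rather than as a description of what writeHTML() does internally. Loading a bare fragment and serializing the whole document yields a full HTML page, while serializing a single node yields only that node.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Hello</p>');   // a bare fragment: no <html>, no doctype

// Serializing the document adds the wrapper elements (and typically a doctype).
$full = $doc->saveHTML();

// Serializing one node returns just that node's markup.
$inner = trim($doc->saveHTML($doc->getElementsByTagName('p')->item(0)));

echo $full, PHP_EOL, $inner, PHP_EOL;
```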

technosophos commented 9 years ago

If you're running against Master, will you test writeHTML5()? That uses my new serializer, and it might work better, since it is supposed to be UTF-8 all the way.

You could also try loading via html5qp().

codecowboy commented 9 years ago

Thanks! This seemed to work:

$the_content = get_the_content();
$qp = html5qp( $the_content,  null, array('convert_from_encoding' => 'utf-8'));
$output = $qp->find('p')->addclass('testing');
$output->writeHTML5();

However, is there a way to prevent it wrapping everything in `<!DOCTYPE html>`?

technosophos commented 9 years ago

IIRC, you can do:

html5qp(\QueryPath::HTML5_STUB, 'body')->append($the_content);

You can also use HTML5-PHP directly, and use \HTML5::loadHTMLFragment() https://github.com/Masterminds/html5-php/blob/master/src/HTML5.php#L133

There's an example here: https://github.com/Masterminds/html5-php/blob/master/test/HTML5/Html5Test.php#L98

From there, you can just pass the DOM into QueryPath:

use Masterminds\HTML5; // needed on html5-php 2.x, where the class is namespaced; omit on 1.x

$html5 = new HTML5();
$dom = $html5->loadHTMLFragment($the_content);
$qp = html5qp($dom);
$output = $qp->find('p')->addclass('testing');
$output->writeHTML5();

codecowboy commented 9 years ago

Thanks! Haven't had a chance to try it yet but sounds like it will do the job.