Closed by codecowboy 8 years ago
The libxml library that PHP uses for representing HTML DOMs uses ISO-8859-1 as its default encoding. Take a look at these tests, and see if that helps: https://github.com/technosophos/querypath/blob/master/test/Tests/QueryPath/DOMQueryTest.php#L155
You should be able to pass in the desired encoding (UTF-8?) and have that work.
It may also be worth your while to try out html5-php for the parser. (https://github.com/Masterminds/html5-php). Newer versions of QueryPath are moving to that parser instead of the built-in one, and it should do UTF-8 out of the box.
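If switching parsers is not an option, one common workaround is to make the markup pure ASCII before it reaches libxml, so the parser's default encoding no longer matters. The sketch below is only an illustration: it uses PHP's built-in DOMDocument and the mbstring extension rather than QueryPath, and the sample string is made up.

```php
<?php
// Sketch only: plain DOMDocument, not QueryPath. Assumes ext-dom and ext-mbstring.
$html = '<p>Projects start at €15,000</p>';

// Rewrite every non-ASCII character as a numeric entity (for example € becomes &#8364;),
// so the input is plain ASCII and the parser's default encoding is irrelevant.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($ascii);

// Strings read back from the DOM are UTF-8, so the euro sign survives the round trip.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The same idea should carry over to any libxml-based loader, since entities are decoded after the byte-level encoding has been decided.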
Thanks for the speedy reply. I've tried -
$qp = htmlqp($the_content, null, array('convert_to_encoding' => 'UTF-8'));
(also tried lowercase)
or $qp = htmlqp($the_content, null, array('encoding' => 'UTF-8'));
(also tried lowercase)
My use case is just to split an html string on a particular element so that I can render it in two separate chunks. Can this library do that? I want to just strip a couple of paragraphs off the end, store them in a variable and then output them later.
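For reference, here is a rough sketch of that kind of split using PHP's built-in DOMDocument instead of QueryPath. The function name `split_trailing_paragraphs` and its `$tailCount` parameter are hypothetical, the input is assumed to be ASCII-only (so encoding is not a concern here), and `$tailCount` is assumed to be at least 1.

```php
<?php
// Sketch only: split an HTML fragment into [everything before, last N paragraphs onward].
// split_trailing_paragraphs() and $tailCount are made-up names for illustration.
function split_trailing_paragraphs(string $html, int $tailCount): array
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);            // libxml wraps the fragment in <html><body>

    $body  = $doc->getElementsByTagName('body')->item(0);
    $nodes = iterator_to_array($body->childNodes);

    // Positions of the top-level <p> elements among the body's children.
    $paragraphs = array_keys(array_filter($nodes, fn ($n) => $n->nodeName === 'p'));

    // Index of the first paragraph that belongs to the tail
    // (falls back to 0 when there are fewer paragraphs than requested).
    $splitAt = $paragraphs[count($paragraphs) - $tailCount] ?? 0;

    $head = '';
    $tail = '';
    foreach ($nodes as $i => $node) {
        // saveHTML($node) serialises just that node, without doctype or <html>.
        $chunk = $doc->saveHTML($node);
        if ($i < $splitAt) {
            $head .= $chunk;
        } else {
            $tail .= $chunk;
        }
    }
    return [$head, $tail];
}

[$head, $tail] = split_trailing_paragraphs(
    '<h2>Title</h2><p>One</p><p>Two</p><p>Three</p>',
    2
);
```

Here `$head` would hold the heading and the first paragraph, and `$tail` the last two paragraphs, ready to be echoed later.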
You may want 'convert_from_encoding', which tells the parser what the encoding of the source document is. https://github.com/technosophos/querypath/blob/master/src/QueryPath/DOMQuery.php#L3717
But still, something does not seem right.
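To illustrate why the declared source encoding matters, here is a small sketch using `mb_convert_encoding` directly. It shows the general idea only and is not QueryPath's internal code; the sample string is made up.

```php
<?php
// Illustration of declaring a source encoding; assumes ext-mbstring.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1: é is the single byte 0xE9

// Converting with the correct source encoding yields valid UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Declaring ISO-8859-1 for text that is already UTF-8 double-encodes it,
// which is the typical cause of garbled accents and currency symbols.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($latin1, 'UTF-8')); // the raw Latin-1 bytes are not valid UTF-8
var_dump(bin2hex($utf8), bin2hex($double));
```

So if the input is already UTF-8 (as WordPress content usually is), the source encoding should be declared as UTF-8 rather than left to a Latin-1 default.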
Having viewed the source of the page, I think it might be because writeHTML() is actually writing a whole new HTML document. What I'm inputting is just an HTML string (a WordPress post), so I would want to omit any <html> tags or doctype declaration in the output.
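The difference between serialising a whole document and serialising individual nodes can be seen with PHP's built-in DOMDocument. This is a sketch under the assumption that plain DOMDocument behaves like the libxml-backed writer here; it does not use QueryPath, and the sample markup is made up.

```php
<?php
// Sketch only: whole-document output versus per-node output with DOMDocument.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<p>First</p><p>Second</p>');

// Serialising the document includes the doctype and the <html>/<body>
// wrapper that the parser added around the fragment.
$whole = $doc->saveHTML();

// Serialising each child of <body> yields only the original fragment.
$fragment = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $fragment .= $doc->saveHTML($node);
}

echo $fragment, "\n";
```

Whitespace between the serialised nodes may vary with the libxml version, so compare the result loosely rather than byte for byte.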
If you're running against master, will you test writeHTML5()? That uses my new serializer, and it might work better, since it is supposed to be UTF-8 all the way.
You could also try loading via html5qp().
Thanks! This seemed to work:
$the_content = get_the_content();
$qp = html5qp( $the_content, null, array('convert_from_encoding' => 'utf-8'));
$output = $qp->find('p')->addClass('testing');
$output->writeHTML5();
However, is there a way to prevent it wrapping everything in `<!DOCTYPE html>`?
IIRC, you can do:
html5qp(\QueryPath::HTML5_STUB, 'body')->append($the_content);
You can also use HTML5-PHP directly, and use \HTML5::loadHTMLFragment()
https://github.com/Masterminds/html5-php/blob/master/src/HTML5.php#L133
There's an example here: https://github.com/Masterminds/html5-php/blob/master/test/HTML5/Html5Test.php#L98
From there, you can just pass the DOM into QueryPath:
$html5 = new HTML5();
$dom = $html5->loadHTMLFragment($the_content);
$qp = html5qp($dom);
$output = $qp->find('p')->addClass('testing');
$output->writeHTML5();
Thanks! Haven't had a chance to try it yet but sounds like it will do the job.
How do I preserve encoding? The supported options seem to suggest that no encoding is done without explicitly specifying it in the options array:
<p>The exact price ...projects usually start at €15,000 + VAT@19%**</p>
becomes