Open x-yuri opened 9 years ago
Whoa. That looks like a bug in libxml (or PHP's usage of it). Calling children() should not clone any nodes at all.
The problem is that a DOM element can't have a duplicate of an existing ID attribute. Something internally is cloning the ul with the ID attribute, and it's causing the failure you saw.
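For illustration, here is a minimal plain-DOM sketch of the situation described above, with no QueryPath involved. It assumes the warning comes from deep-copying an element whose id is already registered in the document. Whether anything is actually reported depends on the libxml2 build PHP is linked against, so no output is claimed. The markup and variable names are made up for the example.

```php
<?php
// Sketch only: clone an element that carries an id attribute.
$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li>a</li><li>b</li></ul></body></html>');

$ul = $doc->getElementsByTagName('ul')->item(0);

// Deep-copy the element. The copy carries the same id="u1" attribute,
// which is the point where an affected libxml2 build may complain
// about a duplicate ID.
$copy = $ul->cloneNode(true);

printf("original id: %s, clone id: %s\n", $ul->getAttribute('id'), $copy->getAttribute('id'));
```

If this prints a "ID u1 already defined" warning on one machine and nothing on another, the difference is more likely in libxml2 than in QueryPath.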
Here's the backtrace:
#0 QueryPath\DOMQuery->cloneAll() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3195]
#1 QueryPath\DOMQuery->__clone() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3151]
#2 QueryPath\DOMQuery->inst(SplObjectStorage Object (), , Array ([ignore_parser_warnings] => 1,[convert_to_encoding] => ISO-8859-1,[convert_from_encoding] => auto,[use_parser] => html,[parser_flags] => ,[omit_xml_declaration] => ,[replace_entities] => ,[exception_level] => 771,[escape_xhtml_js_css_sections] => /* \1 */)) called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:2075]
#3 QueryPath\DOMQuery->children() called at [/home/yuri/_/2/2.php:14]
And yet it clones. Moreover, on those two debian boxes, it executes cloneAll as well, but triggers no warnings.
UPD More precisely, it executes cloneNode on one element, but triggers no warnings. One possible explanation would be that php started displaying this warning.
Hmm. Yes, I see. I'll need to look at this. I'm having a hard time seeing how one version of PHP could choke on this code, while others are fine.
I seem to have found the culprit (https://git.gnome.org/browse/libxml2/commit/valid.c?id=a16eb968075a82ec33b2c1e77db8909a35b44620). The package was built for jessie before this commit happened. So this has to do with libxml after all.
Ah! Good find!
And the only workaround I can think of right now is removing id before doing anything else:
htmlqp('...', '#u1')->removeAttr('id')->children();
Unless the children have ids, that is :)
In which case the best I could think of is this:
function fix_children($el) {
    // Strip the id attribute from every element that has one.
    foreach ((new DOMXPath($el->document()))->query('.//*[@id]') as $_el) {
        $_el->removeAttribute('id');
    }
    return $el;
}
fix_children(htmlqp('...', '#u1')->removeAttr('id'))->children();
Unfortunately, that's probably what you'll have to do. I guess you could simply rename the attribute from id to something else. Only id is treated as special by the libxml library.
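A rough sketch of the renaming idea, using plain DOM so it runs without QueryPath. The helper name rename_id_attributes and the replacement attribute name data-orig-id are assumptions made up for this example; any attribute name other than id should do.

```php
<?php
// Hypothetical helper: move every id attribute to another attribute name.
function rename_id_attributes(DOMDocument $doc, $newName = 'data-orig-id') {
    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, so the loop does not depend
    // on the list while attributes are being changed.
    foreach (iterator_to_array($xpath->query('//*[@id]')) as $el) {
        $el->setAttribute($newName, $el->getAttribute('id'));
        $el->removeAttribute('id');
    }
    return $doc;
}

$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li id="a">x</li></ul></body></html>');
rename_id_attributes($doc);

// Show the resulting markup for the list.
echo $doc->saveHTML($doc->getElementsByTagName('ul')->item(0)), "\n";
```

Selectors then have to target the renamed attribute instead of #u1, which is the price of this approach.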
An even better workaround would probably be:
#!/usr/bin/env php
<?php
require 'vendor/autoload.php';

function set_error_handler_block($block, $error_handler) {
    $prv_error_handler = set_error_handler(function() use (&$prv_error_handler, $error_handler) {
        return call_user_func_array($error_handler, array_merge([$prv_error_handler], func_get_args()));
    });
    try {
        return call_user_func($block);
    } finally {
        restore_error_handler();
    }
}

set_error_handler_block(function() {
    $qp = htmlqp(
'<!doctype html>
<html>
<body>
<ul id="u1">
<li>
<li>
</ul>
</body>
</html>
', '#u1')->children();
}, function($prv_error_handler, $errno, $errstr, $errfile, $errline) {
    # printf("error: %u %s\n", $errno, $errstr);
    if ($errno == E_WARNING
        && preg_match('/^DOM.+?::.+?\(\): ID .*? already defined/', $errstr))
        return; # ignore error
    return $prv_error_handler
        ? call_user_func_array($prv_error_handler, array_slice(func_get_args(), 1))
        : FALSE;
});
One might need to tailor the regexp to one's needs, though.
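A quick way to check the regexp before relying on it is to run it against a few message strings. The sample messages below are assumptions written for this check, not captured output; substitute the exact text your PHP build prints.

```php
<?php
// Same pattern as in the error handler above.
$pattern = '/^DOM.+?::.+?\(\): ID .*? already defined/';

// Made-up sample messages (assumptions, not real log lines).
$samples = [
    'DOMNode::cloneNode(): ID u1 already defined',
    'DOMDocument::loadHTML(): Tag foo invalid in Entity, line: 1',
    'Undefined variable: x',
];

foreach ($samples as $msg) {
    // "ignore" means the handler would swallow the message.
    printf("%-7s %s\n", preg_match($pattern, $msg) ? 'ignore' : 'pass on', $msg);
}
```

Only the first sample should be swallowed; everything else falls through to the previous handler.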
Same problem here, also with $qp->top().
[edit] A correction was committed to libxml yesterday: https://bugzilla.gnome.org/show_bug.cgi?id=737840#c9
FWIW, if you use the HTML5 parser, you will not hit this error, since that uses a native PHP parser.
Hi, this happens to me too. How may we use the HTML5 parser?
UPDATE: Ok, sorted out
composer update querypath/QueryPath dev-master
$crawler = \QueryPath::withHTML5($raw);
Thank you Luca
libxml_use_internal_errors(true);
Suppressed the error message in my case.
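For anyone trying this, a minimal sketch of how the flag is usually wrapped, so that it only covers the DOM work and the previous setting is restored afterwards. It uses plain DOM rather than QueryPath, and the markup is made up for the example. Whether the duplicate-ID message lands in the buffer depends on the libxml2 build, so no output is claimed.

```php
<?php
// Buffer libxml errors instead of letting them surface as PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li>a<li>b</ul></body></html>');
$copy = $doc->getElementsByTagName('ul')->item(0)->cloneNode(true);

// Anything libxml reported while the flag was on is available here.
foreach (libxml_get_errors() as $error) {
    printf("libxml: %s (line %d)\n", trim($error->message), $error->line);
}

// Empty the buffer and put the old setting back.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Note that this hides every libxml error raised in between, not just the ID one, so it is worth at least logging what libxml_get_errors() returns.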
What am I doing wrong?
UPD Well, with php-5.4.10 and libxml-2.7.8 (debian squeeze) and php-5.6.7 and libxml-2.9.2 (debian jessie) it doesn't trigger warnings. The warning is supposedly triggered here. I can't find --without-valid in the debian directory of the source package, so it must have nothing to do with this particular configure option. What else to check?