technosophos / querypath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources.
http://querypath.org

PHP Warning: DOMNode::cloneNode(): ID u1 already defined in .../src/QueryPath/DOMQuery.php on line 3176 #168

Open x-yuri opened 9 years ago

x-yuri commented 9 years ago
#!/usr/bin/env php
<?php
require 'vendor/autoload.php';
$qp = htmlqp('
<!doctype html>
<html>
<body>
<ul id="u1">
    <li>
    <li>
</ul>
</body>
</html>
', '#u1')->children();
$ php --version
PHP 5.6.9 (cli) (built: May 15 2015 10:24:33) 
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
$ php -i
...
libxml Version => 2.9.2
...

What am I doing wrong?

UPD: With php-5.4.10 and libxml-2.7.8 (Debian squeeze), and with php-5.6.7 and libxml-2.9.2 (Debian jessie), it doesn't trigger warnings. The warning is supposedly triggered here. I can't find --without-valid in the debian directory of the source package, so it must have nothing to do with this particular configure option. What else should I check?

technosophos commented 9 years ago

Whoa. That looks like a bug in libxml (or PHP's usage of it). Calling children() should not clone any nodes at all.

The problem is that a DOM element can't have a duplicate of an existing ID attribute. Something internally is cloning the ul with the ID attribute, and it's causing the failure you saw.
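To make the cloning step concrete, here is a minimal sketch that uses only PHP's built-in DOM extension, without QueryPath. It deep-clones an element that carries an id, which is roughly what the description above says is happening internally. Whether a warning is actually emitted depends on the libxml build, so treat this as an illustration of the mechanism and not as a guaranteed reproduction.

```php
<?php
// Sketch only: plain DOM, no QueryPath. The markup is a simplified
// version of the one in the report.
$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><ul id="u1"><li>a</li><li>b</li></ul></body></html>');

// Locate the element via XPath, then deep-clone it.
$xpath = new DOMXPath($doc);
$ul = $xpath->query('//ul[@id="u1"]')->item(0);

// The copy carries the same id="u1" as the original. On affected
// libxml builds this is the point where the ID is registered twice.
$copy = $ul->cloneNode(true);

echo $copy->getAttribute('id'), "\n";
```

The clone is still usable either way; in the report above the problem shows up as a warning and not as a failure.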

x-yuri commented 9 years ago

Here's the backtrace:

#0  QueryPath\DOMQuery->cloneAll() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3195]
#1  QueryPath\DOMQuery->__clone() called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:3151]
#2  QueryPath\DOMQuery->inst(SplObjectStorage Object (), , Array ([ignore_parser_warnings] => 1,[convert_to_encoding] => ISO-8859-1,[convert_from_encoding] => auto,[use_parser] => html,[parser_flags] => ,[omit_xml_declaration] => ,[replace_entities] => ,[exception_level] => 771,[escape_xhtml_js_css_sections] => /* \1 */)) called at [/home/yuri/_/2/vendor/querypath/querypath/src/QueryPath/DOMQuery.php:2075]
#3  QueryPath\DOMQuery->children() called at [/home/yuri/_/2/2.php:14]

And yet it clones. Moreover, on those two debian boxes, it executes cloneAll as well, but triggers no warnings.

UPD: More precisely, it executes cloneNode on one element, but triggers no warnings. One possible explanation would be that PHP started displaying this warning.

technosophos commented 9 years ago

Hmm. Yes, I see. I'll need to look at this. I'm having a hard time seeing how one version of PHP could choke on this code, while others are fine.

x-yuri commented 9 years ago

I seem to have found the culprit: https://git.gnome.org/browse/libxml2/commit/valid.c?id=a16eb968075a82ec33b2c1e77db8909a35b44620. The package was built for jessie before this commit happened, so this has to do with libxml after all.

technosophos commented 9 years ago

Ah! Good find!

x-yuri commented 9 years ago

And the only workaround I can think of right now is removing id before doing anything else:

htmlqp('...', '#u1')->removeAttr('id')->children();

Unless the children have ids that is :)

x-yuri commented 9 years ago

In which case the best I could think of is this:

function fix_children($el) {
    foreach ((new DOMXPath($el->document()))->query('.//*[@id]') as $_el) {
        $_el->removeAttribute('id');
    }
    return $el;
}
fix_children(htmlqp('...', '#u1')->removeAttr('id'))->children();

technosophos commented 9 years ago

Unfortunately, that's probably what you'll have to do. I guess you could simply rename the attribute from id to something else. Only id is treated as special by the libxml library.
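A rough sketch of the renaming idea, using only the built-in DOM extension. The helper name rename_ids and the target attribute data-id are assumptions made for illustration; neither comes from QueryPath or from this thread.

```php
<?php
// Hypothetical helper: moves every id attribute to another attribute
// name so that libxml no longer treats the value as a document-wide ID.
function rename_ids(DOMDocument $doc, $to = 'data-id') {
    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so modifying attributes while
    // iterating over it is safe.
    foreach ($xpath->query('//*[@id]') as $el) {
        $el->setAttribute($to, $el->getAttribute('id'));
        $el->removeAttribute('id');
    }
    return $doc;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="u1"><li id="a">x</li><li>y</li></ul></body></html>');
rename_ids($doc);
echo $doc->saveHTML();
```

After the rename, a selector such as #u1 no longer matches, so anything that relied on it would have to use an attribute selector on the new name instead.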

x-yuri commented 8 years ago

An even better workaround would probably be:

#!/usr/bin/env php
<?php
require 'vendor/autoload.php';

function set_error_handler_block($block, $error_handler) {
    $prv_error_handler = set_error_handler(function() use (&$prv_error_handler, $error_handler) {
        return call_user_func_array($error_handler, array_merge([$prv_error_handler], func_get_args()));
    });
    try {
        return call_user_func($block);
    } finally {
        restore_error_handler();
    }
}

set_error_handler_block(function() {
    $qp = htmlqp(
        '<!doctype html>
        <html>
        <body>
        <ul id="u1">
            <li>
            <li>
        </ul>
        </body>
        </html>
        ', '#u1')->children();
}, function($prv_error_handler, $errno, $errstr, $errfile, $errline) {
    # printf("error: %u %s\n", $errno, $errstr);
    if ($errno == E_WARNING
    && preg_match('/^DOM.+?::.+?\(\): ID .*? already defined/', $errstr))
        return;   # ignore error
    return $prv_error_handler
        ? call_user_func_array($prv_error_handler, array_slice(func_get_args(), 1))
        : FALSE;
});

One might need to tailor the regexp to one's needs, though.

marcimat commented 8 years ago

Same problem here, also with $qp->top().

[edit] A fix was committed to libxml yesterday: https://bugzilla.gnome.org/show_bug.cgi?id=737840#c9

technosophos commented 8 years ago

FWIW, if you use the HTML5 parser, you will not hit this error, since that uses a native PHP parser.

muka commented 8 years ago

Hi, this happens to me too. How may we use the HTML5 parser?

UPDATE: Ok, sorted out

composer update querypath/QueryPath dev-master
$crawler = \QueryPath::withHTML5($raw);

Thank you Luca

logbon72 commented 8 years ago

libxml_use_internal_errors(true);

Suppressed the error message in my case.
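For context, a minimal sketch of how that switch is typically used with the built-in DOM extension: libxml errors are collected instead of being raised as PHP warnings, and can be inspected afterwards. This uses plain DOMDocument and sample markup of my own; whether it also silences the cloneNode warning discussed in this thread on a given setup is something to verify locally.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
// The return value is the previous setting, kept so it can be restored.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Sample markup with a deliberately duplicated id.
$doc->loadHTML('<html><body><ul id="u1"><li id="u1">x</li></ul></body></html>');

// Fetch whatever libxml recorded, then clear the buffer and restore
// the previous setting so other code is unaffected.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("libxml: %s (line %d)\n", trim($error->message), $error->line);
}
```

Note that this hides every libxml message while it is enabled, not only the duplicate-ID one, so it is worth logging the collected errors and not just discarding them.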