ressio / pharse

Fastest PHP HTML Parser

Incorrect parsing for children of children #24

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The following functions are meant to return all children of an element in the 
DOM object. However, when there is text between the nested tags, a child is 
sometimes missed. For example, with <div>Hello<span>world</span></div> the span 
data is missing from the result. Also, the ability to dump the DOM into a JSON 
object, as the code below does, would be a nice feature.

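function get_all_children($el) {
    $row = array(
        'name' => $el->getTag(),
        'raw' => $el->getInnerText()
    );
    // Append every child's rows; plain assignment kept only the last child.
    for ($i = 0; $i < $el->childCount(); $i++) {
        foreach (get_all_children($el->getChild($i)) as $child_row) {
            $row['children'][] = $child_row;
        }
    }
    // Key attributes by name so all of them are kept, not just the last.
    foreach ($el->attributes as $attr => $value) {
        $row['attribs'][$attr] = $value;
    }
    // Still returns a one-element list of rows, as before.
    return array($row);
}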

function get_dom_array($html, $selector) {
    $output = array();
    foreach ($html($selector) as $el) {
        $row = array(
            'name' => $el->getTag(),
            'raw' => $el->getInnerText()
        );
        // Append every child's rows; plain assignment kept only the last child.
        for ($i = 0; $i < $el->childCount(); $i++) {
            foreach (get_all_children($el->getChild($i)) as $child_row) {
                $row['children'][] = $child_row;
            }
        }
        // Key attributes by name so all of them are kept, not just the last.
        foreach ($el->attributes as $attr => $value) {
            $row['attribs'][$attr] = $value;
        }
        $output[] = $row;
    }
    return $output;
}

$html = str_get_dom('<html><body><div>Hello World</div></body></html>');
$dom_array = get_dom_array($html, 'div');
echo json_encode($dom_array);
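
For comparison, here is a minimal sketch of the structure one would expect for the mixed-content case from the description (text followed by a nested tag). It does not use this library at all: it swaps in PHP's built-in DOMDocument so the snippet runs standalone, and node_to_array is a hypothetical helper name introduced only for illustration. Note that 'raw' here holds the node's text content, which may differ from what getInnerText() returns.

```php
<?php
// Hypothetical helper (not part of this library): converts a DOMNode into
// the same name/raw/children/attribs shape used in the report above.
function node_to_array(DOMNode $node) {
    $row = array(
        'name' => $node->nodeName,
        'raw'  => $node->textContent
    );
    // Only element nodes carry attributes; text nodes skip this branch.
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attr) {
            $row['attribs'][$attr->nodeName] = $attr->nodeValue;
        }
    }
    // Text nodes are kept as children too, so "Hello" sits next to <span>.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $row['children'][] = node_to_array($child);
        }
    }
    return $row;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div>Hello<span>world</span></div></body></html>');
$div = $doc->getElementsByTagName('div')->item(0);
$result = node_to_array($div);
echo json_encode($result);
```

In this sketch the div ends up with two children, a text node for "Hello" and the span element, which is the shape the functions above would need to produce for the span not to count as missed.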

Original issue reported on code.google.com by sjwood...@gmail.com on 18 Oct 2012 at 1:47

GoogleCodeExporter commented 9 years ago
After re-reading Issue #23, I think this may be related, though I am not 100% sure.

Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:56

GoogleCodeExporter commented 9 years ago
In your example code, what would be the expected output? I believe span should 
be missing as it's not in your input string?

Original comment by niels....@gmail.com on 19 Oct 2012 at 5:32

GoogleCodeExporter commented 9 years ago
My apologies... After more testing, all children are being handled correctly. 
The error came from some really messy HTML input. Thanks for the script!

Original comment by sjwood...@gmail.com on 19 Oct 2012 at 5:59

GoogleCodeExporter commented 9 years ago

Original comment by niels....@gmail.com on 19 Oct 2012 at 6:27