tony-o / perl6-html-parser-xml

html -> xml::document converter

Parsing tables doesn't work quite right #6

Closed (hoelzro closed 9 years ago)

hoelzro commented 9 years ago

When parsing an HTML document that contains a table, many of the rows and cells are lost or misplaced. For example, when I try to print the table element in this document:

<html>
  <body>
    <table>
      <tr>
        <td>first</td>
      </tr>
      <tr>
        <td>second</td>
      </tr>
    </table>
  </body>
</html>

...I get this:

<table> <tr> </tr> <td> <span>first</span>  </td>  </table>

Here's a test script:

use HTML::Parser::XML;
use Test;

my $html = q:to/END_HTML/;
<html>
  <body>
    <table>
      <tr>
        <td>first</td>
      </tr>
      <tr>
        <td>second</td>
      </tr>
    </table>
  </body>
</html>
END_HTML

sub traverse($doc) {
    sub helper($node) {
        take $node;

        for $node.?nodes -> $child {
            helper($child);
        }
    }

    gather {
        helper($doc.?root // $doc);
    }
}

my $doc = HTML::Parser::XML.new.parse($html);
for traverse($doc) -> $node {
    if $node ~~ XML::Element {
        if $node.name eq 'table' {
            my %tag-count;
            for traverse($node) -> $subnode {
                my $name = $subnode.?name;
                %tag-count{$name}++ if $name;
            }
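            # traverse($node) yields the table itself plus all descendants
            is %tag-count<table>, 1, 'one table element';
            is %tag-count<tr>,    2, 'two tr elements';
            is %tag-count<td>,    2, 'two td elements';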
        }
    }
}

done-testing;

tony-o commented 9 years ago

It certainly isn't working. Taking a look, @hoelzro. Thank you for the bug report and example.

tony-o commented 9 years ago

I get the following output with the fix I just pushed to git (3ca62d76ef9b5d37a031e89d23a430a80b5f1390):

<table> <tr> <td>first</td> </tr> <tr> <td>second</td> </tr> </table>

hoelzro commented 9 years ago

Yup, this works for me!