new DOM(content) not returning when parsing some large files

ndrswlkr commented 1 year ago

i have a couple of html files that dom.js seems not to be able to parse. the program does not freeze, dos'nt throw any exceptoins but gets verry busy (100% cpu) and never returns. When i try the same html with the perl version of dom it parses instantly. if you are interessted in the html sample, i have put it here: https://github.com/ndrswlkr/large-html-file.git

kraih commented 1 year ago

Interesting, that has to be a bug in the HTML parser port. Because i've used very large HTML files to test the parser, and it's much faster than the Perl version.

kraih commented 1 year ago

Looks like this is a problem with broken HTML triggering an edge case in the V8 regex engine that makes it match very very slowly. Here's a test case to trigger it:

  t.test('Runaway "<"', t => {
    const dom = new DOM(`
    <table>
      <tr>
        <td>
          <div class="test" data-id="123" data-score="3">works</div>
          TEST 123<br />
          Test  12-34-5 test  >= 75% and < 85%  test<br />
          Test  12-34-5  -test foo >= 5% and < 30% test<br />
          Test  12-23-4 n/a >=13% and = 1% and < 5% test tset<br />
          Test  12-34-5  test >= 1% and < 5%   foo, bar, baz<br />
          Test foo, bar, baz  123-456-78  test < 1%  foo, bar, baz yada, foo, bar and baz, yada
        </td>
      </tr>
    </table>
    `);
    t.equal(dom.at('.test').text(), 'works');

    t.end();
  });

And it's this regex:

const ATTR_RE = new RegExp(`([^<>=\\s/]+|/)(?:\\s*=\\s*(?:(?<quote>["'])(.*?)\\k<quote>|([^>\\s]*)))?\\s*`, 'ys');
const TAG_RE = new RegExp(`<\\s*(\\/)?\\s*([^<>\\s]+)\\s*((?:${ATTR_RE.source})*?)>`, 'ys');

The Perl regex engine does require some recursion workaround here too, so i feared there could be similar problems with V8. And it seems you've uncovered one. It's probably the attribute matching going over the whole remainder of the file after a broken< in the text section.

kraih commented 1 year ago

It's not a perfect fix, but at least it covers this case. There will probably be others like it though.

kraih commented 1 year ago

Got a better solution now that covers all known backtracking cases. https://github.com/mojolicious/dom.js/commit/9c88522bc2a0b9acfd7846fc9be414f4bc34e520

mojolicious / dom.js

new DOM(content) not returning when parsing some large files #11