Closed ndrswlkr closed 1 year ago
Interesting, that has to be a bug in the HTML parser port. Because i've used very large HTML files to test the parser, and it's much faster than the Perl version.
Looks like this is a problem with broken HTML triggering an edge case in the V8 regex engine that makes it match very very slowly. Here's a test case to trigger it:
t.test('Runaway "<"', t => {
const dom = new DOM(`
<table>
<tr>
<td>
<div class="test" data-id="123" data-score="3">works</div>
TEST 123<br />
Test 12-34-5 test >= 75% and < 85% test<br />
Test 12-34-5 -test foo >= 5% and < 30% test<br />
Test 12-23-4 n/a >=13% and = 1% and < 5% test tset<br />
Test 12-34-5 test >= 1% and < 5% foo, bar, baz<br />
Test foo, bar, baz 123-456-78 test < 1% foo, bar, baz yada, foo, bar and baz, yada
</td>
</tr>
</table>
`);
t.equal(dom.at('.test').text(), 'works');
t.end();
});
And it's this regex:
const ATTR_RE = new RegExp(`([^<>=\\s/]+|/)(?:\\s*=\\s*(?:(?<quote>["'])(.*?)\\k<quote>|([^>\\s]*)))?\\s*`, 'ys');
const TAG_RE = new RegExp(`<\\s*(\\/)?\\s*([^<>\\s]+)\\s*((?:${ATTR_RE.source})*?)>`, 'ys');
The Perl regex engine does require some recursion workaround here too, so i feared there could be similar problems with V8. And it seems you've uncovered one. It's probably the attribute matching going over the whole remainder of the file after a broken<
in the text section.
It's not a perfect fix, but at least it covers this case. There will probably be others like it though.
Got a better solution now that covers all known backtracking cases. https://github.com/mojolicious/dom.js/commit/9c88522bc2a0b9acfd7846fc9be414f4bc34e520
i have a couple of html files that dom.js seems not to be able to parse. the program does not freeze, dos'nt throw any exceptoins but gets verry busy (100% cpu) and never returns. When i try the same html with the perl version of dom it parses instantly. if you are interessted in the html sample, i have put it here: https://github.com/ndrswlkr/large-html-file.git