ndrswlkr closed 1 year ago
I bet it's more cases involving .*?; it seems the V8 regex engine really doesn't like those.
Nope, it's the tag regex again. Seems we need larger changes to make it work. It can be replicated with this script:
import DOM from './lib/dom.js';
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
<title>Test</title>
<meta property="og:description" content="test test test test test test test test test test test test test test test
test test test test test test test test test test test test test test testtest test test test test test test test
test test test test test test test test test test test 'test test test testtest test test test test test test test
test test test test test test test test test test test 'test test test testtest test test test test test test test
test test test test test test test test test test test test test test testtest test test test test test test test
test test test test "test test test test test test" test test test test testtest test test test test test test test
test test test test test test 'test test test test test test test test testtest test test test test test test test
test test test test test test test test test test test test test test testtest test test test test test test test
test test test test test test test test test test test test test test testtest test test test test test test test
test test" />
</head>
</html>
`;
const dom = new DOM(html);
My guess is that the tag regex matching needs to be split up into multiple smaller regexes.
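To make that guess concrete, here is a rough sketch of what "multiple smaller regexes" could look like. It is not the dom.js tokenizer: every name and pattern below (TAG_OPEN, ATTR_VALUE, parseStartTag, and so on) is made up for illustration, and it only handles a single start tag. The idea is that each sticky (y flag) regex matches one small piece at a known position, so no single pattern has to cover a whole tag with all of its attributes.

```javascript
// Hypothetical sketch only: these names and patterns are illustrative
// assumptions, not the actual dom.js implementation.
const TAG_OPEN = /<([A-Za-z][^\s/>]*)/y;
const WHITESPACE = /\s+/y;
const ATTR_NAME = /[^\s"'<>\/=]+/y;
const EQUALS = /\s*=\s*/y;
const ATTR_VALUE = /"([^"]*)"|'([^']*)'|([^\s"'=<>]+)/y;
const TAG_CLOSE = /\s*\/?>/y;

// Run one sticky regex at exactly `pos`; returns the match or null.
function matchAt(regex, input, pos) {
  regex.lastIndex = pos;
  return regex.exec(input);
}

// Parse a single start tag beginning at `pos`.
// Returns {name, attrs, end} or null if the input is not a start tag.
function parseStartTag(input, pos = 0) {
  const open = matchAt(TAG_OPEN, input, pos);
  if (open === null) return null;
  pos = TAG_OPEN.lastIndex;

  const attrs = {};
  for (;;) {
    // End of tag: ">" or "/>", optionally preceded by whitespace.
    if (matchAt(TAG_CLOSE, input, pos) !== null) {
      return {name: open[1].toLowerCase(), attrs, end: TAG_CLOSE.lastIndex};
    }

    // Skip whitespace between attributes.
    if (matchAt(WHITESPACE, input, pos) !== null) pos = WHITESPACE.lastIndex;

    // Every loop iteration must consume an attribute name, so the loop terminates.
    const name = matchAt(ATTR_NAME, input, pos);
    if (name === null) return null;
    pos = ATTR_NAME.lastIndex;

    // The value is optional; attributes without one get an empty string.
    let value = '';
    if (matchAt(EQUALS, input, pos) !== null) {
      const val = matchAt(ATTR_VALUE, input, EQUALS.lastIndex);
      if (val === null) return null;
      value = val[1] ?? val[2] ?? val[3];
      pos = ATTR_VALUE.lastIndex;
    }
    attrs[name[0]] = value;
  }
}

// Usage: a long multi-line attribute value, similar in spirit to the script above.
const longValue = 'test test test\n'.repeat(200);
const tag = parseStartTag(`<meta property="og:description" content="${longValue}" />`);
console.log(tag.name, Object.keys(tag.attrs));
```

Each pattern here is a plain character-class loop with no nested alternation, which is the kind of thing I would expect the engine to cope with much better on long attribute values, though I haven't measured that against the real tag regex.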
This should be fixed with the rewrite of the tokenizer. https://github.com/mojolicious/dom.js/commit/007476cdf50971cca6f882aaf01dee8a725f540b
Two more "edge cases"...
cannot_parse_2.html.gz cannot_parse_1.html.gz