mojolicious / dom.js

:crystal_ball: A fast and very small HTML/XML DOM parser with CSS selectors
https://www.npmjs.com/package/@mojojs/dom
MIT License

Two more "edge cases" #13

Closed. ndrswlkr closed this issue 1 year ago

ndrswlkr commented 1 year ago

Two more "edge cases"...

cannot_parse_2.html.gz cannot_parse_1.html.gz

kraih commented 1 year ago

I bet these are more cases involving lazy quantifiers like .*? since the V8 regex engine really doesn't seem to like those.

kraih commented 1 year ago

Nope, it's the tag regex again. It seems we need larger changes to make it work. The problem can be reproduced with this script:

import DOM from './lib/dom.js';

const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Test</title>
    <meta property="og:description" content="test test test test test test test test test test test test test test test
    test test test test test test test test test test test test test test testtest test test test test test test test
    test test test test test test test test test test test 'test test test testtest test test test test test test test
    test test test test test test test test test test test 'test test test testtest test test test test test test test
    test test test test test test test test test test test test test test testtest test test test test test test test
    test test test test "test test test test test test" test test test test testtest test test test test test test test
    test test test test test test 'test test test test test test test test testtest test test test test test test test
    test test test test test test test test test test test test test test testtest test test test test test test test
    test test test test test test test test test test test test test test testtest test test test test test test test
    test test" />
   </head>
</html>
`;
const dom = new DOM(html);

kraih commented 1 year ago

My guess is that the tag regex matching needs to be split up into multiple smaller regexes.
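
One way to read that suggestion, as a minimal sketch only: instead of matching a whole tag with one large regex, scan it step by step with several small sticky (`y` flag) regexes, each anchored at the current position. Every name here (`matchAt`, `parseTag`, the `*_RE` constants) is hypothetical and not taken from @mojojs/dom, and this is not the code from the eventual tokenizer rewrite. It also ignores comments, doctypes, raw text elements, and entity decoding.

```javascript
// Hypothetical sketch: small sticky regexes, each matching one piece of a tag.
// None of these names are part of the @mojojs/dom API.
const NAME_RE = /[^\s<>=\/]+/y; // tag or attribute name
const WS_RE = /\s*/y; // optional whitespace (always matches, possibly empty)
const DQ_RE = /"([^"]*)"/y; // double-quoted attribute value
const SQ_RE = /'([^']*)'/y; // single-quoted attribute value
const UNQUOTED_RE = /[^\s>]+/y; // unquoted attribute value

// Run a sticky regex at exactly `pos`; returns the match or null.
function matchAt(re, text, pos) {
  re.lastIndex = pos;
  return re.exec(text);
}

// Parse one start tag beginning at text[start] === '<'.
// Returns {tag, attrs, end} or null if no complete tag was found.
function parseTag(text, start) {
  let pos = start + 1;
  const nameMatch = matchAt(NAME_RE, text, pos);
  if (nameMatch === null) return null;
  const tag = nameMatch[0].toLowerCase();
  pos += nameMatch[0].length;

  const attrs = {};
  while (pos < text.length) {
    pos += matchAt(WS_RE, text, pos)[0].length;
    if (text[pos] === '>') return {tag, attrs, end: pos + 1};
    if (text.startsWith('/>', pos)) return {tag, attrs, end: pos + 2};

    const attrMatch = matchAt(NAME_RE, text, pos);
    if (attrMatch === null) {
      pos++; // skip a stray character such as a lone '/' or '='
      continue;
    }
    const name = attrMatch[0];
    pos += name.length;
    pos += matchAt(WS_RE, text, pos)[0].length;

    let value = '';
    if (text[pos] === '=') {
      pos++;
      pos += matchAt(WS_RE, text, pos)[0].length;
      const quoted = matchAt(DQ_RE, text, pos) ?? matchAt(SQ_RE, text, pos);
      const valueMatch = quoted ?? matchAt(UNQUOTED_RE, text, pos);
      if (valueMatch !== null) {
        value = quoted !== null ? quoted[1] : valueMatch[0];
        pos += valueMatch[0].length;
      }
    }
    attrs[name] = value;
  }
  return null; // ran off the end without finding '>'
}

// Usage: a long attribute value containing single quotes, similar in shape
// to the reproduction script in this thread.
const sample = `<meta property="og:description" content="${'test '.repeat(200)}'test" />`;
const result = parseTag(sample, 0);
console.log(result.tag, Object.keys(result.attrs));
```

Because each regex is anchored and uses a simple negated character class, no single pattern has to backtrack across the whole tag, which is the kind of behavior the large tag regex seemed to trigger.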

kraih commented 1 year ago

This should be fixed by the rewrite of the tokenizer: https://github.com/mojolicious/dom.js/commit/007476cdf50971cca6f882aaf01dee8a725f540b