taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

parse & parseNoneClosedTags invalid behaviour #231

Open bujhmt opened 1 year ago

bujhmt commented 1 year ago

Hello, @taoqf! The parseNoneClosedTags option doesn't work properly.

Invalid HTML fragment (the <span> elements are never closed):

<div>
    <ul>
        <li>
            <a href="https://example.com">1</a>
            <span class="cat-count-span">(1)
        </li>
        <li><a href="https://example.com">2</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">3</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">4</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">5</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">6</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">7</a><span class="cat-count-span">(1)</li>
        <li><a href="https://example.com">8</a><span class="cat-count-span">(1)</li>
    </ul>
</div>
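
To confirm which tags are left unclosed in a fragment like the one above, a small standalone check can tally opening and closing tags by name. This is only a rough sketch: `countTags` and `unbalancedTags` are hypothetical helpers, not part of node-html-parser, and the regular expression assumes simple markup with no comments, scripts, void elements, or `>` inside attribute values.

```javascript
// Hypothetical helpers (not part of node-html-parser): tally opening and
// closing tags per tag name to spot elements that are never closed.
function countTags(html) {
  const counts = {};
  // Matches <tag ...> and </tag>; assumes no '>' inside attribute values.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)\b[^>]*>/g;
  for (const [, slash, name] of html.matchAll(tagPattern)) {
    const key = name.toLowerCase();
    counts[key] = counts[key] || { open: 0, close: 0 };
    if (slash) {
      counts[key].close += 1;
    } else {
      counts[key].open += 1;
    }
  }
  return counts;
}

// Returns only the tag names whose open and close counts differ.
function unbalancedTags(html) {
  return Object.entries(countTags(html))
    .filter(([, c]) => c.open !== c.close)
    .map(([name, c]) => ({ name, open: c.open, close: c.close }));
}

// Shortened version of the fragment from this report (two list items).
const fragment = `<div><ul>
<li><a href="https://example.com">1</a><span class="cat-count-span">(1)</li>
<li><a href="https://example.com">2</a><span class="cat-count-span">(1)</li>
</ul></div>`;

console.log(unbalancedTags(fragment));
// → [ { name: 'span', open: 2, close: 0 } ]
```

Only `span` is reported, which matches the markup in this report: every `<li>` is closed, but each `<span class="cat-count-span">` is not.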

Output as repaired by the browser (from devtools):

...
    <li>
        <a href="https://example.com">1</a>
        <span class="cat-count-span">(1)</span>
    </li>
...

Parse call:

const output = parse(html, { comment: false, parseNoneClosedTags: true });

Library output:

<div>
    <ul>
        <li>
            <a href="https://example.com">1</a>
            <span class="cat-count-span">(1)

        <li><a href="https://example.com">2</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">3</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">4</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">5</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">6</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">7</a><span class="cat-count-span">(1)
        <li><a href="https://example.com">8</a><span class="cat-count-span">(1)
</span></li>
        </span></li></span></li></span></li></span></li></span></li></span></li></span></li></ul>
</div>

On the other hand, if I parse a large HTML document that has this unclosed-span issue and call parse without the parseNoneClosedTags option, the library gets stuck in an infinite loop.

taoqf commented 1 year ago

https://github.com/taoqf/node-html-parser/issues/152