taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Nested A tags parsed improperly #144

Closed nonara closed 2 years ago

nonara commented 3 years ago

Synopsis

Given:

<a href="#">link <a href="#">nested link</a> end</a>

This is invalid HTML, and it should be parsed as:

<A>
 |-- <TextNode text="link ">
<A>
 |-- <TextNode text="nested link">
<TextNode text=" end">

However, it's parsed as:

<A>
 |-- <TextNode text="link ">
 |-- <A>
      |-- <TextNode text="nested link">
 |-- <TextNode text=" end">

This is causing issues for the markdown converter.

Spec

Spec dictates that an A tag cannot be a child of another A.
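The rule above can be checked against any parsed tree. The following is a minimal sketch, not part of node-html-parser: the `SimpleNode` shape and the `hasNestedAnchor` helper are hypothetical names used only for illustration. It walks a tree and reports whether any A element has another A among its descendants.

```typescript
// Hypothetical node shape for illustration only; this is not the library's own type.
interface SimpleNode {
  tag?: string;            // set for element nodes
  text?: string;           // set for text nodes
  children?: SimpleNode[];
}

// Returns true if some <a> element has another <a> anywhere beneath it.
function hasNestedAnchor(node: SimpleNode, insideAnchor: boolean = false): boolean {
  const isAnchor = node.tag === 'a';
  if (isAnchor && insideAnchor) {
    return true;
  }
  const children = node.children || [];
  return children.some((child) => hasNestedAnchor(child, insideAnchor || isAnchor));
}

// The tree the issue reports as the actual (incorrect) result.
const reportedTree: SimpleNode = {
  tag: 'root',
  children: [
    {
      tag: 'a',
      children: [
        { text: 'link ' },
        { tag: 'a', children: [{ text: 'nested link' }] },
        { text: ' end' },
      ],
    },
  ],
};

// The tree the issue describes as the expected result.
const expectedTree: SimpleNode = {
  tag: 'root',
  children: [
    { tag: 'a', children: [{ text: 'link ' }] },
    { tag: 'a', children: [{ text: 'nested link' }] },
    { text: ' end' },
  ],
};

console.log(hasNestedAnchor(reportedTree)); // true
console.log(hasNestedAnchor(expectedTree)); // false
```

A check like this could serve as a regression test: any parser output for which it returns true violates the rule stated above.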

Upon encountering a nested A tag, the parser should consider the present tag terminated and begin a new one. Any text that follows the nested tag (i.e. " end") should be treated as a TextNode.

This behaviour can be demonstrated via: https://astexplorer.net

Solution

I believe this should be easily solvable, without a performance impact. I will investigate this ASAP and submit a fix. I hope to get to it this weekend.


taoqf commented 3 years ago

Fixed. I do believe this will reduce performance a bit, but the parser should not produce incorrect output in any case. Thanks for your report. Reopen this if issues are still there.

nonara commented 3 years ago

Thanks for the fast work! I think there will still be an issue, though.

<a href="#"><b><a href="#">link</a></b></a>

My plan was to track the stack index of the last open A tag in a variable. If the parser hits another A tag, it would unwind the stack from that index and update the variable to point at the new tag.

Also, when closing the A, it should unset the index var.

I'm on mobile, so I can't give a better example, but hopefully that makes sense.
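The tracked-index idea described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from any of the referenced pull requests: the `Token` and `TreeNode` types, `buildTree`, and `serialize` are hypothetical names, and the input is assumed to be already tokenized. It does not attempt the full HTML tree-construction algorithm.

```typescript
// Hypothetical types for illustration; these are not node-html-parser internals.
interface ElementNode { kind: 'element'; tag: string; children: TreeNode[]; }
interface TextNode { kind: 'text'; text: string; }
type TreeNode = ElementNode | TextNode;

type Token =
  | { type: 'open'; tag: string }
  | { type: 'close'; tag: string }
  | { type: 'text'; text: string };

function buildTree(tokens: Token[]): ElementNode {
  const root: ElementNode = { kind: 'element', tag: 'root', children: [] };
  const stack: ElementNode[] = [root];
  let openAnchorIndex = -1; // stack index of the currently open <a>, or -1 if none

  for (const token of tokens) {
    if (token.type === 'text') {
      stack[stack.length - 1].children.push({ kind: 'text', text: token.text });
    } else if (token.type === 'open') {
      if (token.tag === 'a' && openAnchorIndex !== -1) {
        // Implicitly close the open <a> and everything nested inside it.
        stack.splice(openAnchorIndex);
      }
      const element: ElementNode = { kind: 'element', tag: token.tag, children: [] };
      stack[stack.length - 1].children.push(element);
      stack.push(element);
      if (token.tag === 'a') {
        openAnchorIndex = stack.length - 1;
      }
    } else {
      // Find the nearest open element with this tag; stray closing tags are ignored.
      let i = stack.length - 1;
      while (i > 0 && stack[i].tag !== token.tag) {
        i--;
      }
      if (i > 0) {
        stack.splice(i);
        if (openAnchorIndex >= i) {
          openAnchorIndex = -1; // the tracked <a> was closed, so unset the index
        }
      }
    }
  }
  return root;
}

// Renders the tree back to markup so the result is easy to inspect.
function serialize(node: TreeNode): string {
  if (node.kind === 'text') {
    return node.text;
  }
  return `<${node.tag}>${node.children.map(serialize).join('')}</${node.tag}>`;
}

// Tokens for: <a><b><a>link</a></b></a>
const nestedViaB: Token[] = [
  { type: 'open', tag: 'a' },
  { type: 'open', tag: 'b' },
  { type: 'open', tag: 'a' },
  { type: 'text', text: 'link' },
  { type: 'close', tag: 'a' },
  { type: 'close', tag: 'b' },
  { type: 'close', tag: 'a' },
];

console.log(serialize(buildTree(nestedViaB)));
// <root><a><b></b></a><a>link</a></root>
```

Because the index points at the open A itself rather than at the top of the stack, the intervening B is closed along with it, and the leftover closing tags are dropped as strays.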

nonara commented 3 years ago

Fix submitted in #148

taoqf commented 2 years ago

https://github.com/taoqf/node-html-parser/issues/211

taoqf commented 2 years ago

It seems #148 caused more issues.

taoqf commented 2 years ago

@nonara Do you have time to take a look at these issues?

nonara commented 2 years ago

Hi @taoqf ! Good to hear from you. Hope all is well.

I will check how other parsers behave. I think our handling matches theirs, but if not, I'll give some thought to the right approach and we can discuss a way to correct it.

taoqf commented 2 years ago

@nonara Thanks, really.

nonara commented 2 years ago

Resolved in #215