taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks source link

a tag nesting breaks everything #211

Closed lehtu closed 2 years ago

lehtu commented 2 years ago

It seems that this lib cannot handle a tag nesting. If there is an a tag inside another one and there are other tag openings in between, the parser will force close all tags between two a tag openings.

This is of course not valid HTML, although it's semantically valid, but still I thought that parser would do what it promises. As a developer I want to be able to parse HTML no matter who has produced it, especially if it's semantically alright.

Is there any workarounds for this?

example1:

import { parse } from 'node-html-parser';

const root = parse(`
  <a>
    <table>
      <tr>
        <td>
          <a href="https://github.com">link to github</a>
        </td>
      </tr>
    </table>
  </a>
`);

console.log(root.toString());
// outputs
<a>
<table>
  <tr>
    <td>
      </td></tr></table></a><a href="https://github.com">link to github</a>

example2:

import { parse } from 'node-html-parser';

const root = parse(`
    <a>
      <a href="https://github.com">link to github</a>
    </a>
`);

console.log(root.toString());
//outputs:
    <a>
      </a><a href="https://github.com">link to github</a>
nonara commented 2 years ago

Hi @lehtu (and @taoqf)

I responded on another thread about this, but I'll update here, as this seems to be the root thread on the issue.

I had a look into other parsers, and it seems that there are two ways that this is handled. It can be handled as-is, despite being invalid HTML, or it can be corrected, as we are doing since #148

Most parsers do allow the broken html, but notably parse5 does not, because it's parsing in a way that produces an HTML document. Additionally, if you insert the same code into HTML in Chrome debugger, you'll see that it corrects the tags and makes them sequential, like we are doing.

Ultimately, though, I think that in our specific scenario, the former behaviour (not correcting A tags) is the right course of action. It would really only make sense for us to correct the DOM if the point of the parser was to produce a fully compliant HTML spec, which would entail correcting all issues, not just one. That is not the purpose of this parser, however.

As a result, we can fix it by reverting changes made in #148, however, this is a breaking change. With that in mind, because some other libraries such as node-html-markdown rely on the A tag fixing to happen during the initial parsing, I will add a legacy option to enable this, but disable it by default. We will release the changes as a new major version, so as not to break anything.

The TLDR is — we'll fix it and release as a new major version.

I will handle this on Sunday.

nonara commented 2 years ago

Closed by #215

Released in v6.0.0