taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks source link

PRE block child nodes not parsed & DOCTYPE not recognized #78

Closed nonara closed 3 years ago

nonara commented 3 years ago

Hi! First, thanks for the great work on this library! This is truly the fastest and best out there.

I needed a lightning fast html-to-markdown library, and everything out there was extremely slow. With the help of node-html-parser, I was able to write something that blazes through it! (I'll be writing a readme soon, but it is a full release)

In the process, I discovered two issues, which I'll detail below.

PRE blocks don't parse children

In pre-formatted elements, contents are always being treated as text, so child-nodes do not get parsed.

This causes an issue for multi-line code blocks. Per spec, a multi-line code block is <code> wrapped in <pre>. (see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#Notes)

The following treats <code ... > as a text node, where spec behaviour is to allow HTML child nodes within pre

<pre>
  <code class='language-javascript'>
    console.log('hello');
  </code>
</pre>

You can see how we workaround this, here: https://github.com/crosstype/node-html-markdown/blob/master/src/config.ts#L79-L89

DOCTYPE is parsed as a text node

<!DOCTYPE ...> nodes are being parsed as text nodes. I think the desired behaviour here would be to simply ignore the node.

To work around this, we're temporarily replacing them ahead of time. (see: https://github.com/crosstype/node-html-markdown/blob/master/src/main.ts#L40-L42)

Side-note

This and another of my libraries broke after a recent yarn install, due to the fact that the options we passed were no longer valid. We were specifying { pre: true, style: false }

Not a major issue, but if you're following semver, changing public API options is generally considered a breaking change, as it could cause projects which use it as a dependency to fail to compile.

Thanks again for the great work, and I hope you have a good day!

taoqf commented 3 years ago
  1. tag pre was defined as BlockTextElements, I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases, maybe not. anyway, I have added an option blockTextElements to resolve this and I hope this would not cause much trouble. 1.It will break something if I ignore tag doctype.someone may use this parse a whole page, and convert back to text after they changed something.so I'm sorry I can't add this yet.
nonara commented 3 years ago

I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.

That will work. Thanks for such a fast response!

It will break something if I ignore tag doctype. ...

I understand if you don't want to ignore it. That makes sense. I should have probably clarified the problem better.

There are three types of nodes that this parser is creating during parsing.

  1. ELEMENT_NODE
  2. TEXT_NODE
  3. COMMENT_NODE

It turns out that !DOCTYPE isn't any of these. Instead, it's a DOCUMENT_TYPE_NODE (see: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType#Constants)

The difficulty here is that this parser is treating DOCTYPE as a TEXT_NODE, which makes it output the entire thing (ie. <!DOCTYPE html>) as text.

In this case, if you'd like to preserve it, it would likely be best to create a proper DOCUMENT_TYPE_NODE. Or, worst case an ELEMENT_NODE with the tagName !DOCTYPE.

Otherwise, anything which assumes a TEXT_NODE should always be visible text and never be an HTML tag, will get the side-effect of having HTML code pollute their output.

This should be doable without hurting performance. If you'd like help, let me know and I can look into it and make sure perf numbers aren't affected.

I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases,

In terms of performance, you'd likely see little to no difference. The reason I'd argue for changing it is because an HTML parser should comply with HTML spec. In the case of pre, child elements are allowed and not considered text. If it weren't for <pre><code></code></pre>, it may not matter as much, but multi-line code blocks are pretty common.

But I want to clarify that I'm not trying to nit-pick. Your blockTextElements solution will work fine for me!

I decided that I'd leave the comment here because due to the difference in this and HTML spec, there is a chance that duplicate issues may be opened. So if in the future this is something you want to look into, I'll be interested in following the conversation.

nonara commented 3 years ago

As a side note, if you'd ever like help in maintaining this, feel free to send a ping any time. I do a lot of work with parsers and compilers. Thanks again for the fast response and fix!

sebromero commented 3 years ago

Hi all! I'm using node-html-parser 2.1.0 with the blockTextElements to get the nodes from code tags inside pre. However, the classNames attribute of those code tags is empty. Is that the expected behaviour?

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        pre: false
      }
});
taoqf commented 3 years ago

@sebromero Yes. If you want get code inside pre, remove pre from blocktextelments, and add add code. like this.

parser.parse(this.rawHTML, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
        code: false
      }
});
sebromero commented 3 years ago

@taoqf That worked perfectly, thank you! It wasn't fully clear to me from the documentation how to use the config object. I thought those were boolean flags on whether or not to parse those elements as blockTextElements. 😅

dlemfh commented 2 years ago

Hi, from my tests on using this library

parse(SOME_HTML_TEXT, { blockTextElements: { pre: true }}) causes

parse(SOME_HTML_TEXT, { blockTextElements: { }}) causes

parse(SOME_HTML_TEXT, { blockTextElements: { pre: false }}) causes

Would I be correct to assume this is the intended behavior? If there's anything I'm missing I'd appreciate any heads up!

nonara commented 2 years ago

@dlemfh It looks like the logic is that the presence of the key indicates that it's block text, and the value (true or false) indicates whether its contents should be ignored or preserved.

In other words, it seems that it's behaving as expected. If you want to not treat pre as block text, you should omit it entirely from the object. Hope that helps!

dlemfh commented 2 years ago

@nonara Seems that way. Thanks!!

bogdanionitabp commented 6 months ago

But it seems you must still pass an empty object as the blockTextElements, otherwise the pre elements will not be parsed. If this was intended, then it's too confusing, otherwise it's too buggy.