Closed nonara closed 3 years ago
pre
was defined as BlockTextElements
, I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases, maybe not. anyway, I have added an option blockTextElements
to resolve this and I hope this would not cause much trouble.
1.It will break something if I ignore tag doctype
.someone may use this parse a whole page, and convert back to text after they changed something.so I'm sorry I can't add this yet.I have added an option blockTextElements to resolve this and I hope this would not cause much trouble.
That will work. Thanks for such a fast response!
It will break something if I ignore tag
doctype
. ...
I understand if you don't want to ignore it. That makes sense. I should have probably clarified the problem better.
There are three types of nodes that this parser is creating during parsing.
It turns out that !DOCTYPE
isn't any of these. Instead, it's a DOCUMENT_TYPE_NODE
(see: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType#Constants)
The difficulty here is that this parser is treating DOCTYPE as a TEXT_NODE
, which makes it output the entire thing (ie. <!DOCTYPE html>
) as text.
In this case, if you'd like to preserve it, it would likely be best to create a proper DOCUMENT_TYPE_NODE
. Or, worst case an ELEMENT_NODE
with the tagName !DOCTYPE
.
Otherwise, anything which assumes a TEXT_NODE
should always be visible text and never be an HTML tag, will get the side-effect of having HTML code pollute their output.
This should be doable without hurting performance. If you'd like help, let me know and I can look into it and make sure perf numbers aren't affected.
I don't know what will happen if I remove it from blocktextelements. maybe it will slow down the speed in most use cases,
In terms of performance, you'd likely see little to no difference. The reason I'd argue for changing it is because an HTML parser should comply with HTML spec. In the case of pre
, child elements are allowed and not considered text. If it weren't for <pre><code></code></pre>
, it may not matter as much, but multi-line code blocks are pretty common.
But I want to clarify that I'm not trying to nit-pick. Your blockTextElements
solution will work fine for me!
I decided that I'd leave the comment here because due to the difference in this and HTML spec, there is a chance that duplicate issues may be opened. So if in the future this is something you want to look into, I'll be interested in following the conversation.
As a side note, if you'd ever like help in maintaining this, feel free to send a ping any time. I do a lot of work with parsers and compilers. Thanks again for the fast response and fix!
Hi all! I'm using node-html-parser 2.1.0 with the blockTextElements to get the nodes from code
tags inside pre
. However, the classNames attribute of those code tags is empty. Is that the expected behaviour?
parser.parse(this.rawHTML, {
blockTextElements: {
script: true,
noscript: true,
style: true,
pre: false
}
});
@sebromero Yes. If you want get code inside pre, remove pre
from blocktextelments, and add add code
. like this.
parser.parse(this.rawHTML, {
blockTextElements: {
script: true,
noscript: true,
style: true,
code: false
}
});
@taoqf That worked perfectly, thank you! It wasn't fully clear to me from the documentation how to use the config object. I thought those were boolean flags on whether or not to parse those elements as blockTextElements. 😅
Hi, from my tests on using this library
parse(SOME_HTML_TEXT, { blockTextElements: { pre: true }})
causes
pre
element to be an ELEMENT_NODEpre
to be a single TEXT_NODEparse(SOME_HTML_TEXT, { blockTextElements: { }})
causes
pre
element to be an ELEMENT_NODEpre
to be parsed as ELEMENT_NODEs as wellparse(SOME_HTML_TEXT, { blockTextElements: { pre: false }})
causes
pre
element to be an ELEMENT_NODEpre
element is parsed to have no children at allWould I be correct to assume this is the intended behavior? If there's anything I'm missing I'd appreciate any heads up!
@dlemfh It looks like the logic is that the presence of the key indicates that it's block text, and the value (true or false) indicates whether its contents should be ignored or preserved.
In other words, it seems that it's behaving as expected. If you want to not treat pre
as block text, you should omit it entirely from the object. Hope that helps!
@nonara Seems that way. Thanks!!
But it seems you must still pass an empty object as the blockTextElements, otherwise the pre elements will not be parsed. If this was intended, then it's too confusing, otherwise it's too buggy.
Hi! First, thanks for the great work on this library! This is truly the fastest and best out there.
I needed a lightning fast html-to-markdown library, and everything out there was extremely slow. With the help of
node-html-parser
, I was able to write something that blazes through it! (I'll be writing a readme soon, but it is a full release)In the process, I discovered two issues, which I'll detail below.
PRE blocks don't parse children
In pre-formatted elements, contents are always being treated as text, so child-nodes do not get parsed.
This causes an issue for multi-line code blocks. Per spec, a multi-line code block is
<code>
wrapped in<pre>
. (see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#Notes)The following treats
<code ... >
as a text node, where spec behaviour is to allow HTML child nodes withinpre
You can see how we workaround this, here: https://github.com/crosstype/node-html-markdown/blob/master/src/config.ts#L79-L89
DOCTYPE is parsed as a text node
<!DOCTYPE ...>
nodes are being parsed as text nodes. I think the desired behaviour here would be to simply ignore the node.To work around this, we're temporarily replacing them ahead of time. (see: https://github.com/crosstype/node-html-markdown/blob/master/src/main.ts#L40-L42)
Side-note
This and another of my libraries broke after a recent yarn install, due to the fact that the options we passed were no longer valid. We were specifying
{ pre: true, style: false }
Not a major issue, but if you're following semver, changing public API options is generally considered a breaking change, as it could cause projects which use it as a dependency to fail to compile.
Thanks again for the great work, and I hope you have a good day!