taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks source link

fix!: TextNode text property is not decoded (fixes #134) #135

Closed nonara closed 3 years ago

nonara commented 3 years ago

Addressed

  1. TextNode#Text now is html-decoded
  2. HtmlNode#structuredText now uses html-decoded text
  3. Added trimmed text cache invalidation on rawText update (ensures up to date after HtmlNode#removeWhitespace is called)

Notes

Associated Issue

https://github.com/crosstype/node-html-markdown/issues/14

nonara commented 3 years ago

All set for review @taoqf

taoqf commented 3 years ago

Thanks for your work. based on your pr, I digged it up. please check the code https://github.com/taoqf/node-html-parser/commit/840ffdada195b15b3902135c3635d6f6daf90361 if you have time. Let me know if I missed anything.

nonara commented 3 years ago

Hi. I appreciate your taking the time! I did find a few issues.

Principally, according to spec TextNode#textContent should be the raw content, which may cause breaking issue for people (more detail below)

An appeal

As for this proposal, please consider the following:

Because of this, I would classify this as an error, as it's going against established convention setup for nodes as well as its documentation. It also effectively eliminates the purpose for having both properties.

Put more simply, why should text and rawText be different for HtmlNode and the same for TextNode? This bypasses the purpose.

So, although it's technically a breaking change, it is one which makes the code do what the documentation and convention says.

Issue with textContent

The issue with changing textContent is that this property is actually present in the HTML DOM spec, which means that this may actually cause breaking changes for people who are using it. Because it's a standardized property, this may actually affect even more users.

If you absolutely do not want to change the text field, my recommendation is to simply add a new property or to close the issue without fixing it, in which case, I'll workaround on my end, but I hope I can persuade otherwise.

Next steps

I would advise removing your recent commit, for now, and I've given you access to this PR in case you need to make changes.

I've made some corrections to my PR based on your commit. I've also added notes to make it easier to understand what I did. I should have done that before. I apologize.

Please have a look over the comments in the files changed tab, and let me know.

nonara commented 3 years ago

As a secondary suggestion, I recall your saying you don't have much time for this repo anymore. If you'd like help maintaining it, I'd be happy to come on board.

I currently maintain several large, critical libraries in the parser/compiler space. Because we're looking at a major version bump, if it made you feel better about it, I could also go through and fix the active issues that are bugs, as well as clean up the tests and update them to use jest. That way we can test against uncompiled source. I also can be sure to maintain (and likely improve) performance time, as I do/did with node-html-markdown, obtaining 3x faster than the leading node compiler.

I could also add github actions to automate testing in pull requests, pushes, etc. That can make reviewing PRs a lot easier.

Let me know if any if that sounds good, otherwise, just getting this wrapped is fine.

taoqf commented 3 years ago

Thank you for all your work.

taoqf commented 3 years ago

Should I increase the major version? I have not publish new version yet.

nonara commented 3 years ago

@taoqf You're very welcome! Thanks for addressing it so quickly! Yes, I would definitely publish as a new major version.

I'd release as v4.0.0 and add a release with the following notes:

Breaking Changes

taoqf commented 3 years ago

https://github.com/taoqf/node-html-parser/releases/tag/v4.0.0