taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks source link

fix issue #115 #123

Closed nonara closed 3 years ago

nonara commented 3 years ago

Abstract

The following PR takes a holistic approach to mitigating issue #115 by implementing HTML-specific trim functionality, where necessary. Its aim is to provide a more accurate correction for the trim() issue by replicating trim, while explicitly preserving only single leading or trailing non-breaking whitespace, where present.

Background

In addition to the already corrected area (structuredText) a second affected area was found.

The purpose of HTMLElement#removeWhitespace is to remove unnecessary whitespace nodes and otherwise cleanup the text. To do this, it calls trim() on TextNode rawText. Unfortuntately, this is too greedy.

Trim rightly removes repeating spaces, tabs, and line-breaks, which are often simply junk from multi-line HTML formatting. Unfortunately, it also removes legitimate, single, non-breaking space at the beginning or end of the text in a TextNode.

Put more simply, it's overcorrecting in some scenarios. Here are some examples:

<p>Hello <em>world!</em></p>` <!-- Desired: 'Hello world!' Actual: 'Helloworld!` -->
<p>Hello\n\r\r<em>world </em>!</p>` <!-- Desired: 'Helloworld !' Actual: 'Helloworld!` -->
<p>Hello\n\r\t<em>world   </em>!</p>` <!-- Desired: 'Hello world !' Actual: 'Helloworld!` -->

Solution

I've replaced the two calls to trim() with a cached accessor on each TextNode. The accessor performs the same action as trim, with the exception that it will allow a single non-breaking space before and after the text, if one is present.

Associated Issues

taoqf/node-html-parser#115 crosstype/node-html-markdown#9

nonara commented 3 years ago

@taoqf Ready to go. Let me know if you have any questions.

taoqf commented 3 years ago

So many thanks, I don't have much time on this rep recently, so I am going to merge your pr and publish another version now.

nonara commented 3 years ago

No problem! I know how it is. Thanks for the quick merge!