taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks source link

Leading/trailing single spaces not preserved in node text #115

Closed nonara closed 3 years ago

nonara commented 3 years ago

Issue

Example:

<p>Hello <strong>world</strong>!</p>

This, unfortunately produces a TextNode whose text is Hello as opposed to Hello<space> Examining the whole text of the parent p is Helloworld!

Unfortunately this is irreparably breaking node-html-markdown, as I have no way to determine whether there is a leading or trailing space.

see: https://github.com/crosstype/node-html-markdown/issues/9

Proposal

In effort to be consistent with HTML spec, a single leading or trailing space should be captured, when present, in the text for any TextNode.

I believe this proposal would be consistent with the behaviour of other parsers. However, if you decide against that route, an alternate proposal would be to add properties to the node hasLeadingSpace / hasTrailingSpace.

Would be interested to hear your thoughts on these. Thanks!

serjant commented 3 years ago

I have the same issue, it just trims leading and trailing spaces. Seems like the code which causes the issue is in html.d.ts:

return blocks.map((block) => {
            // Normalize each line's whitespace
            return block.join('').trim().replace(/\s{2,}/g, ' ');
        })
            .join('\n').replace(/\s+$/, '');    // trimRight;
nonara commented 3 years ago

@serjant Good find. Thanks!

Any maintainers around to weigh in on this? I'd propose a PR which removes the trim, effectively fixing the issue of removing legitimate leading/trailing whitespace. I'm happy to do this.

If there are any potential issues or side effects with this method, please let me know.

serjant commented 3 years ago

@nonara Likewise the trim() method is dangerous in that piece of code, because the trim() method removes whitespace from both ends of a string. Whitespace in this context is all the whitespace characters (space, tab, no-break space, etc.) and all the line terminator characters (LF, CR, etc.). See the specification

nonara commented 3 years ago

@taoqf Just checking up on this. It's been close to a month without a reply, so I need to make a determination on what to do.

At this point there are two proposals -

  1. Remove trim() to correct the issue
  2. Add hasLeadingWhiteSpace and hasTrailingWhiteapace properties to TextNode

I would suggest the former, as it complies with spec and makes the most sense.

I'm ready to do a PR for either with all necessary tests.

I don't mean to rush, but as this is a dependency, I will either need to write a new parser, fork this one, or (ideally) get a PR through. Whatever you decide is fine, but I would greatly appreciate a response to advise on which route I should take to get this fixed in my library.

Thanks!

taoqf commented 3 years ago

Sorry, I thought you would do a pr, now, I'll remove trim now.

taoqf commented 3 years ago

https://github.com/taoqf/node-html-parser/releases/tag/v3.3.2

nonara commented 3 years ago

No worries. It's working well! Thanks!

nonara commented 3 years ago

Looks like I was wrong here. I didn't notice that I had been using HTMLElement#removeWhitespace, whose primary function is to remove empty whitespace nodes.

Unfortunately, it also calls trim() on the text for each TextNode.

Details and fix are in #123 - I took an approach which should also help restore some trim functionality that was removed in v3.2.2

taoqf commented 3 years ago

https://github.com/taoqf/node-html-parser/releases/tag/v3.3.4