Closed nonara closed 3 years ago
I have the same issue, it just trims leading and trailing spaces. Seems like the code which causes the issue is in html.d.ts
:
return blocks.map((block) => {
// Normalize each line's whitespace
return block.join('').trim().replace(/\s{2,}/g, ' ');
})
.join('\n').replace(/\s+$/, ''); // trimRight;
@serjant Good find. Thanks!
Any maintainers around to weigh in on this? I'd propose a PR which removes the trim, effectively fixing the issue of removing legitimate leading/trailing whitespace. I'm happy to do this.
If there are any potential issues or side effects with this method, please let me know.
@nonara Likewise the trim() method is dangerous in that piece of code, because the trim() method removes whitespace from both ends of a string. Whitespace in this context is all the whitespace characters (space, tab, no-break space, etc.) and all the line terminator characters (LF, CR, etc.). See the specification
@taoqf Just checking up on this. It's been close to a month without a reply, so I need to make a determination on what to do.
At this point there are two proposals -
hasLeadingWhiteSpace
and hasTrailingWhiteapace
properties to TextNodeI would suggest the former, as it complies with spec and makes the most sense.
I'm ready to do a PR for either with all necessary tests.
I don't mean to rush, but as this is a dependency, I will either need to write a new parser, fork this one, or (ideally) get a PR through. Whatever you decide is fine, but I would greatly appreciate a response to advise on which route I should take to get this fixed in my library.
Thanks!
Sorry, I thought you would do a pr, now, I'll remove trim now.
No worries. It's working well! Thanks!
Looks like I was wrong here. I didn't notice that I had been using HTMLElement#removeWhitespace
, whose primary function is to remove empty whitespace nodes.
Unfortunately, it also calls trim()
on the text for each TextNode.
Details and fix are in #123 - I took an approach which should also help restore some trim functionality that was removed in v3.2.2
Issue
Example:
This, unfortunately produces a TextNode whose text is
Hello
as opposed toHello<space>
Examining the whole text of the parentp
isHelloworld!
Unfortunately this is irreparably breaking
node-html-markdown
, as I have no way to determine whether there is a leading or trailing space.see: https://github.com/crosstype/node-html-markdown/issues/9
Proposal
In effort to be consistent with HTML spec, a single leading or trailing space should be captured, when present, in the text for any TextNode.
I believe this proposal would be consistent with the behaviour of other parsers. However, if you decide against that route, an alternate proposal would be to add properties to the node
hasLeadingSpace
/hasTrailingSpace
.Would be interested to hear your thoughts on these. Thanks!