taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Cannot determine if whitespace exists between nodes #137

Closed nonara closed 3 years ago

nonara commented 3 years ago

Issue

Assuming:

<span>test1</span> <span>test2</span>
<span>test3</span>
<span>test4</span>

In browsers, this is rendered as:


test1 test2 test3 test4


However, it is rendered by the parser as test1test2test3test4

The deeper issue is that whitespace between nodes is not being recorded or indicated in any way.

Solutions

Playing around with this on astexplorer.net shows that most parsers (e.g. htmlparser2, parse5) create a TextNode for the whitespace.

What's interesting, however, is that Angular takes a more intelligent route, which is likely faster. Like node-html-parser, it does not create a TextNode for these. Instead, it allows users to determine for themselves via the range information attached to each node.

The range information, offered by most parsers, is simply the pair of indices where a node begins and ends: the position of the first char of the opening tag and the position of the last char of the closing tag, respectively.
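To make the convention concrete, here is a minimal sketch assuming an inclusive `[start, end]` range as described above. The `NodeRange` type and `sliceByRange` helper are hypothetical names used only for illustration and are not part of any parser's API; the indices were counted by hand from the sample string.

```typescript
// Hypothetical type: inclusive [start, end] indices into the source string.
type NodeRange = [number, number];

// Returns the exact source text a node covers, assuming an inclusive end index.
function sliceByRange(source: string, [start, end]: NodeRange): string {
  return source.slice(start, end + 1);
}

const sampleHtml = "<span>test1</span> <span>test2</span>";

// Ranges counted by hand for the two spans (assumption: inclusive end).
const firstRange: NodeRange = [0, 17];
const secondRange: NodeRange = [19, 36];

console.log(sliceByRange(sampleHtml, firstRange));
console.log(sliceByRange(sampleHtml, secondRange));
```

Index 18, which belongs to neither range, is the single space between the two spans.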

Proposed solution

I propose simply adding a range array to each node, per convention. In so doing, we are able to determine whether a node has trailing whitespace.

For example:

<!-- The following nodes have contiguous ranges. The ranges are [ 0, 17 ] and [ 18, 35 ], respectively. -->
<!-- When we compare the end of the first node (17) with the start of the next (18), we can see they are adjacent, so there is no space -->
<span>text1</span><span>text2</span>

<!-- These nodes, however, are non-contiguous. The ranges are [ 0, 17 ] and [ 20, 37 ], respectively. -->
<!-- By comparing the end (17) and start (20) locations, we know that there is at least one whitespace char between them -->
<span>text1</span>  <span>text2</span>
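The comparison could be sketched roughly as follows. This is only an illustration, assuming inclusive `[start, end]` ranges (first char of the opening tag, last char of the closing tag); `RangedNode`, `hasWhitespaceBetween`, and `joinText` are hypothetical names, not the library's API, and the ranges are counted by hand from the two sample strings.

```typescript
// Hypothetical node shape carrying the proposed range (assumption: inclusive end).
interface RangedNode {
  text: string;
  range: [number, number];
}

// True when the source between two sibling nodes is non-empty and all whitespace.
function hasWhitespaceBetween(source: string, prev: RangedNode, next: RangedNode): boolean {
  const gap = source.slice(prev.range[1] + 1, next.range[0]);
  return gap.length > 0 && /^\s+$/.test(gap);
}

// Joins sibling text, inserting one space wherever the source had whitespace.
function joinText(source: string, nodes: RangedNode[]): string {
  let out = "";
  let prev: RangedNode | undefined;
  for (const node of nodes) {
    if (prev && hasWhitespaceBetween(source, prev, node)) out += " ";
    out += node.text;
    prev = node;
  }
  return out;
}

const contiguousHtml = "<span>text1</span><span>text2</span>";
const spacedHtml = "<span>text1</span>  <span>text2</span>";

const firstSpan: RangedNode = { text: "text1", range: [0, 17] };
const adjacentSpan: RangedNode = { text: "text2", range: [18, 35] };
const spacedSpan: RangedNode = { text: "text2", range: [20, 37] };

console.log(joinText(contiguousHtml, [firstSpan, adjacentSpan]));
console.log(joinText(spacedHtml, [firstSpan, spacedSpan]));
```

Collapsing the gap to a single space mirrors how browsers render inter-element whitespace in inline content; a consumer that needs the exact whitespace could return the sliced gap instead.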

I am submitting a PR shortly.

Related Issue

https://github.com/crosstype/node-html-markdown/issues/16