taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Escape issue with querySelector [v4.1.5] #163

Closed Toilal closed 3 years ago

Toilal commented 3 years ago

I just upgraded from v4.1.4 to v4.1.5 and it seems there's a regression when using escape characters in selectors.

Consider this HTML element:

<div id="equipment-basin.material"></div>

The following call finds the element with v4.1.4 but fails to find it with v4.1.5:

document.querySelector('#equipment-basin\.material')

nonara commented 3 years ago

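If you want the selector to contain a literal backslash, you need to escape it in the JavaScript string. Otherwise, the single backslash is just escaping the . inside the string literal and never reaches the parser.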

The following should work:

const nhp = require('node-html-parser');

const res = nhp.parse('<div id="equipment-basin.material"></div>');
console.log(res.querySelector('#equipment-basin\\.material'));

Toilal commented 3 years ago

Sadly it's not consistent with V8 dom parser, but it was on v4.1.4. A single backslash should be OK.

nonara commented 3 years ago

I am sorry to hear that this has caused issues for you. That said, I'll respond to a few points and hope it can help clear things up.

It [worked] on v4.1.4

I tested with 4.1.3, 4.1.4, 4.1.5, and 5.0.0. Unfortunately, each gave the same result.

It's not consistent with V8 dom parser

Here is the result with v8:

html: image

select: image

Explanation

The issue is not a matter of the parser. Rather, it's how strings are handled. If you put \. in a string literal, you're essentially telling JavaScript to escape the dot.

The problem is that the dot does not need escaping in a string, so the backslash is dropped and the stored string is the same as if you had written a plain dot.
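
To make that concrete, here is a minimal sketch in plain JavaScript with no parser involved. The variable names are only illustrative. It checks that the single-backslash literal is stored exactly like the unescaped selector, which a CSS selector engine would read as an id followed by a class:

```javascript
// In an ordinary string literal, "\." is an unnecessary escape:
// the backslash is dropped and only the dot is stored.
const singleBackslash = '#equipment-basin\.material';
const noBackslash = '#equipment-basin.material';

console.log(singleBackslash === noBackslash);   // true
console.log(singleBackslash.includes('\\'));    // false: no backslash survives
```

So by the time querySelector is called, it has no way to tell that a backslash was ever typed.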

Below is an example that shows how strings are stored. It helps demonstrate what the parser library actually receives and why the selector does not match.

image

image
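
As a rough text sketch of the same idea (not a reproduction of the screenshots above), the snippet below compares three ways of writing the selector. String.raw is a standard JavaScript feature rather than anything specific to node-html-parser, and the variable names are assumptions made for illustration:

```javascript
// Three spellings of the selector and what actually ends up in memory.
const dropped = '#equipment-basin\.material';            // backslash is discarded
const doubled = '#equipment-basin\\.material';           // one literal backslash is stored
const raw = String.raw`#equipment-basin\.material`;      // raw template: backslash kept as typed

console.log(dropped.length);    // 25
console.log(doubled.length);    // 26
console.log(doubled === raw);   // true
console.log(doubled);           // #equipment-basin\.material
```

Only the second and third forms hand the selector engine a backslash followed by a dot, which is what it needs in order to treat the dot as part of the id.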

Hope that helps!