taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.12k stars 112 forks

Quotes in HTML attributes escaped which breaks HTML #62

Closed: optimalisatie closed this issue 4 years ago

optimalisatie commented 4 years ago

Hi!

I wanted to report an issue:

JSON values in HTML attributes are rewritten with backslash-escaped quotes, which breaks the HTML:

<div data-json='{
    "json": "value"
}'></div>

Result of .toString():

<div data-json="{\"json\":\"value\"}"></div>
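
To illustrate why the output above is broken, here is a minimal sketch. It assumes the attribute value is serialized with JSON.stringify, as the rest of this report suggests; the helper readDoubleQuotedValue is hypothetical and only mimics how an HTML parser ends a double-quoted attribute value at the next double quote. HTML has no backslash escapes, so the value is cut off early.

```javascript
// Hypothetical illustration; this is not the library's actual code.
const value = '{"json":"value"}';

// JSON.stringify wraps the string in double quotes and backslash-escapes
// the inner quotes, which is valid JSON but not valid HTML escaping.
const serialized = 'data-json=' + JSON.stringify(value);

// An HTML parser treats the backslash as an ordinary character, so the
// attribute value ends at the first inner double quote.
function readDoubleQuotedValue(attr) {
    const start = attr.indexOf('"') + 1;
    const end = attr.indexOf('"', start);
    return attr.slice(start, end);
}

const seenByHtmlParser = readDoubleQuotedValue(serialized);
console.log(serialized);       // data-json="{\"json\":\"value\"}"
console.log(seenByHtmlParser); // {\
```

Everything after the first inner quote is then read as further attributes or garbage, so the JSON can no longer be recovered from the element.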

Edit

Since the goal of the HTML parser is speed, it may be best to replace JSON.stringify for HTML attributes with a simple string-based value check and to leave the original value intact, even if it is a single space or an empty string. That could save 50,000+ JSON.stringify calls for some HTML documents.

For some attributes or JavaScript functionality it matters whether the attribute contains ="". Stripping it costs parsing resources and seems to provide no advantage other than HTML compression, which does not seem to be a goal of the HTML parser.

The following example may provide a hint for a solution:

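// Rebuild rawAttrs without JSON.stringify
const quoteRegex = /"/g; // compiled once and re-used

this.rawAttrs = Object.keys(attrs).map(function (name) {
    const val = attrs[name];
    if (val === undefined) { // attribute without a value: emit the bare name
        return name;
    }
    // Escape double quotes so the value can be wrapped in double quotes
    return name + '="' + val.replace(quoteRegex, '&#34;') + '"';
}).join(' ');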
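
The hint above depends on this and attrs from inside the library, so it cannot be run on its own. A standalone sketch of the same idea follows; the function name serializeAttrs and the sample attributes are assumptions made for illustration and are not part of node-html-parser. It shows that a JSON value survives, that an empty value keeps its ="", and that a valueless attribute stays bare.

```javascript
// Standalone sketch of the hint above; serializeAttrs is a hypothetical name.
const quoteRegex = /"/g; // compiled once and re-used

function serializeAttrs(attrs) {
    return Object.keys(attrs).map(function (name) {
        const val = attrs[name];
        if (val === undefined) {
            return name; // valueless attribute, e.g. hidden
        }
        // Escape double quotes as a numeric character reference.
        return name + '="' + String(val).replace(quoteRegex, '&#34;') + '"';
    }).join(' ');
}

const rawAttrs = serializeAttrs({
    'data-json': '{"json":"value"}',
    alt: '',          // empty string is kept as alt=""
    hidden: undefined // no value, emitted as a bare name
});
console.log(rawAttrs);
// data-json="{&#34;json&#34;:&#34;value&#34;}" alt="" hidden
```

No JSON.stringify call is involved, which is the point of the suggestion: one regex replace per attribute value.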

taoqf commented 4 years ago

Sorry, I am afraid this would lead to other errors. You can fork this lib and run npm test to see the errors. A PR is welcome. Anyway, thank you for your support.

lamplightdev commented 3 years ago

I've come across this issue too, and it can be solved by replacing double quotes (") with &quot; in attribute values. I've submitted a PR that implements this and adds new tests.
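
A rough sketch of the &quot; approach, to show that the original JSON can be read back afterwards. The helper names escapeAttrValue and unescapeAttrValue are hypothetical and are not taken from the PR; the regex used to pull the attribute out of the string is only there to keep the example self-contained.

```javascript
// Hypothetical helpers for illustration; not the code from the PR.
function escapeAttrValue(value) {
    return value.replace(/"/g, '&quot;');
}

function unescapeAttrValue(value) {
    return value.replace(/&quot;/g, '"');
}

const original = '{"json":"value"}';
const html = '<div data-json="' + escapeAttrValue(original) + '"></div>';
console.log(html);
// <div data-json="{&quot;json&quot;:&quot;value&quot;}"></div>

// Read the attribute back and decode it, roughly as a browser would.
const match = /data-json="([^"]*)"/.exec(html);
const recovered = JSON.parse(unescapeAttrValue(match[1]));
console.log(recovered.json); // value
```

This sketch only handles double quotes. A value that already contains the literal text &quot; would not round-trip unless & is escaped as well.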

taoqf commented 3 years ago

@lamplightdev Thank you.