mathiasbynens / he

A robust HTML entity encoder/decoder written in JavaScript.
https://mths.be/he
MIT License

`he.decode` is decoding HTML sequences without accounting for the presence or absence of a semicolon #93

Closed: hansede closed this issue 2 months ago

hansede commented 2 months ago

he version: 1.2.0.

According to the official HTML spec, HTML escape sequences must end with a semicolon (;) character:

Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:

Named character references: The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Decimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN character (#), followed by one or more ASCII digits, representing a base-ten integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

Hexadecimal numeric character reference: The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more ASCII hex digits, representing a hexadecimal integer that corresponds to a code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

https://html.spec.whatwg.org/multipage/syntax.html#character-references

However, he.decode doesn't account for the presence of a semicolon, leading to the following behavior:

> const he = require('he');
undefined
> he.decode('&amp');
'&'
> he.decode('&amp;');
'&'

Instead, I would expect the following:

> const he = require('he');
undefined
> he.decode('&amp');
'&amp'
> he.decode('&amp;');
'&'

This led to a production issue in my application where Google Maps URLs, which include an &center query param, were being decoded with ¢er as a query param instead, breaking maps on my pages.
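The `¢er` result comes from the small set of legacy named references (`&amp`, `&cent`, `&copy`, and others) that HTML parsers accept without a trailing semicolon, so `&center` is read as `&cent` followed by `er`. A minimal, self-contained sketch of that matching next to a semicolon-only variant follows. It is not he's implementation: the `NAMED` and `LEGACY` tables hold only a few entries, and both function names and the example URL are hypothetical.

```javascript
// Illustrative subset of named character references (not the full table).
const NAMED = new Map([
  ['amp', '&'],
  ['cent', '\u00A2'],   // ¢
  ['copy', '\u00A9'],   // ©
  ['hellip', '\u2026'], // … (not a legacy name, so it needs the semicolon)
]);

// Legacy names that parsers also accept without a trailing semicolon.
const LEGACY = ['amp', 'cent', 'copy'];

// Error-tolerant decoding: `&name;` is decoded, and a legacy name is
// decoded even when it is only a prefix of the letters after `&`.
function decodeTolerant(input) {
  return input.replace(/&([a-zA-Z]+)(;?)/g, (match, name, semi) => {
    if (semi && NAMED.has(name)) {
      return NAMED.get(name);
    }
    // Pick the longest legacy name that the letters start with.
    let prefix;
    for (const candidate of LEGACY) {
      if (name.startsWith(candidate) &&
          (prefix === undefined || candidate.length > prefix.length)) {
        prefix = candidate;
      }
    }
    if (prefix === undefined) {
      return match; // not a known reference, leave untouched
    }
    return NAMED.get(prefix) + name.slice(prefix.length) + semi;
  });
}

// Semicolon-only decoding: anything without `;` is left as-is.
function decodeSemicolonOnly(input) {
  return input.replace(/&([a-zA-Z]+);/g, (match, name) =>
    NAMED.has(name) ? NAMED.get(name) : match);
}

const url = 'https://maps.example/?zoom=12&center=40.7,-74';
console.log(decodeTolerant(url));      // https://maps.example/?zoom=12¢er=40.7,-74
console.log(decodeSemicolonOnly(url)); // https://maps.example/?zoom=12&center=40.7,-74
```

The semicolon-only variant matches the behavior expected in this report, at the cost of not decoding input such as `&amp` that browsers would decode.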

mathiasbynens commented 2 months ago

> According to the official HTML spec, HTML escape sequences must end with a semicolon (;) character:

This is the case in order to be "valid HTML". However, character references without semicolons do work (as in, they have specified parsing behavior), just with an error: https://html.spec.whatwg.org/multipage/parsing.html#parse-error-missing-semicolon-after-character-reference

It sounds like you want the strict option: https://github.com/mathiasbynens/he#strict-1
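One detail worth noting when applying this: per he's README, `strict: true` makes `he.decode` throw a parse error on a reference that lacks its semicolon rather than returning it unchanged. A caller who wants the input preserved therefore has to catch that error. The sketch below assumes this behavior; `decodeOrKeep` is a hypothetical helper, and the `he` module is passed in as an argument so the sketch makes no assumption about how it is loaded.

```javascript
// Hypothetical helper: decode in strict mode, and fall back to the
// original string if he reports a parse error.
// Note the fallback is all-or-nothing: one malformed reference anywhere
// in the input means the whole string is returned undecoded.
function decodeOrKeep(he, input) {
  try {
    return he.decode(input, { strict: true });
  } catch (error) {
    return input;
  }
}

// Usage, assuming he 1.2.0 is installed:
//   const he = require('he');
//   decodeOrKeep(he, '&amp;'); // well-formed, decoded normally
//   decodeOrKeep(he, '&amp');  // strict mode throws, input is returned as-is
//
// For URLs taken from attribute values, the README's `isAttributeValue`
// option is another route: it leaves a semicolon-less reference alone
// when more characters follow it.
//   he.decode('zoom=12&center=1', { isAttributeValue: true });
```

Whether to catch the error or let it propagate depends on whether malformed input should be tolerated or rejected by the application.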

hansede commented 2 months ago

Oof, it's frustrating that the section you linked about the semicolon parsing error invalidates the rather concrete statement about semicolons in the "character references" section, and without being referenced in the "character references" section. You are correct though, and I'll use the strict option instead.