Closed jviereck closed 3 years ago
I can't find that part of the spec either :thinking:
When parsing `[\w-e]`, I read the spec as follows:
- `[...]` is a CharacterClass.
- `\w-e` must then be parsed using the ClassRanges -> NonemptyClassRanges production.
- `\w` is a valid ClassAtomNoDash, because `w` is a valid CharacterClassEscape and thus `\w` can be parsed using the `\ ClassEscape` rule.
- `\w-e` can be parsed using the `ClassAtomNoDash - ClassAtom` rule of NonemptyClassRanges.

EDIT: I found why it's disallowed outside of Annex B: https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors
NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges
- It is a Syntax Error if IsCharacterClass of the first ClassAtom is true or
IsCharacterClass of the second ClassAtom is true.
- It is a Syntax Error if IsCharacterClass of the first ClassAtom is false and
IsCharacterClass of the second ClassAtom is false and the CharacterValue of
the first ClassAtom is larger than the CharacterValue of the second
ClassAtom.
EDIT: I got it.
First of all, this only applies when `/u` is not set. With `/u`, `[\w-e]` is a SyntaxError.
I think that it's correct to parse `\w-e` as a range "going from `\w` to `e`". This is how the spec defines it, as noted above.
However, in Annex B, when evaluating (not when parsing) the regex, there is this rule: https://tc39.es/ecma262/#sec-regular-expression-patterns-semantics
The production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges evaluates as follows:
1. Evaluate the first ClassAtom to obtain a CharSet A.
2. Evaluate the second ClassAtom to obtain a CharSet B.
3. Evaluate ClassRanges to obtain a CharSet C.
4. Call CharacterRangeOrUnion(A, B) and let D be the resulting CharSet.
5. Return the union of CharSets D and C.
This is the definition of `CharacterRangeOrUnion`:
CharacterRangeOrUnion ( A, B )
1. If Unicode is false, then
a. If A does not contain exactly one character or B does not contain
exactly one character, then
i. Let C be the CharSet containing the single
character - U+002D (HYPHEN-MINUS).
ii. Return the union of CharSets A, B and C.
2. Return CharacterRange(A, B).
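The abstract operation above can be sketched in JavaScript. This is only an illustration: it assumes a CharSet is modeled as a `Set` of code points, the function names `characterRange` and `characterRangeOrUnion` are made up for this sketch, and the `\w` stand-in is ASCII-only. Where the spec asserts its preconditions, the sketch throws instead.

```javascript
// CharSets are modeled as Sets of code points (an assumption for illustration).
function characterRange(a, b) {
  // The spec asserts these conditions (early errors guarantee them);
  // this sketch throws instead of asserting.
  if (a.size !== 1 || b.size !== 1) {
    throw new SyntaxError("Range endpoints must be single characters");
  }
  const [i] = a;
  const [j] = b;
  if (i > j) {
    throw new SyntaxError("Range out of order");
  }
  const result = new Set();
  for (let cp = i; cp <= j; cp++) {
    result.add(cp);
  }
  return result;
}

function characterRangeOrUnion(a, b, unicode) {
  if (!unicode && (a.size !== 1 || b.size !== 1)) {
    const c = new Set([0x2d]); // U+002D HYPHEN-MINUS
    return new Set([...a, ...b, ...c]);
  }
  return characterRange(a, b);
}

// ASCII-only stand-in for the CharSet of \w (hypothetical helper data).
const wordChars = new Set(
  [..."abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"].map(
    (ch) => ch.codePointAt(0)
  )
);
const letterE = new Set(["e".codePointAt(0)]);

// Without the u flag, [\w-e] becomes the union of \w, {-} and {e}.
const union = characterRangeOrUnion(wordChars, letterE, false);
console.log(union.has(0x2d));
```

With `unicode` set to `true`, the same call falls through to `characterRange` and throws, which mirrors the early error in unicode mode.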
This means that, when evaluated at runtime, that range is evaluated as the union of `\w`, {`-`} and {`e`}.
However, this is a runtime distinction and not observable in the parser.
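The runtime behaviour can be checked directly in an engine that implements Annex B (browsers and Node.js do). A quick sketch, with the test strings chosen only for illustration:

```javascript
// Without the u flag, Annex B semantics apply.
const re = /[\w-e]/;

console.log(re.test("-")); // true: the hyphen is a literal member of the class
console.log(re.test("e")); // true
console.log(re.test("a")); // true: matched by \w
console.log(re.test("!")); // false: not in \w, not "-", not "e"
```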
A few tests: `/[\w-e]/u` is a Syntax Error because `\w` has multiple code points, but `/[\b-e]/u` isn't because `\b` has a single code point.
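These cases are easy to reproduce with the `RegExp` constructor, which lets the SyntaxError be caught instead of failing the whole script. The helper name `compiles` is made up for this sketch:

```javascript
// Returns true if the pattern compiles with the given flags,
// false if the engine rejects it with a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err;
  }
}

console.log(compiles("[\\w-e]", "u")); // false: \w is not a single code point
console.log(compiles("[\\b-e]", "u")); // true: \b in a class is U+0008
console.log(compiles("[\\w-e]", ""));  // true in engines that implement Annex B
```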
First, thanks a lot for digging into this, @nicolo-ribaudo! The explanation makes sense. The counterexample you found, `/[\b-e]/u`, is also a very good one!
It turns out `/[\b]/` already has a `codePoint` in our parser's output. Therefore, I think the spec's `A does not contain exactly one character` condition corresponds to checking whether the atom has a `codePoint`, as the current version of the code does.
I've changed the implementation to take the unicode flag into account. Also, I've added a few more unit tests based on your/@nicolo-ribaudo inputs.
Sorry for taking so long to finish this one. Merged before starting work on #105.
This fixes #80.
I was not able to find a part in the spec about this. The rule I came up with is: if the `from` or `to` part of the class range has no `codePoint`, then assume it's not possible to create a `from`-`to` range. In this case, emit the atoms and the dash as separate elements in the parse tree output.

cc @icefapper
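A minimal sketch of this rule, combined with the unicode-flag handling mentioned earlier in the thread. The node shapes and the names `hasCodePoint` and `buildClassRange` are hypothetical and only loosely modeled on a regex parse tree; this is not the parser's actual code.

```javascript
// Hypothetical node shapes for illustration only.
function hasCodePoint(atom) {
  return typeof atom.codePoint === "number";
}

// Returns the list of class elements produced for "from - to".
function buildClassRange(from, to, unicode) {
  if (hasCodePoint(from) && hasCodePoint(to)) {
    if (from.codePoint > to.codePoint) {
      throw new SyntaxError("Range out of order in character class");
    }
    return [{ type: "characterClassRange", min: from, max: to }];
  }
  if (unicode) {
    // With /u, an endpoint such as \w is an early error.
    throw new SyntaxError("Invalid character class range");
  }
  // Without /u: emit both atoms and a literal dash as separate elements.
  const dash = { type: "value", codePoint: 0x2d };
  return [from, dash, to];
}

const wordEscape = { type: "characterClassEscape", value: "w" };
const letterE = { type: "value", codePoint: 0x65 };

const loose = buildClassRange(wordEscape, letterE, false);
console.log(loose.map((node) => node.type));
// [ 'characterClassEscape', 'value', 'value' ]
```

Returning an array in both branches keeps the caller simple: it can append whatever comes back to the class body without caring whether a real range was formed.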