Fixes /[\w-e]/ browser behavior

jviereck commented 4 years ago

This fixes #80.

I was not able to find a part in the spec about this. The rule I came up with is: if the from or to part of the class range has no codePoint, then assume it's not possible to create a range from them. In this case, emit the atoms and the dash as separate elements in the parse tree output.
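
A minimal sketch of the rule described above, not the actual regjsparser code. The helper name `buildClassRange` and the simplified node shapes are assumptions made for illustration only:

```javascript
// Hypothetical helper: decides how to emit "from - to" inside a character class.
// Node shapes are simplified and only loosely modelled on a regex parse tree.
function buildClassRange(from, dash, to) {
  const hasCodePoint = (atom) => typeof atom.codePoint === 'number';

  if (hasCodePoint(from) && hasCodePoint(to)) {
    // Both endpoints are single characters, so a real range can be built.
    return [{ type: 'characterClassRange', min: from, max: to }];
  }
  // At least one endpoint (e.g. \w) has no codePoint:
  // keep the atoms and the dash as separate elements.
  return [from, dash, to];
}

const escapeW = { type: 'characterClassEscape', value: 'w' };
const dashAtom = { type: 'value', codePoint: 0x2d }; // "-"
const atomA = { type: 'value', codePoint: 0x61 };    // "a"
const atomE = { type: 'value', codePoint: 0x65 };    // "e"

console.log(buildClassRange(escapeW, dashAtom, atomE).length); // 3 separate elements
console.log(buildClassRange(atomA, dashAtom, atomE)[0].type);  // "characterClassRange"
```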

/cc @icefapper

nicolo-ribaudo commented 4 years ago

I can't find that part of the spec either :thinking:

When parsing [\w-e], I read the spec as follows:

Obviously, [...] is a CharacterClass.
\w-e must then be parsed using the ClassRanges -> NonemptyClassRanges production.
\w is a valid ClassAtomNoDash, because w is a valid CharacterClassEscape and thus \w can be parsed using the \ ClassEscape rule.
This means that \w-e can be parsed using the ClassAtomNoDash - ClassAtom rule of NonemptyClassRanges.

EDIT: I found why it's disallowed outside of AnnexB: https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors

NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges
  - It is a Syntax Error if IsCharacterClass of the first ClassAtom is true or 
    IsCharacterClass of the second ClassAtom is true.
  - It is a Syntax Error if IsCharacterClass of the first ClassAtom is false and
    IsCharacterClass of the second ClassAtom is false and the CharacterValue of
    the first ClassAtom is larger than the CharacterValue of the second
    ClassAtom.

EDIT: I got it.

First of all, this only applies when /u is not set. With /u, [\w-e] is a SyntaxError.

I think that it's correct to parse \w-e as a range "going from \w to e". This is how the spec defines it, as noted above.

However, in AnnexB, when evaluating (not when parsing) the regex, there is this rule: https://tc39.es/ecma262/#sec-regular-expression-patterns-semantics

The production NonemptyClassRanges::ClassAtom-ClassAtomClassRanges evaluates as follows:
  1. Evaluate the first ClassAtom to obtain a CharSet A.
  2. Evaluate the second ClassAtom to obtain a CharSet B.
  3. Evaluate ClassRanges to obtain a CharSet C.
  4. Call CharacterRangeOrUnion(A, B) and let D be the resulting CharSet.
  5. Return the union of CharSets D and C.

This is the definition of CharacterRangeOrUnion:

CharacterRangeOrUnion ( A, B )

  1. If Unicode is false, then
      a. If A does not contain exactly one character or B does not contain
         exactly one character, then
          i. Let C be the CharSet containing the single
             character - U+002D (HYPHEN-MINUS).
         ii. Return the union of CharSets A, B and C.
  2. Return CharacterRange(A, B).

This means that, when evaluated at runtime, that range is evaluated as the union of \w, {-} and {e}.

However, this is a runtime distinction and not observable in the parser.
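
The CharacterRangeOrUnion steps quoted above can be sketched roughly as follows. Modelling a CharSet as a `Set` of code points, and the function and variable names, are assumptions for illustration; this is not how an engine implements it:

```javascript
// CharSets are modelled as Sets of code points (an assumption for this sketch).
function characterRange(a, b) {
  if (a.size !== 1 || b.size !== 1) {
    throw new SyntaxError('Range endpoints must be single characters');
  }
  const [i] = a;
  const [j] = b;
  if (i > j) throw new SyntaxError('Range out of order in character class');
  const out = new Set();
  for (let cp = i; cp <= j; cp++) out.add(cp);
  return out;
}

function characterRangeOrUnion(a, b, unicode) {
  // Step 1: without the u flag, a multi-character endpoint turns the
  // "range" into the union of A, B and the hyphen itself.
  if (!unicode && (a.size !== 1 || b.size !== 1)) {
    return new Set([...a, ...b, 0x2d]);
  }
  // Step 2: otherwise build a real range.
  return characterRange(a, b);
}

// \w without the u or i flags: A-Z, a-z, 0-9 and "_".
const wordChars = new Set(
  [...'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_']
    .map((ch) => ch.codePointAt(0))
);

const union = characterRangeOrUnion(wordChars, new Set([0x65]), false);
console.log(union.has(0x2d));      // true: "-" is part of the resulting set
console.log(/[\w-e]/.test('-'));   // true: the built-in engine agrees
```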

nicolo-ribaudo commented 4 years ago

A few tests: /[\w-e]/u is a Syntax Error because \w has multiple code points, but /[\b-e]/u isn't because \b has a single code point.
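
These cases can be checked directly against the built-in `RegExp` constructor. The `compiles` helper below is a hypothetical name used only for this check:

```javascript
// Returns true if the pattern compiles, false if it raises a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(compiles('[\\w-e]', ''));  // true: allowed by AnnexB without /u
console.log(compiles('[\\w-e]', 'u')); // false: \w is a character class
console.log(compiles('[\\b-e]', 'u')); // true: \b is the single code point U+0008
```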

jviereck commented 4 years ago

First, thanks a lot for digging into this, @nicolo-ribaudo! The explanation makes sense. The counterexample you found, /[\b-e]/u, is also a very good one!

It turns out our parser already gives /[\b]/ a codePoint. Therefore, I think the spec's "A does not contain exactly one character" condition corresponds to checking whether the atom has a codePoint, as the current version of the code does.

I've changed the implementation to take the unicode flag into account. Also, I've added a few more unit tests based on your/@nicolo-ribaudo inputs.

jviereck commented 3 years ago

Sorry for taking so long to finish this one. Merged before starting work on #105.

jviereck / regjsparser

Fixes /[\w-e]/ browser behavior #103