hoaproject / Regex

The Hoa\Regex library.
https://hoa-project.net/

Incomplete support for internal option setting #29

Open ju1ius opened 6 years ago

ju1ius commented 6 years ago

Hi!

What works

All the above work only for the i, m, s and x options.

What doesn't work:

  1. Setting / unsetting the U, X, and J options
  2. Setting several options: a(?im)b
  3. Unsetting several options: a(?-i-m)b
  4. Mixing the above two: a(?i-m)b
  5. Setting options for a non-capturing group: a(?i:b)c
  6. The grammar allows the (?+i) syntax, but according to the documentation and the PHP implementation this is invalid.

All the above fail with: Unexpected token "?" (zero_or_one) at line 1 and column 3

Possible fixes

Changing the grammar to:

// Internal options.
%token internal_option \(\?(-?[imsxJUX])+\)

solves items 1, 2, 3, 4 and 6. Item 5 is a bit more complex... :wink:
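As a rough sanity check, the proposed token regex can be tried against the option groups from the list above. This is only a sketch: it uses Python's `re` module as a stand-in for the PCRE-based lexer (an assumption, although this pattern uses no syntax that differs between the two), and the helper name `is_internal_option` is made up for illustration.

```python
import re

# Token regex proposed above, copied verbatim.
INTERNAL_OPTION = re.compile(r"\(\?(-?[imsxJUX])+\)")


def is_internal_option(text: str) -> bool:
    """Return True when the whole string is one internal_option token."""
    return INTERNAL_OPTION.fullmatch(text) is not None


# Option groups taken from items 1 to 6 of the list above.
CASES = ["(?U)", "(?im)", "(?-i-m)", "(?i-m)", "(?i:b)", "(?+i)"]

if __name__ == "__main__":
    for case in CASES:
        print(f"{case:10} {is_internal_option(case)}")
```

Under these assumptions the first four cases are accepted, while `(?i:b)` and `(?+i)` are rejected, which matches the claim that item 5 needs more than a token change.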

ju1ius commented 6 years ago

Changing the grammar to:

// Tokens
%token internal_option_             \(\?(?=-?[imsxJUX])         -> opt
%token opt:internal_option          -?[imsxJUX]                 -> opt
%token opt:semicolon                :                           -> default
%token opt:_internal_option         \)                          -> default

// Rules
internal_options:
    ::internal_option_:: options() ::_internal_option::
    | ::internal_option_:: options() ::semicolon:: alternation() ::_capturing:: #noncapturing

options:
    <internal_option>+ #internal_options

yields the following parse trees:

Pattern: a(?i)b
>  #expression
>  >  #concatenation
>  >  >  token(literal, a)
>  >  >  #internal_options
>  >  >  >  token(opt:internal_option, i)
>  >  >  token(literal, b)
Pattern: a(?i:b)c
>  #expression
>  >  #concatenation
>  >  >  token(literal, a)
>  >  >  #noncapturing
>  >  >  >  #internal_options
>  >  >  >  >  token(opt:internal_option, i)
>  >  >  >  token(literal, b)
>  >  >  token(literal, c)

These seem syntactically correct since, to me at least, (?i:b) means "a non-capturing group for which the i option is set".
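To illustrate how the `opt` namespace switching in the token definitions above would drive the lexer, here is a minimal two-state tokenizer sketch. It is not Hoa\Compiler: the `literal` and `_capturing` rules are simplified stand-ins for the rest of the grammar (assumptions), and `tokenize` is a hypothetical helper written in Python.

```python
import re

# (namespace, token name, regex, namespace to switch to, or None to stay)
RULES = [
    ("default", "internal_option_", r"\(\?(?=-?[imsxJUX])", "opt"),
    ("default", "_capturing", r"\)", None),    # simplified stand-in
    ("default", "literal", r"[^()]", None),    # simplified stand-in
    ("opt", "internal_option", r"-?[imsxJUX]", None),
    ("opt", "semicolon", r":", "default"),
    ("opt", "_internal_option", r"\)", "default"),
]


def tokenize(pattern: str) -> list[tuple[str, str]]:
    """Split pattern into (token name, text) pairs, tracking the namespace."""
    namespace, pos, tokens = "default", 0, []
    while pos < len(pattern):
        for ns, name, regex, target in RULES:
            if ns != namespace:
                continue
            match = re.compile(regex).match(pattern, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                namespace = target or namespace
                break
        else:
            # No rule of the current namespace matched at this position.
            raise ValueError(f"unexpected input at column {pos + 1}")
    return tokens


if __name__ == "__main__":
    for example in ["a(?i)b", "a(?i-m)b", "a(?i:b)c"]:
        print(example, tokenize(example))
```

The lookahead in `internal_option_` is what keeps the lexer from entering `opt` on an opening such as `(?+i)`: in this sketch that input simply raises, because no default-namespace rule accepts the `(`.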

What do you think?

Hywan commented 6 years ago

Thanks for the report!

Let's consider the following diff:

- %token  internal_option          \(\?[\-+]?[imsx]\)
+ %token  internal_option          \(\?(-?[imnsx]+)*\)

It should solve problems 1, 2, 3, 4, 6, 7, and 8.
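A quick way to compare the two sides of the diff is to run both token regexes over a handful of option groups. The sketch below uses Python's `re` module in place of the PCRE-based lexer (an assumption; neither pattern uses syntax that differs between the two), and the `accepted` helper is hypothetical.

```python
import re

# Both regexes are copied verbatim from the diff above.
OLD = re.compile(r"\(\?[\-+]?[imsx]\)")
NEW = re.compile(r"\(\?(-?[imnsx]+)*\)")


def accepted(token_regex: re.Pattern, text: str) -> bool:
    """Return True when the whole string matches the token regex."""
    return token_regex.fullmatch(text) is not None


CASES = ["(?i)", "(?im)", "(?-i-m)", "(?i-m)", "(?+i)", "(?U)", "(?)"]

if __name__ == "__main__":
    print(f"{'input':10} {'old':6} {'new':6}")
    for case in CASES:
        print(f"{case:10} {accepted(OLD, case)!s:6} {accepted(NEW, case)!s:6}")
```

Under these assumptions the new regex accepts the multi-option forms and rejects `(?+i)`. It also accepts the empty group `(?)`, and it does not accept `(?U)`, because J, U and X are not in its character class.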

Problem 5 is more tricky, and it's not related to “internal option” directly. Can you open another issue to address it please?