sublimehq / sublime_text

Issue tracker for Sublime Text
https://www.sublimetext.com

Custom regexp engine parses isolated options differently from Oniguruma #2354

Open Thom1729 opened 6 years ago

Thom1729 commented 6 years ago

Expected behavior

When Sublime's custom regexp engine handles a regexp, it should behave identically to Oniguruma.

Actual behavior

Oniguruma has a quirk when parsing isolated options (e.g. (?i)) that Sublime does not replicate. When Oniguruma encounters isolated options, the remainder of the enclosing group (or of the expression, if there is no enclosing group) is implicitly grouped. For instance, the following expressions are equivalent:

x(?i)y|z
x(?i:y|z)

The documentation is less than clear, and this behavior is unintuitive, but it is consistent. I suppose that option groups are parsed with the same precedence as the | operator.

Sublime's custom regexp engine, however, will interpret that expression differently, so that the following are equivalent:

x(?i)y|z
(?:x(?i)y)|z

As a result, the same construct may be interpreted differently depending on whether the expression triggers the Oniguruma engine or uses the native Sublime engine. This is confusing. In addition, this is an obstacle to third-party implementations and other tools.
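
To make the difference concrete, here is a minimal sketch in Python that compares the two readings of x(?i)y|z. Python's re module is not either engine, and it is stricter about isolated options in the middle of a pattern, so each reading is spelled out with an explicit scoped group instead: x(?i:y|z) for Oniguruma and x(?i:y)|z for Sublime's custom engine. The input list and the helper name matching_inputs are made up for illustration.

```python
import re

# Oniguruma's reading of x(?i)y|z: the option covers the rest of the
# enclosing group, alternation included.
ONIGURUMA_READING = re.compile(r"x(?i:y|z)")

# Sublime's custom engine reading, (?:x(?i)y)|z: the option stops at the
# alternation, so the second branch stays case-sensitive.
SUBLIME_READING = re.compile(r"x(?i:y)|z")

# Hypothetical inputs chosen to separate the two readings.
INPUTS = ["xy", "xY", "xz", "xZ", "z", "Z"]


def matching_inputs(pattern, inputs):
    """Return the inputs that the pattern matches in full."""
    return [s for s in inputs if pattern.fullmatch(s)]


oniguruma_matches = matching_inputs(ONIGURUMA_READING, INPUTS)
sublime_matches = matching_inputs(SUBLIME_READING, INPUTS)

print("Oniguruma reading:", oniguruma_matches)
print("Sublime reading:  ", sublime_matches)
```

Under these assumptions, only xy and xY are accepted by both readings; xz and xZ match only the Oniguruma reading, and a bare z matches only the Sublime reading.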

Sample syntax

%YAML 1.2
---
name: Test Option Parsing
scope: source.test-option-parsing
contexts:
  main:
    - match: a(?i)b|c
      scope: region.redish

    - match: (?:d(?i)e)|f
      scope: region.redish

    # Force Oniguruma
    - match: u(?i)v|w(?<!0)
      scope: region.bluish

    - match: x(?i:y|z)(?<!0)
      scope: region.bluish

Sample input

ab
ac
c

de
df
f

uv
uw
w

xy
xz
z

Notes

The core HTML syntax inadvertently relies upon this bug. I will submit a PR to correct that.

A suggested best practice is to avoid isolated options, except at the very beginning of an expression (and never in variables). Instead, use noncapturing groups with flags. For example, instead of a(?i)b, use a(?i:b).
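
As a rough aid for following that practice, the sketch below scans a pattern string and reports isolated option groups that appear anywhere other than the very beginning. It is only an approximation: the function name find_isolated_options is hypothetical, the set of option letters is an assumption rather than something taken from Oniguruma's documentation, and character classes are handled naively (no nested classes, no leading ] inside a class).

```python
import re

# Loose approximation of an isolated option group such as (?i) or (?i-m).
# Assumption: any run of ASCII letters and hyphens counts as option flags.
ISOLATED_OPTION = re.compile(r"\(\?[a-zA-Z-]+\)")


def find_isolated_options(pattern):
    """Return (offset, text) for each isolated option group after offset 0."""
    hits = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
        elif in_class:
            in_class = ch != "]"  # naive: the first ] closes the class
            i += 1
        elif ch == "[":
            in_class = True
            i += 1
        else:
            m = ISOLATED_OPTION.match(pattern, i)
            if m:
                if i > 0:  # an option at the very beginning is allowed
                    hits.append((i, m.group()))
                i = m.end()
            else:
                i += 1
    return hits


for candidate in ["a(?i)b", "(?i)ab", "a(?i:b)", "u(?i)v|w(?<!0)"]:
    print(candidate, find_isolated_options(candidate))
```

A check like this could run over the match patterns of a syntax definition before they are committed; patterns it flags would then be rewritten by hand into the a(?i:b) form.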

FichteFoll commented 6 years ago

Is it certain that Oniguruma didn't mean x(?i)y|z to become x[Yy]|[Zz]? The wording really isn't clear on that.

Thom1729 commented 6 years ago

By observation, it's grouped like x(?i)(?:y|z) = x(?i:y|z). I've tested this in Sublime (using (?<!0) to force Oniguruma) and in the highlighter I'm working on.

FichteFoll commented 6 years ago

I meant it more in the sense of: do we know that it's not a bug? It really does seem weird to parse it like that.

Thom1729 commented 6 years ago

I've opened an issue to verify.

It would be better for Sublime to replicate the bug than to differ from Oniguruma. However, if it is a bug, and it is fixed in Oniguruma, then that might be a good reason for Sublime to update its Oniguruma version.

deathaxe commented 6 years ago

I always understood (?i) to express some kind of globally applied flag that affects everything following it. This is also what https://stackoverflow.com/questions/15145659/what-do-i-and-i-in-regex-mean#15145701 says.

So it is not a bug of Oniguruma.

FichteFoll commented 5 years ago

I just went through the referenced issue: the intended behavior for Oniguruma is to interpret x(?i)y|z as x(?i)(?:y|z).

See also this test case: https://github.com/kkos/oniguruma/commit/0b7a1b9d894473b396c42c6afc99c85e280f83c9#diff-f1faa5ae6ee6c139773f8424cadf6112R398