[Regular Expression] Unified scope names for (embedded) Regular Expressions

jwortmann commented 5 years ago

Various languages have inbuilt support for Regular Expressions, and some default syntax packages provide rules to apply suitable scopes within RegExp strings. However, these scope names differ from syntax to syntax. For example, the standalone RegExp syntax and Clojure use keyword.operator.alternation.regexp for the | symbol, while JavaScript, Python and PHP use keyword.operator.or.regexp. Another example is character classes such as \d or \w, which get the scope keyword.control.character-class.regexp in the standalone RegExp syntax, constant.other.character-class.escape.backslash.regexp in JavaScript and constant.character.character-class.regexp in Python and PHP. Other languages such as Tcl and Ruby recognize RegExp strings but do not apply specific scopes other than string.regexp, which prevents syntax highlighting of Regular Expressions in these languages.

I want to refine my color scheme for consistent RegExp highlighting, but the currently used scope names make it difficult to find common highlighting rules for all languages. My knowledge of syntax definitions is somewhat limited, but as far as I know there is the possibility to embed a syntax within another language syntax (e.g. CSS in HTML). Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

Regular Expression syntax:

    (?<=(T|t)he\s)(cat)$
(?#  ^^^ constant.other.assertion )
(?#       ^ keyword.operator.alternation )
(?#            ^^ keyword.control.character-class )
(?#                    ^ keyword.control.anchors )

JavaScript syntax:

var regex = /(?<=(T|t)he\s)(cat)$/;
//            ^^^ punctuation.definition.group.assertion
//                 ^ keyword.operator.or
//                      ^^ constant.other.character-class.escape.backslash
//                              ^ keyword.control.anchor

Python syntax:

regex = r'(?<=(T|t)he\s)(cat)$'
#          ^^^ constant.other.assertion
#               ^ keyword.operator.or
#                    ^^ constant.character.character-class
#                            ^ keyword.control.anchor

Ruby syntax:

regex = /(?<=(T|t)he\s)(cat)$/
#                   ^^ constant.character

Clojure syntax:

#"(?<=(T|t)he\s)(cat)$"
;  ^^^ constant.other.assertion
;       ^ keyword.operator.alternation
;            ^^ keyword.control.character-class
;                    ^ keyword.control.anchors
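
The mismatches in the examples above can be collected in a small table. The following Python sketch is only an illustration: the scope names are copied from the examples (without the trailing .regexp), while the SCOPES layout and the mismatches helper are made up for this comment and are not part of any package. Ruby is left out because its example only shows a single scope.

```python
# Scope names as shown in the examples above, keyed by construct and syntax.
# The dictionary layout is an assumption made for this illustration.
SCOPES = {
    "lookbehind (?<=": {
        "RegExp": "constant.other.assertion",
        "JavaScript": "punctuation.definition.group.assertion",
        "Python": "constant.other.assertion",
        "Clojure": "constant.other.assertion",
    },
    "alternation |": {
        "RegExp": "keyword.operator.alternation",
        "JavaScript": "keyword.operator.or",
        "Python": "keyword.operator.or",
        "Clojure": "keyword.operator.alternation",
    },
    "character class \\s": {
        "RegExp": "keyword.control.character-class",
        "JavaScript": "constant.other.character-class.escape.backslash",
        "Python": "constant.character.character-class",
        "Clojure": "keyword.control.character-class",
    },
    "anchor $": {
        "RegExp": "keyword.control.anchors",
        "JavaScript": "keyword.control.anchor",
        "Python": "keyword.control.anchor",
        "Clojure": "keyword.control.anchors",
    },
}


def mismatches(scopes):
    """Return, per construct, each distinct scope name and the syntaxes using it.

    Constructs on which all syntaxes agree are omitted from the result.
    """
    result = {}
    for construct, by_syntax in scopes.items():
        groups = {}
        for syntax, scope in by_syntax.items():
            groups.setdefault(scope, []).append(syntax)
        if len(groups) > 1:
            result[construct] = groups
    return result


if __name__ == "__main__":
    for construct, groups in mismatches(SCOPES).items():
        print(construct)
        for scope, syntaxes in groups.items():
            print(f"  {scope}: {', '.join(syntaxes)}")
```

A color scheme author currently has to list every variant in each group to get the same color for one construct across these syntaxes.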

Progress

[x] Regular Expression
[ ] JavaScript
[x] Python
[x] Clojure
[ ] PHP
[ ] Ruby

deathaxe commented 5 years ago

> Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

A very good point. I also wonder why a dedicated regexp syntax exists while different syntaxes use their own implementations. I can imagine three possible reasons:

1. historical development
2. different feature levels and implementations of the underlying regexp engines of the various languages, which make it impossible to merge everything together without some things being highlighted incorrectly in individual syntaxes.
3. the dedicated regexp syntax seems quite heavy compared to some others and causes significant slowdowns in parsing when embedded into other languages. After I embedded that syntax into a new Tcl implementation to overcome the string.regexp.tcl limitations, parsing of some official Tcl library sources slowed down by 20 to 30%.

That said, I agree that the regexp syntaxes are a bit inconsistent in their scope naming. I'd guess the scopes were applied based on existing color schemes rather than by logical structure. One reason might be that there is no clear set of rules for how to name the different parts of a regexp.

I'd never call ?<= a constant, for instance. As the definition of a lookbehind it would need to be scoped as keyword.operator or punctuation.definition.lookbehind. Same with all the parentheses: they are not operators but punctuation.

\d and \w and friends are constant.character.escape.

keith-hall commented 5 years ago

I definitely think number 2 is the biggest contributor; at least that is why I haven't switched any syntax definitions I have worked on to use the "generic" one (where number 1 applies). (It's not generic, it's designed with ST's Find functionality in mind. For example, whether \< is an unnecessarily escaped char or a meta character depends on the engine used.) I hadn't compared performance, but the slowdown doesn't surprise me, as the embedded regex definitions are generally much simpler and less accurate than the main standalone one (not referring to it as "generic" any more ;)). That said, clearly there is room for improvement/unification of scopes. Maybe the embedded ones could include contexts from the standalone one, if we design it in such a way that those contexts are generic enough to apply to multiple regex parser/engine implementations, so that we don't duplicate work/scopes etc.

keith-hall commented 3 years ago

Now that we have syntax inheritance, it makes more sense than ever IMO to have a "base/common" regex syntax, and inherit from that for "extra" features not generally supported. So we might have separate syntaxes like (maybe we already do, didn't really check):

Regex Common
Regular Expressions (<- the one used in the Find panel)
PHP Regular Expressions
Python Regular Expressions

and so on. Then, most scopes would be set by Regex Common and it could solve any scope mismatches. It does mean we'd have more syntax definitions as they could no longer be embedded in the "owning language"'s syntax definition, but as they'd be hidden I think it would have little real-world impact.
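
As a rough analogy only (this is Python, not sublime-syntax, and the class names, the scopes dictionaries and the engine-specific entries are all hypothetical), the idea is that the common definition decides the shared scope names once and each derived definition only adds what its engine supports:

```python
class RegexCommon:
    """Stand-in for a hypothetical shared base syntax."""

    # Shared constructs get their scope name in exactly one place.
    # These two names are the ones the standalone syntax uses in the
    # examples at the top of this issue.
    scopes = {
        "alternation": "keyword.operator.alternation.regexp",
        "anchor": "keyword.control.anchors.regexp",
    }

    @classmethod
    def scope_for(cls, construct):
        """Look up the scope for a construct, or None if unsupported."""
        return cls.scopes.get(construct)


class PythonRegex(RegexCommon):
    # Hypothetical engine-specific extra; the scope name is invented
    # purely for illustration.
    scopes = {
        **RegexCommon.scopes,
        "conditional-group": "keyword.control.conditional.regexp",
    }


class PHPRegex(RegexCommon):
    # Hypothetical engine-specific extra; the scope name is invented
    # purely for illustration.
    scopes = {
        **RegexCommon.scopes,
        "recursion": "keyword.other.recursion.regexp",
    }


# Every derived definition reports the same scope for a shared construct...
assert PythonRegex.scope_for("alternation") == PHPRegex.scope_for("alternation")
# ...while extras stay local to the definition that declares them.
assert RegexCommon.scope_for("recursion") is None
```

With this shape, fixing a scope name in the common definition changes it for every derived one at once, which is the property that would remove the mismatches.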

deathaxe commented 3 years ago

Would be a great improvement (and "some" work to do).

Python and PHP already maintain their own dedicated syntax definition files. Perl and some other syntaxes use the Regular Expressions syntax. So the number of syntaxes might not increase too much.

We will probably need some hidden intermediate syntaxes anyway to properly support embedding/interpolation in the future. See #2654, #2789 or #2797.

michaelblyons commented 2 years ago

Is this basically solved now?

keith-hall commented 2 years ago

Just JavaScript left I believe - it still has a Regex syntax which needs to be refactored to extend our base regex syntax

deathaxe commented 2 years ago

PHP as well, IIRC?

michaelblyons commented 2 years ago

~~Only JavaScript now?~~

deathaxe commented 2 years ago

Haven't touched PHP's regexp so far, with regard to reusing the RegExp package.

michaelblyons commented 1 year ago

Ruby also uses regex stuff from its own syntax file, which is pretty minimal: no |, and \-anything is a constant.character.escape.

Ruby does have some heuristic to make sure that /= is divide-and-assign, rather than opening a new regex.

sublimehq / Packages

[Regular Expression] Unified scope names for (embedded) Regular Expressions #1942