Open jwortmann opened 5 years ago
A very good point. I also wonder why a dedicated regexp syntax exists while different syntaxes use their own implementations. I can imagine two possible reasons:
string.regexp.tcl
limitations the parsing time of some official TCL library sources slowed down by 20 to 30%.
That said, I agree that the regexp syntaxes are a bit inconsistent in their scope naming. I'd guess the scopes were applied based on existing color schemes rather than by logical structure. One reason might be that there is no clear set of rules for how to name the different parts of a regexp.
I'd never call `?<=` a constant, for instance. As the definition of a lookbehind it would need to be scoped as `keyword.operator` or `punctuation.definition.lookbehind`. Same with all the parentheses. They are not operators but punctuation, ... .
`\d` and `\w` and friends are `constant.character.escape`.
I definitely think number 2 is the biggest contributor; at least that is why I haven't switched any syntax definitions I have worked on to use the "generic" one (where number 1 applies). (It's not generic, it's designed with ST's Find functionality in mind. For example, whether `\<` is an unnecessarily escaped char or a meta character depends on the engine used.)
I hadn't compared performance, but it doesn't surprise me, as the embedded regex definitions are generally much simpler and less accurate than the main standalone one (not referring to it as "generic" any more ;)).
That said, clearly there is room for improvement/unification of scopes. Maybe the embedded ones could include contexts from the standalone one, if we design it in such a way that those contexts are generic enough to apply to multiple regex parser/engine implementations, so that we don't duplicate work, scopes, etc.
Now that we have syntax inheritance, it makes more sense than ever IMO to have a "base/common" regex syntax, and inherit from that for "extra" features not generally supported. So we might have separate syntaxes like (maybe we already do, didn't really check):
and so on. Then, most scopes would be set by Regex Common and it could solve any scope mismatches. It does mean we'd have more syntax definitions as they could no longer be embedded in the "owning language"'s syntax definition, but as they'd be hidden I think it would have little real-world impact.
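To make the inheritance idea concrete, here is a minimal sketch of what a shared base and one derived syntax could look like. The file names, the `Packages/User/...` path, the `source.regexp.common` / `source.regexp.find` scopes and the `keyword.control.anchor.regexp` scope are placeholders for illustration, not existing files or agreed names. Only `extends`, `meta_prepend`, `hidden` and `version` are actual `.sublime-syntax` keys, and the two rule scopes in the base are the ones the standalone RegExp syntax is reported to use in this issue.

```yaml
%YAML 1.2
---
# Hypothetical file: Regex Common.sublime-syntax
# Holds the rules (and therefore the scope names) shared by all engines.
name: Regex Common
scope: source.regexp.common
version: 2
hidden: true

contexts:
  main:
    - include: alternation
    - include: character-classes

  alternation:
    - match: \|
      scope: keyword.operator.alternation.regexp

  character-classes:
    - match: \\[dDsSwW]
      scope: keyword.control.character-class.regexp
```

```yaml
%YAML 1.2
---
# Hypothetical file: Regex (Find).sublime-syntax
# Inherits every context from the base and only adds engine-specific rules.
# A derived syntax has to declare the same version as its parent.
name: Regex (Find)
scope: source.regexp.find
version: 2
hidden: true
extends: Packages/User/Regex Common.sublime-syntax

contexts:
  character-classes:
    # Keep the inherited rules and put the extra one in front of them.
    - meta_prepend: true
    # Assumption: this engine treats \< and \> as word-boundary meta characters.
    - match: \\[<>]
      scope: keyword.control.anchor.regexp
```

With a layout like this, a color scheme would only need to target the scopes defined in the base, and each engine-specific syntax would stay small.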
Would be a great improvement (and "some" work to do).
Python and PHP already maintain their own dedicated syntax definition files. Perl and some other syntaxes use the Regular Expressions syntax. So the number of syntaxes might not increase too much.
We will probably need some hidden intermediate syntaxes to properly support embedding/interpolation in the future anyway. Just see: #2654, #2789 or #2797.
Is this basically solved now?
Just JavaScript left I believe - it still has a Regex syntax which needs to be refactored to extend our base regex syntax
PHP as well, IIRC?
Only JavaScript now?
Haven't touched PHP's regexp so far, with regards to reusing RegExp package.
Ruby also uses regex stuff from its own syntax file, which is pretty minimal: no `|` and `\`-anything is a `constant.character.escape`.
Ruby does have some heuristic to make sure that `/=` is divide-and-assign, rather than opening a new regex.
Various languages have inbuilt support for Regular Expressions and some default syntax packages provide rules to apply suitable scopes within RegExp strings. However, these scope names seem to differ in various places. For example, the standalone RegExp syntax and Clojure use `keyword.operator.alternation.regexp` for the `|` symbol, while JavaScript, Python and PHP use `keyword.operator.or.regexp`. Another example is character classes such as `\d` or `\w`, which get the scope `keyword.control.character-class.regexp` in the standalone RegExp syntax, `constant.other.character-class.escape.backslash.regexp` in JavaScript and `constant.character.character-class.regexp` in Python and PHP. Other languages such as Tcl and Ruby recognize RegExp strings, but do not apply specific scopes other than `string.regexp`, which prevents syntax highlighting of Regular Expressions in these languages.
I want to refine my color scheme for consistent RegExp highlighting, but the currently used scope names make it difficult to find common highlighting rules for all languages. My knowledge of syntax definitions is somewhat limited, but as far as I know there is the possibility to embed a syntax within another language syntax (e.g. CSS in HTML). Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.
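For illustration, the embedding approach mentioned above might look roughly like the sketch below. The host syntax, its `source.host-sketch` scope and the `/.../` literal form are made up for the example. It assumes the standalone RegExp syntax is reachable as `scope:source.regexp`, and it is not taken from any of the shipped packages.

```yaml
%YAML 1.2
---
# Hypothetical host syntax, for illustration only.
name: Host Language (sketch)
scope: source.host-sketch
version: 2
hidden: true

contexts:
  main:
    # Opening slash of a regex literal: hand the body over to the
    # standalone RegExp syntax so it assigns the inner scopes.
    - match: /
      scope: punctuation.definition.string.begin.host-sketch
      embed: scope:source.regexp
      embed_scope: string.regexp.host-sketch
      # Simplification: stop at the first slash not preceded by a backslash.
      # A sequence such as \\/ is not handled correctly by this pattern.
      escape: '(?<!\\)/'
      escape_captures:
        0: punctuation.definition.string.end.host-sketch
```

Every language that embeds the same syntax this way would get the same scope names inside its RegExp strings, which is what a color scheme needs in order to use one set of rules.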
Regular Expression syntax:
JavaScript syntax:
Python syntax:
Ruby syntax:
Clojure syntax:
Progress