revisit the syntax regexes with scala sublime users

fommil commented 7 years ago

@dickwall tells me that @djspiewak has been rewriting the sublime syntax regexes to be super efficient and free of false negatives on idiomatic code. We might want to sync up at some point so that we get all the benefits. Daniel, what license are you using and where can we find your regexes?

fommil commented 7 years ago

There are also a few places where the regex-based matchers in emacs consistently fail, e.g. marking type parameters as constants instead of types. ENSIME overrides that but it feels a bit heavyweight.

djspiewak commented 7 years ago

@fommil You can find them in the sublimehq/Packages repository. The license appears to be here, and it looks like a modified BSD, where by "modified" I mean "most of the clauses deleted". So it's basically "public domain".

The changes are all in the Scala/Scala.sublime-syntax file. Note that I have several outstanding PRs which continue to improve on the situation.

A lot of the benefits of my changes come down to the use (and abuse) of a few discrete features:

Stateful context shifting. Sublime modes have a notion of "context", and which context you're in is determined by what is on the top of the stack. This is obviously not at all unusual. What is slightly more unusual is that it has a construct for popping and pushing in the same operation, meaning that you have the ability to encode directed state transitions in a syntax mode. I use this in several places, but most notably:
- Class/method definitions
- Contextual non-backtracking "lookbehind" for highlighting the unit constant (and not empty method parameters)
- Spec-compliant XML parsing
- Auto-indentation guiding meta scopes (sublime has a free-form notion of scoping, which allows the application of overlapping scopes which don't affect highlighting, but can affect other things like indentation) for case expressions. This is required because the indentation rules for the first case in a block are different than for all subsequent cases, and the indentation of the closing curly brace is dependent on the exact form of the immediately-preceding case
- Highlighting of types in delimited and undelimited position
- more stuff that I've probably forgotten
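
To make the pop-and-push idea concrete, here is a toy Python model of a context stack. It is only a sketch under stated assumptions: it is not Sublime's actual engine, and the context names (`class-name`, `class-params`, `class-body`) are invented for illustration. In `.sublime-syntax` files the combined pop-then-push operation is spelled `set`; the point is that a chain of `set` steps walks through states without growing the stack.

```python
# Toy model of a syntax-mode context stack (hypothetical, not Sublime's engine).
# "set" pops the current context and pushes a new one in a single step,
# which is what lets a syntax mode encode directed state transitions.

def apply(stack, action, target=None):
    """Return a new stack after applying 'push', 'pop', or 'set'."""
    if action == "push":
        return stack + [target]
    if action == "pop":
        return stack[:-1]
    if action == "set":
        return stack[:-1] + [target]
    raise ValueError(f"unknown action: {action}")

# Walking through something shaped like `class Foo(x: Int) { ... }`.
# The context names are made up for this example.
stack = ["main"]
stack = apply(stack, "push", "class-name")    # saw `class`
stack = apply(stack, "set", "class-params")   # saw the name, move on
stack = apply(stack, "set", "class-body")     # saw `)`, move on
stack = apply(stack, "pop")                   # saw `}`, back where we started
```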

Extremely fast lookahead. Sublime's regular expression engine operates on a line-by-line basis, which is somewhat restrictive, but it does allow the engine to deterministically compile ambiguous lookahead into a non-backtracking PDA, which is why the highlighting (and symbol indexing) is as fast as it is. So a lot of things in the revised Scala mode are done via positive lookahead. For example, at every expression position, we aggressively look ahead to see if we can find a => token (or the equivalent Unicode) while passing over only tokens which can be involved in a lambda parameter block (e.g. parentheticals, ascription, variables, etc). If that (quite complex!) lookahead matches, we don't consume anything and instead push a new context which parses those tokens as a lambda declaration. This would be hideously inefficient in most regular expression engines.
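
A heavily simplified sketch of the lambda lookahead, written with Python's `re` module. The pattern and the `starts_lambda` helper are assumptions made for illustration and are far cruder than the real mode: they only accept a bare identifier or a flat (non-nested) parenthesized list before the arrow. Python's engine also backtracks, so this shows the shape of the check, not its performance. The key property is that the match is zero-width, so nothing is consumed and the same tokens can be re-parsed in a dedicated context.

```python
import re

# Hypothetical, simplified pattern: an identifier or a flat parenthesized
# parameter list, followed by `=>` or the Unicode arrow. Wrapped in a
# positive lookahead so that a successful match consumes no input.
LAMBDA_AHEAD = re.compile(
    r"(?=(?:[A-Za-z_]\w*|\([^()]*\))\s*(?:=>|⇒))"
)

def starts_lambda(line, pos=0):
    """Return True if a lambda parameter block appears to start at `pos`."""
    return LAMBDA_AHEAD.match(line, pos) is not None
```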

Free-form scoping. As I mentioned, Sublime has a notion of entirely free-form scoping. There are guidelines for producing scopes which are likely to be respected by most color schemes, but scopes are hierarchical and it's always possible to produce ever-more-specific scoping underneath broad scopes which are highlighted by most schemes. For example: entity.name.class vs entity.name.trait and so on. Most schemes highlight just entity.name or even just entity, but the specific scoping gives a lot of power. Sublime scopes are fairly close to CSS classes in spirit (they can overlap, refine each other, etc), which provides a great deal of power even above and beyond what color schemes can respect. Meta-scoping affecting auto-indentation is one example of this, but there are other examples.
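
The hierarchical matching can be illustrated with a small Python sketch. This is a toy model, assuming that a selector matches any scope it is a dot-separated prefix of and that the longest matching selector wins; Sublime's real selector scoring is more involved, and the scheme and colors below are invented.

```python
def selector_matches(selector, scope):
    """A selector matches a scope when it is a dot-separated prefix of it."""
    sel, parts = selector.split("."), scope.split(".")
    return parts[:len(sel)] == sel

def pick_color(scheme, scope):
    """Return the color of the most specific matching selector, or None."""
    matches = [s for s in scheme if selector_matches(s, scope)]
    if not matches:
        return None
    best = max(matches, key=lambda s: len(s.split(".")))
    return scheme[best]

# Hypothetical color scheme that only knows about broad scopes.
broad_scheme = {"entity": "yellow", "entity.name": "green"}

# A more detailed scheme can opt in to the finer distinction.
detailed_scheme = dict(broad_scheme, **{"entity.name.trait": "blue"})
```

With `broad_scheme`, both `entity.name.class.scala` and `entity.name.trait.scala` fall back to the `entity.name` rule, while `detailed_scheme` colors traits separately. A scope that no selector matches, such as a `meta.*` scope used only for indentation, simply gets no color.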

So I don't know how much of this is applicable to emacs or even ensime, but it's a thing. :-)

fommil commented 7 years ago

@djspiewak cool, thanks! In raw emacs mode there is definitely a concept of local context matching. As for ensime, it gets the semantic information from the AST, so no regexes are needed. When ensime is enabled it overrides the pattern matching from scala-mode, but you're right that it's sometimes useful to have fast regex matchers find the right semantics first, before waiting for ensime (which then effectively confirms the colouring). And of course there are people who use emacs without ensime, so it's good for it all to match up as closely as possible. We're far from that goal right now.

djspiewak commented 7 years ago

@fommil Yeah, I've given some thought to that from a Sublime standpoint (specifically, what would be the optimal way for ENSIME to interact with the mode). I think that what should happen, ideally, is the semantic highlighting would refine the scopes. The most notable place where this would happen is applying the variable.function.scala scope to any tokens which correspond to function invocations, and maybe a meta.coercion.scala scope (or something more imaginative) to expressions which are implicitly converted. That sort of thing. Basically all of the syntactic things which are highlighted by Sublime (at this point) are accurate, though there are a couple places (e.g. where a lambda declaration is broken by a newline) where we underapproximate in unidiomatic usage.

But broadly speaking, the fact that we can just toss more scopes on what is already there (which color schemes can choose to highlight or just ignore) is very powerful, and it will eventually allow the semantic highlighting in Sublime ENSIME to be quite advanced and also gracefully and performantly fall back on the (now quite accurate) core mode.
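
A rough Python sketch of what "tossing more scopes on" could look like. Everything here is hypothetical: the token representation, the `refine` helper, and the idea that a semantic pass reports scopes by token index are assumptions for illustration, not how Sublime or ENSIME actually exchange data. The syntactic scopes are left untouched, so dropping the semantic layer falls back to the core mode's result.

```python
def refine(tokens, semantic_scopes):
    """Layer semantic scopes over syntactic ones without replacing them.

    tokens: list of (text, [scopes]) pairs from the syntactic pass.
    semantic_scopes: {token index: extra scope} from a semantic pass.
    """
    refined = []
    for i, (text, scopes) in enumerate(tokens):
        extra = [semantic_scopes[i]] if i in semantic_scopes else []
        refined.append((text, scopes + extra))
    return refined

# Syntactic result for something like `println(`, with made-up scopes.
tokens = [
    ("println", ["source.scala"]),
    ("(", ["source.scala", "punctuation.section.group.begin.scala"]),
]

# A semantic pass identifies token 0 as a function invocation.
refined = refine(tokens, {0: "variable.function.scala"})
```

A color scheme that knows nothing about `variable.function.scala` can ignore the extra scope and still highlight from the syntactic ones.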

hvesalai commented 7 years ago

Are the regexes at all comparable at the moment? I.e. is this something that can actually be accomplished in finite time?

djspiewak commented 7 years ago

@hvesalai They're probably not directly comparable, but most of the basic stuff should be easily converted. More complex stuff like the type environment, lambda lookahead, etc might not be easily achieved. Most of the basic stuff in the new sublime mode is actually taken from the Scala Specification (e.g. definition of numeric literals, variables, etc).

hvesalai / emacs-scala-mode

revisit the syntax regexes with scala sublime users #121