Closed user-982634 closed 2 years ago
Good finds.
The second is definitely a bug. The syntax's implementation of the automatic semicolon insertion algorithm wasn't looking for the instanceof
operator. I've added that, and the in
operator as well.
The first is a bit trickier. In most places, await
is a valid identifier, but in async functions, it's a keyword. Making this distinction in the syntax definition would essentially require another copy of most of the definition — plus a third copy for yield
in generator functions, and possibly a fourth for the async generator proposal. On the other hand, this bug only occurs if a developer inadvisably names a variable await
in a context where that is not outright forbidden, and the impact should be minor outside contrived cases. Therefor, I suspect that the cure (probably doubling the size of the syntax) would be worse than the disease.
I've added that, and the in operator as well.
Thank you for the fast fix!
I suspect that the cure (probably doubling the size of the syntax) would be worse than the disease.
I am pretty sure there must be an easier solution than doubling the syntax. Ok, I agree that the first test case may be a low priority bug, but it is still a bug and it breaks highlighting to the end of the file.
But can you please also look at other examples I posted,I mean for example of
operator or asynchronous for loop, or a class extending a regex? I am sure you can fix these easily like you fixed the second test case
I am pretty sure there must be an easier solution than doubling the syntax.
Take a look at how the ECMAScript grammar is formally specified; its productions (corresponding loosely to sublime-syntax contexts) are parameterized by yield?
and await?
parameters. Sublime has no analogue to this. The only real solution would be to have a separate copy of each context for every combination of parameters for the associated productions. If there is an alternative solution, I would love to see it!
it breaks highlighting to the end of the file.
Ordinarily, there ought to be reasonable fallback behavior in such cases, but here I don't see how that could work. Open to suggestions.
(a
)=>a
(
a)=>a
These are unfixable. JavaScript's syntax in these places is not deterministic context-free. All that Sublime can do is make a reasonable guess and try to recover gracefully from a wrong guess. If the highlighting fails ungracefully, that can and should be fixed.
I have proposed a syntax extension that would allow Sublime to parse arbitrary (not-necessarily-deterministic) context-free languages, but it's not a trivial feature. I'm hoping in the next month or so to come up with a reference implementation that might help to illuminate the issue.
for(a
of/a/)a
This is also, in principle, not deterministic context-free. It might be possible to create a cover grammar that addresses this, but it would also likely involve a fair bit of duplication. I'll take a look at the ECMAScript grammar and see what can be done.
~class
extends
/\//
{}
/a
~class extends class{}{}/a
Have you tried #1581?
Sublime has no analogue to this.
It doesn't mean it cannot be implemented. I don't see the problem: if something cannot be parsed using the current sublime's syntax parser, then extend the set of parser's capabilities. Refactor the sublime's syntax notation, introduce new functionalities which will provide a way of parsing JavaScript syntax.
Open to suggestions.
My suggestion is: implement new syntax notation for the sublime highlighter (sublime-syntax
files) which will be able to parse JavaScript syntax properly. It may be done in sublime's core or outside, but definitelly not in a 3rd plugin, since the core parser should properly parse JavaScript.
These are unfixable.
Sorry, but I reject to believe in that. Everything can be fixed, the only question is the cost of the fix.
I have proposed a syntax extension that would allow Sublime to parse arbitrary (not-necessarily-deterministic) context-free languages
Sounds awesome, but only when it becomes part of sublime's core (I don't want to install 3rd party plugins in order to properly highlight JavaScript)
To summarize, these test cases are not "unfixable". It may be currently hard to fix them because of sublime's poor syntax notation capabilities, but I hope it will be improved in the near future. Of course, according to the fact that some of these test cases are almost nowhere seen in real code, I don't insist this to be fixed immediatelly, but I also disagree that such test cases can be simply ignored. Sublime is the most overpowered text editor out there, so it should parse JavaScript properly. These issues may be fixed in a month, in a year, or in 10 or 20 years, but they should be fixed and not closed as wontfix, since they represent a valid syntax.
These are unfixable. JavaScript's syntax in these places is not deterministic context-free. All that Sublime can do is make a reasonable guess and try to recover gracefully from a wrong guess. If the highlighting fails ungracefully, that can and should be fixed.
What about this? The code looks disgusting. I don't think this is acceptable...
This repository is for the default packages. Suggestions for new core functionality, including extensions to the sublime-syntax format, should be made in the core issues tracker as they are outside the scope of this repository. The corner cases in question cannot be fixed by any change within the scope of this repository. I have already linked the issue on the core repository that would allow those cases to be addressed here.
Sounds awesome, but only when it becomes part of sublime's core (I don't want to install 3rd party plugins in order to properly highlight JavaScript)
That proposal cannot be implemented by a third-party plugin. The parser is built into the Sublime core and is not extensible via plugins.
What about this? The code looks disgusting. I don't think this is acceptable...
Are you sure that your interpretation of that snippet is correct?
Are you sure that your interpretation of that snippet is correct?
Yes, I am sure. Please copy-paste the following snippet in the browser console and you will see 123
printed. It cannot be printed from a comment, right...
(
a)=>{}
/a\/*/
someFunction();
function someFunction(){
var str = "This is a valid JavaScript code, not a comment";
Thom1729_said: 'If the highlighting fails ungracefully, that can and should be fixed.';
console.log('123')
}
//*/
I've tracked down the issue. Arrow function expressions are specified strangely in the formal grammar so that if the ConciseBody
is an AssignmentExpression
, operators other than the comma operator will bind tightly within that expression. When the body is a block, this has the side effect of forbidding certain constructs involving operators on arrow functions. The core Sublime syntax does not account for this, instead treating arrow functions like PrimaryExpression
s. The distinction is sufficiently obscure that I suspected that the fault lay with Chrome's devtools until I traced the parse manually through the spec.
This is a fixable bug. It is not related to the arrow function parameter list issue; the existing fallthrough logic for arrow function parameter lists misparsed as parenthesized expressions works just fine. The problem begins after the block. If the arrow function were a PrimaryExpression
, then the subsequent slash would denote a division operator, but because of the unique way in which arrow functions are specified a division operator would be invalid, and so the ASI algorithm inserts a semicolon immediately before the slash, which is of course then parsed as the beginning of a regular expression literal.
Here's a simpler example:
x=>{}
/a/g;
Somewhat alarmingly, both Babel and Uglify parse this incorrectly. At least Sublime is in good company! On the other hand, I would expect most unsophisticated syntax highlighters (such as GitHub's) to mark this correctly, because it would trigger the obvious dumb heuristics for regexp detection.
This is a nice find. How did you run into this? Not in production code, I hope.
Due to the complexity of the bug, it may be a few days before I have the time to design a solution.
Somewhat alarmingly, both Babel and Uglify parse this incorrectly. At least Sublime is in good company
In a good, but not in the best. For example, CodeMirror gets it right.
I'm not sure that I'd consider CodeMirror to be "the best" company. The syntax highlighting seems to be a dumb tokenizer. As I mentioned, I would expect most such systems to highlight this example correctly by pure guesswork. Unfortunately, such heuristics will inevitably guess wrong in many cases. See PR #1009 for a discussion and some links.
Most syntax highlighters use simple heuristics, with generally tolerable results. It would be unreasonable to expect such a highlighter to reliably handle complex syntax. At the opposite extreme, some editors build a real abstract syntax tree. This is theoretically ideal, but the drawbacks are significant. Implementing a language requires implementing a full parser and tokenizer or porting an existing one, so most such editors support only one language or a small handful of languages. In addition, full-blown AST parsers are much more resource-intensive than dumb heuristic highlighters and may not be suited to real-time code editing.
Sublime's syntax highlighting engine is intermediate between these extremes. It implements a real deterministic context-free parser, allowing it to exactly parse many common languages and to do a very good job with nearly all common languages. But the engine is very fast -- guaranteed linear time, I believe, outside of edge cases. Incremental parsing is supported, which is very important for a real-time editor. This efficiency comes at a cost: incremental parsing limits lookahead, while the linear time guarantee limits the parser to deterministic context-free languages.
JavaScript is context-free, but not deterministic context-free. Python is not context-free at all. Extending Sublime's engine to exactly handle more general languages is not easy in the first place, but any such extension must also preserve reasonable time complexity guarantees for most syntaxes. This is a difficult problem to solve. I think that it's worth solving, which is why I wrote my proposal for nondeterministic parsing, and why I'm working on my own parser implementation to use as a testbed for such extensions.
It's important to maintain perspective. Sublime's syntax highlighting engine is already the best in the business by a comfortable margin.
I think that it's worth solving, which is why I wrote my proposal for nondeterministic parsing
I know you wrote the proposal, you've said it four times. But I don't see the problem about this issue: once the proposal is implemented in the core, there may be an option like "Enable perfect parsing". If a user doesn't check that option, the highlighter will work in linear time complexity, but some edge cases will not be highlighted correctly. Otherwise, if a user checks it, all edge cases will be correctly highlighted, but linear complexity will not be guaranteed. I think it is the best idea so far, all users will be satisfied: ordinar users will use basic parsing and advanced users will use advanced parsing. But anyway, sublime should definitely provide both ways.
btw, I've reported the babel issue, I'm posting the link here for reference; https://github.com/babel/babel/issues/7993
It would be a lot harder than you might think to make such a feature optional. Either a syntax definition uses the new feature, or it doesn't. In order to make the feature optional, either there would need to be two parallel syntax definitions (a maintenance headache and inevitable source of bugs) or the syntax definition would somehow have to be crafted to (mostly) work without the feature. That sounds unlikely.
Moreover, I don't think that "ordinary" and "advanced" users have significantly different needs when it comes to syntax highlighting accuracy/performance tradeoffs. The feature would have to provide reasonable performance guarantees to be useful at all.
ok, here is my point of view:
There are text editors like VisualStudio or Eclipse which highlight the code while typing and do it perfectly (no mishighlights at all) and do it very fast (I suppose in linear time). I cannot talk about VisualStudio, because it is not open-source, but Eclipse has a parser (a serious parser, not just a highlighter or tokenizer) which runs in a separate thread and highlights the code while user types. Eclipse supports wide range of languages and provides an easy way for adding new languages.
As I said, I'm not a mathematician and I'm not familiar with the technical definition of a language, but here is my understanding: language can be considered as a set of arrays of bytes. Every element of that set represent a valid code in the given language (a code which doesn't throw syntax error when ran). According to that definition (and I suppose it is not far away from the real definition), I think the proposal Thom is talking about is not sufficient. To demonstrate what I mean, let's consider the following example:
example: let's define a new language. A string is considered to be a valid code in that language if and only if it's sha512 hash starts with byte 0x00. So, my question is: how would sublime properly highlights valid scripts as valid and invalid scripts as invalid in such language? I mean, how the syntax notation (even with the Thom's extension) would be able to distinguish them?
In most languages that is not the case, but I don't see a reason not to support even these languages. Native sublime's packages may not include such langauges, but it would be good to allow syntax package writers to support arbitrary languages. Therefore, here is my proposal:
Sorry for my inability to write this in a technical form, but anyway, here is my idea. Implement a new syntax notation in the core which will be able to parse arbitrary language and distinguish it's parts (expressions, functions, whatever it has). Suppose we have a simplified c-like language:
int a = 5;
int b = 7;
a = a + b;
int a = 10;
(the above code is invalid on purpose to demonstrate the ability of this proposal to detect it (redefinition of the variable a
))
So we basically only have variable declarations of int
type, we have initializations (like int a = 5
), we have identifiers (for variable names), we have assignments and +
(addition) operators. That's it. Here is a pseudo code of the syntax notation which would be able to parse this code:
code:
match_one_of:
statement ";"
code "\n" statement ";"
statement:
match_one_of:
variable_declaration
assignment
variable_declaration:
match_exactly:
"int " new_ident " = " integer
assignment:
match_exactly:
old_ident " = " old_ident " + " old_ident
integer:
match_exactly:
/\d+/
new_ident:
match_exactly:
/[a-zA-Z][a-zA-Z0-9]*/
check:
if($.includes(match)) return false;
$.push(match);
return true;
old_ident:
match_exactly:
/[a-zA-Z][a-zA-Z0-9]*/
check:
return !$.includes(match);
(this is a pseuodo-code, don't be surprised if you find a typo)
So, let me explain this. This syntax notation consists of non-terminals (code
, statement
, variable_declaration
, etc). Each non-terminal has it's definition which is either match_one_of
or match_exactly
(which may be even the same). This is very similar to EBNF, with two very important differences: 1) it supports regular expressions (^
at the beginning may be auto-inserted) and 2) it supports turing-complete checks (implemented in some language, here I used JavaScript because I am familiar with it).
So basically, the parser always starts from the code
non-terminal. It tries to parse the first list of terminals and non-terminals provided (among all possibilities provided under match_one_of
). Just to mention: terminals are strings under quotation marks. So, here the parser tries to parse the code
non-terminal, It choses statement
(the first from the list). So, parser's supposition is code -> statement
for now. Then parser tries variable_declaration
, as it is the first on the list of statement
. It successfully parses int a = 5;
as code -> statement -> variable_declaration
, but then it realizes there are more code. So, the supposition is wrong. Then a rollback mechanism is applied until it can prove that nothing more can be mached and that the code is invalid according to the given syntax.
The interesting part here is new_ident
and old_ident
. You can see a new label named check
. It is used for arbitrary (turing-complete) checks based on the matched string. If the checks fails, a rollback algorithm is applied. There are two variables: $
which represents a local storage object (can contain arbitrary stuff, in this case it is an array) and match
which is a string of the matched regex. If either parser cannot match the given regex at the beginning of the string (from the current position in the script) or the check fails, a rollback is applied restoring $
to the the previous state. Of course, there may be tons of optimizations to make it almost no memory or time expensive.
There are several problems with this method (but I think it is still not unsolvable). One of the problems is that someone may write syntax notation in such form that the parser will spin in a loop forever (suppose we swap lines statement ";"
and code "\n" statement
. In such case, the parser will keep increasing internal stack size, but nothing will be parsed and the sublime will crash very fast. So, the syntax notation writers must take care about the order of the definitions, especially in a complex language.
Another problem is the exponential growth of the time needed for parsing some statements. A good example is the examle Thom provided in their proposal:
(
a = (
b = (
c = (
d
) => null
) => null
) => null
) => null
This will trigger a loop with exponential complexity which for large test cases, even if runs in a separate thread, will not complete in a reasonably short time, causing the rest of the syntax to be mishighlighted. But, as thom said:
The solution, fortunately, is not very complicated. We can cache the result of each "decision" for the duration of a single parse.
I think it can be applied in my proposal too.
Another issue with my proposal is the rollback algorithm. Since $
which represents local storage may contain arbitrary stuff at any point during parsing, it is very hard to make some optimizations based on it's content. One of the solutions is to clone it every time a check
labeled script modifies it. It is very memory expensive. Maybe there can be applied some optimizations based on the script itself, but since the script that modifies it is turing-complete it is, in general impossible. The only idea that I think would be worth considering is to reduce the set of available values that can be stored in it (for example limit $
to be of type Uint8Array
, so we can know that it must contain bytes and not arbitrary objects or strings).
I think this idea is worth considering. I am anyway writing my own parser with such syntax notation for parsing Esolang, so even if sublime doesn't want to implement this, I will publish my own highlighter which is, of course not that fast, but at least it perfectly highlights all scripts in the given language, and more important: it will be opensource.
My proposal allows generating of the abstract syntax tree (which is also the result of parsing a non-terminal) which perfectly fits into the requirements: we can exactly tell what is a function, what is a variable, what is an expression and so on.
This would provide a lot of new functionalities (highlighting non-declared variables as invalid or assigning a void*
value to a variable of type int
etc). It allows expanding of c and c++ macros and other assembly directives. It also allows resolving external headers and reporting problems instead of compiler.
Either a syntax definition uses the new feature, or it doesn't.
It sounds reasonable to me. A lot of users are satisfied with the basic syntax highlighter. I really don't see a problem in writing a syntax definition in two different formats. Some languages doesn't even need this proposal, so we can keep the old syntax notation for these languages. But, in my opinion, it would be a good idea to provide a way for writing a syntax notation for arbitrary esolangs.
Even if nobody supports this, I don't think it would be a big disaster, because I am anyway working on my own parser which will be able to parse any given language, but the only problem is because I am sure sublime would do it probably a lot more neat and do a lot more optimizations, that is why I think it would be a good idea to implement it in the core.
(Just to mention, this proposal is a general idea, it has nothing to do with the JavaScript test cases from this issue. Such test cases should and must be fixed anyway)
Sublime doesn't need to highlight every computable language. Ideally, it should be able to highlight every actual programming language. The ability to compute SHA512 hashes just isn't relevant to this objective because it isn't required to parse any programming language.
Most programming languages belong to a class called the context-free languages. These languages can be represented by a context-free grammar, commonly represented in EBNF or a similar form. There are well-behaved, efficient algorithms for parsing context-free languages, particularly a subset called the deterministic context-free languages. The actual parsers used by compilers and interpreters use a range of similar algorithms that efficiently parse these classes of languages.
Sublime's parser is essentially a deterministic pushdown automaton. Sublime syntax definitions are essentially state transition tables for a DPDA. A DPDA can be run very fast -- with some caveats, in linear time. Just as importantly, the engine can be written in efficient C code and compiled to assembly. This is extremely important for a real-time syntax highlighter. In addition, Sublime's implementation does not look across line boundaries, which allows it to implement optimizations that minimize reparsing as the text changes.
Eclipse and Visual Studio allow users to essentially write their own parsing engines. These extensions are compiled to native code (or to VM code). This allows users to parse anything that's parseable by any means. There is no free lunch here; parsing more complex languages is slower. It is mathematically impossible for Visual Studio or any other software to parse JavaScript in linear time. (No one holds up those editors as models of efficiency and responsiveness in the first place.) This approach has several additional downsides:
You're proposing a format, but not an algorithm. The format supports nondeterministic context-free languages. What algorithm parses them? There are several possible answers, none of which runs in linear time. I wrote the nondeterminism proposal the way that I did because the algorithm is far more important than the definition syntax; the meat of the proposal isn't the name meta_backtracking_key
, but the specific set of optimizations that avoids performance costs in the typical case and mitigates those costs in the worst cases.
In your example, you have arbitrary JavaScript code. For Sublime, it would probably be Python. Sublime's engine does not run in Python; the overhead of invoking the Python runtime mid-parse would be tremendous. Doing so for every identifier would be disastrous. Suppose that the code should be written in C++ and compiled, and you introduce a whole new class of problems.
You yourself point out that this feature would require a total reparse on every change, and your intuition that this is unfixable is basically correct.
Finally, I would note that finding type errors and such is generally the job of linters. There are high-quality linters available for most common languages, and robust linter integration is available via the Sublime-Linter package. These linters often lag behind the user's input. It would be disastrous if the linter's work held up highlighting.
You're proposing a format, but not an algorithm.
I explained the algorithm and described it very briefly, but I also mentioned that it is very unoptimal and not for real-time parsing.
But anyway, forget about it, the main thread here is how to fix the test cases which break the highlight completely (14 out of 17 test cases I posted break the highlight). When you are idle, please open some prs with the appropriate fixes. I am sure not all 17 issues need your proposal.
The following require nondeterministic parsing:
for of
loops from regular for loops.In these cases, the best we can do is make reasonable guesses and fail gracefully.
Fixing await
/yield
would require (at a rough estimate) doubling the size of the syntax. The same concern applies to improving the failure fallback for for
loops, though to a lesser degree.
Your example 7 is fixed by #1581. Example 12 is partially fixed by #1581. Several others are marked as already fixed by various PRs.
The remaining issues are:
yield
(or await
).goto
is, unfathomably, parsed as a keyword.break
and continue
do not correctly parse labels.Thanks for those fixes!
I think the 18th is easy too (see this).
Regarding item 14, @Thom1729, that’s likely a vestige of previously targeting older versions of the language. goto
was a FutureReservedWord once, along with a surprising number of others that were later dropped (int
, char
, double
, etc — full list at § 7.5.3 of the ES3 spec).
@user-982634 brought your proposed delayed-scope-assignment/non-deterministic-parsing idea to my attention (after I mistook them for someone who used to troll our repo, oops). It’s a very convincing solution!
The ability to compute SHA512 hashes just isn't relevant to this objective because it isn't required to parse any programming language.
See Bubblegum
Issue with class extends require('./interface')
:
:)
@eusonlito
It has already been fixed.
I came here to report a similar case like No.7:
The root cause is the \/
inside a RegExp literal. I think this is trivial but quite annoyed. And it make the bracket matching engine throw false alarm also.
Regarding above issue with \/
in RegExp literal, I tried following patch in JavaScript.sublime-syntax, and so far it looks fine for me:
regexp:
- meta_include_prototype: false
- meta_scope: string.regexp.js
- - match: "/"
+ - match: '[^\]?/'
scope: punctuation.definition.string.end.js
@trongthanh LGTM. Probably it has been fixed in the current master branch. At least, I don't think I am using a custom JS syntax.
For people who are interesting in reproducing:
module.exports = {
external: [
/\/.*/,
'str'
],
rules: []
};
I can't reproduce either. I'll try to reproduce with an older branch if possible, but I may not have the chance today. In any case, the latest master seems to work.
@Thom1729 is your summary of remaining issues in https://github.com/sublimehq/Packages/issues/1600#issuecomment-391785302 still up to date? OP includes many different issues, some of which are marked as solved but not as many as your summary suggests and most of the talk in this issue is related to syntax parsing in general, not the issues at hand. While they are a very good read for the interested reader, they aren't exactly related to the reported issues and make following the status of these kind of difficult.
I suggest moving the unresolved issues, grouping those that are unfixable with the current system for archival purposes and those that can be fixed into separate issues.
@Thom1729 & @jfcherng, I just tried on my home laptop running Ubuntu and couldn't reproduce it. It's also ST3 dev build 3210. So it's pretty strange. I'll take a closer look on the machine has the issue (ST dev build 3210 on macOS) and try to find what's the real cause.
Edited: OK, It's so weird that when I removed the patched JavaScript.sublime-syntax on my problematic machine, the issue is also not reproducible. Maybe there were some intermittent config changes I made to ST3 combined with JSCustom plugin that cause the issue. So never mind.
I've created a separate issue for everything in here that isn't fixed. It should be fine to close this issue.
~await/a
a
instanceof/a/
a
in/a/
for(a
of/a/)a
~{*a(){yield
/a/}}
async a=>{for await(a of a)a}
~class
extends
/\//
{}
/a
try{}catch{}if(a)a
(a
)=>a
(
a)=>a
(
a
)=>a
~class extends class{}{}/a
a=>{}
/a/
if(0)
goto/a
a:
break a
/a/
a:for(;;)continue a
/a/
'\u{1}'