Open KapitanOczywisty opened 4 years ago
@KapitanOczywisty I think this makes sense, but vscode-textmate
has as a goal to remain compliant with TM grammars. IMHO this proposal would make it that vscode-textmate
diverges from TM and might lead to the creation of grammars that are not portable. I wonder what the exact problem you are running into is and how has it been addressed in grammars such as html
, where the closing of </script>
tag can occur in a JS comment, statement, etc.
@alexdima If properly used, this option will only prevent some nasty injection bugs. No support for this option shouldn't cause grammar to break (well... not more than already is). As always there might be someone who uses this in highly improper way, but even now there are slight differences in implementations, e.g. php with sql injections breaks differently in atom and vscode. What is worth to mention, we're importing php grammar from atom, which don't have while
support. I suspect that some of vscode's grammars are not "portable" already, at least to the atom.
Fwiw there is nothing stopping other implementations to add this feature.
About problem: There are many injection bugs atm in different languages, e.g. countless issues about embedded SQL in PHP.
Even mentioned </script>
is not addressed:
What sometimes makes funny disagreements between LSP and grammar:
We can fix some of these issues making many duplicated rules, but there are issues which could be fixed only with such feature. Also atom's language-php
in vscode uses different grammar for sql
, what makes fixes even more fun.
@KapitanOczywisty Grammars (outside of injection) already do this by default, the end
pattern is applied first by default and it is possible to configure in the grammar that the end pattern is applied last:
Otherwise, if this is about injection, then is it that both the injection and the rule matches? But it looks to me like we already handle that case with some scoring:
More likely, what I believe is the problem is that a certain rule misses an injection for the end embedded language pattern or that certain rules are written in a way where they cannot be injected (greedily eating up text and not giving a chance to the embedded language end regex to match). But IMHO this cannot be tackled with some new flag, this is a technical flaw of TM.
@alexdima this is true as long as every rule is properly closed, problems starts when we have parentheses and quotes.
Take this example (not the best practice, but easy to encounter in the wild):
<?php
$query = "SELECT * FROM `table`
WHERE `id` IN ('". implode("','", $ids) ."')";
implode
is eaten as part of sql
In vscode parentheses in sql seems to not be tokenized so removing quotes fixes issue
But in atom sql tokenizes parentheses and we have this beauty
There were attempts to fix this, but this made even bigger mess.
Flag will not fix everything, but will allow stopping sql tokenization at IN ('"
. TM is flawed, but can still be a bit better :wink:
Alternatively If there would be any way to alter tokens returned from TM, without separate tokenization in extension, I could write semanticTokenProvider to inject SQL tokens afterwards. Is there some callback like that already?
AFAIK, the "TM way" to fix this is to have a language called "sqlindoublequotes". Then a grammar is written from scratch or adapted such that this language never eats by accident "
. The outer language, like php in this case would then inject the "
end rule in all the rules of the "sqlindoublequotes".
I still don't understand your proposal though, since at a point the tokenizer reaches the position at '"
. If the sql language has a regex like /'[^']+'/
, then that will match and consume the "
. The end rule injected by php has no chance to intercept anything because the "
is entirely skipped. The solution is to not have a regex like /'[^']+'/
in SQL, IMHO either by having a begin/end rule for strings where each character is looked at individually to which php can inject, either by having this language aware of its embedder that does not eagerly consume "
.
@alexdima In my proposal end
with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed. Descendant rules cannot eat "
because it isn't passed to them anymore.
They tried to fix that way (sqlindoublequotes
), but this failed as you can see. PHP has 4 variants of writing strings:
<?php
// double quoted
$query = "SELECT * FROM `{$table["name"]}`";
// single quoted (no interpolation)
$query = 'SELECT * FROM `'. $table["name"] .'`';
// heredoc
$query = <<<SQL
SELECT * FROM `{$table["name"]}`
SQL;
// nowdoc (no interpolation)
$query = <<<'SQL'
SELECT * FROM `table`
SQL;
Writing all variants is quite an overhead. It's possible, but also time consuming. I mean I'll do it if there is no other way, but I'd rather not. And there are similar issues in other languages.
Edit: In general any option for second pass would be a huge change, even if this would be callback in JS.
In my proposal
end
with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed.
That makes sense. Then I am sorry, I misunderstood your proposal. The description applyEndPatternFirst
made me think this is about the order of the regexes or their priority. This is something completely new to TM, exactly like we do in Monarch.
Any progress on this matter? I'm having similar problems embedding markdown in my language. It works fine if the end pattern is on a separate line from the "markdown content" but as soon as it's on the same line the markdown eats it and takes over the entire document.
var a = md"""
This is some md that works
"""
var b = md"""
This doesn't work"""
# This comment is now highlighted as if it was a markdown header
This is extremely necessary. There should be a way to embed languages generically, without having to account for every possible comment/string/etc of that specific language breaking the container syntax; and having to workaround that by adding a lot of unrelated "hack" rules to fix it.
Syntax of embedded languages should be determined on a second "pass" without breaking the syntax of whatever is delimiting it in the parent language; while still allowing it to override parent escape sequences (e.g. in strings) over the embedded language.
related: https://github.com/microsoft/vscode-textmate/issues/207
The problem with adding this is that it will cause VSCode to be incompatible with TextMate2.0, Atom, Sublime etc Apple should have thought about this in the first place, but it is too late now
Atom had their own potential implementation alwaysMatchEndPattern
, but it never went through https://github.com/atom/first-mate/pull/90
VSCode's while
is a lot more strict that end
preventing embedded languages from leaking
however it is a VSCode bug https://github.com/microsoft/vscode-textmate/issues/241
There is persistent problem with injected languages where unclosed parentheses or quotes eat whole document.
I'm proposing to add option e.g.
applyEndPatternFirst
. This would check suchend
pattern at every line before anything else. I've looked at code a bit and it seems possible, but I'm not familiar enough with engine to correctly implement this (at least without trashing performance).Details:
end
match - everything beforeend
would be extracted and checked again with sub-patterns and merged withend
's tokens, while rest of line would be checked normally. Without match - business as usual.end
patterns with this option will be present, they should be stored and checked from first to last, at every line until any of them matches. There shouldn't be too many of them, since this should be used mainly with injected languages.end
with option should also match in line wherebefore
was found.Why not use
while
? Whilewhile
(sic) is great addition, it can only exclude whole line. Proposed option would allow stopping in mid-line.