Option to match end before any patterns

KapitanOczywisty commented 4 years ago

There is persistent problem with injected languages where unclosed parentheses or quotes eat whole document.

I'm proposing to add option e.g. applyEndPatternFirst. This would check such end pattern at every line before anything else. I've looked at code a bit and it seems possible, but I'm not familiar enough with engine to correctly implement this (at least without trashing performance).

Details:

On end match - everything before end would be extracted and checked again with sub-patterns and merged with end's tokens, while rest of line would be checked normally. Without match - business as usual.
When multiple end patterns with this option will be present, they should be stored and checked from first to last, at every line until any of them matches. There shouldn't be too many of them, since this should be used mainly with injected languages.
Successful match will terminate all descendant rules, after matching "before end" part of line.
end with option should also match in line where before was found.

Why not use while? While while (sic) is great addition, it can only exclude whole line. Proposed option would allow stopping in mid-line.

alexdima commented 4 years ago

@KapitanOczywisty I think this makes sense, but vscode-textmate has as a goal to remain compliant with TM grammars. IMHO this proposal would make it that vscode-textmate diverges from TM and might lead to the creation of grammars that are not portable. I wonder what the exact problem you are running into is and how has it been addressed in grammars such as html, where the closing of </script> tag can occur in a JS comment, statement, etc.

KapitanOczywisty commented 4 years ago

@alexdima If properly used, this option will only prevent some nasty injection bugs. No support for this option shouldn't cause grammar to break (well... not more than already is). As always there might be someone who uses this in highly improper way, but even now there are slight differences in implementations, e.g. php with sql injections breaks differently in atom and vscode. What is worth to mention, we're importing php grammar from atom, which don't have while support. I suspect that some of vscode's grammars are not "portable" already, at least to the atom.

Fwiw there is nothing stopping other implementations to add this feature.

About problem: There are many injection bugs atm in different languages, e.g. countless issues about embedded SQL in PHP.

Even mentioned </script> is not addressed:

What sometimes makes funny disagreements between LSP and grammar:

We can fix some of these issues making many duplicated rules, but there are issues which could be fixed only with such feature. Also atom's language-php in vscode uses different grammar for sql, what makes fixes even more fun.

alexdima commented 4 years ago

@KapitanOczywisty Grammars (outside of injection) already do this by default, the end pattern is applied first by default and it is possible to configure in the grammar that the end pattern is applied last:

https://github.com/microsoft/vscode-textmate/blob/ebcfe99ce2d262165c53e44db71575f37750c070/src/rule.ts#L535-L555

Otherwise, if this is about injection, then is it that both the injection and the rule matches? But it looks to me like we already handle that case with some scoring:

https://github.com/microsoft/vscode-textmate/blob/ebcfe99ce2d262165c53e44db71575f37750c070/src/grammar.ts#L769-L777

More likely, what I believe is the problem is that a certain rule misses an injection for the end embedded language pattern or that certain rules are written in a way where they cannot be injected (greedily eating up text and not giving a chance to the embedded language end regex to match). But IMHO this cannot be tackled with some new flag, this is a technical flaw of TM.

KapitanOczywisty commented 4 years ago

@alexdima this is true as long as every rule is properly closed, problems starts when we have parentheses and quotes.

Take this example (not the best practice, but easy to encounter in the wild):

<?php
$query = "SELECT * FROM `table`
WHERE `id` IN ('". implode("','", $ids) ."')";

implode is eaten as part of sql

In vscode parentheses in sql seems to not be tokenized so removing quotes fixes issue

But in atom sql tokenizes parentheses and we have this beauty

There were attempts to fix this, but this made even bigger mess.

Flag will not fix everything, but will allow stopping sql tokenization at IN ('". TM is flawed, but can still be a bit better :wink:

Alternatively If there would be any way to alter tokens returned from TM, without separate tokenization in extension, I could write semanticTokenProvider to inject SQL tokens afterwards. Is there some callback like that already?

alexdima commented 4 years ago

AFAIK, the "TM way" to fix this is to have a language called "sqlindoublequotes". Then a grammar is written from scratch or adapted such that this language never eats by accident ". The outer language, like php in this case would then inject the " end rule in all the rules of the "sqlindoublequotes".

I still don't understand your proposal though, since at a point the tokenizer reaches the position at '". If the sql language has a regex like /'[^']+'/, then that will match and consume the ". The end rule injected by php has no chance to intercept anything because the " is entirely skipped. The solution is to not have a regex like /'[^']+'/ in SQL, IMHO either by having a begin/end rule for strings where each character is looked at individually to which php can inject, either by having this language aware of its embedder that does not eagerly consume ".

KapitanOczywisty commented 4 years ago

@alexdima In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed. Descendant rules cannot eat " because it isn't passed to them anymore.

They tried to fix that way (sqlindoublequotes), but this failed as you can see. PHP has 4 variants of writing strings:

<?php
// double quoted
$query = "SELECT * FROM `{$table["name"]}`";
// single quoted (no interpolation)
$query = 'SELECT * FROM `'. $table["name"] .'`';
// heredoc
$query = <<<SQL
SELECT * FROM `{$table["name"]}`
SQL;
// nowdoc (no interpolation)
$query = <<<'SQL'
SELECT * FROM `table`
SQL;

Writing all variants is quite an overhead. It's possible, but also time consuming. I mean I'll do it if there is no other way, but I'd rather not. And there are similar issues in other languages.

Edit: In general any option for second pass would be a huge change, even if this would be callback in JS.

alexdima commented 4 years ago

In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed.

That makes sense. Then I am sorry, I misunderstood your proposal. The description applyEndPatternFirst made me think this is about the order of the regexes or their priority. This is something completely new to TM, exactly like we do in Monarch.

HugoGranstrom commented 3 years ago

Any progress on this matter? I'm having similar problems embedding markdown in my language. It works fine if the end pattern is on a separate line from the "markdown content" but as soon as it's on the same line the markdown eats it and takes over the entire document.

var a = md"""
This is some md that works
"""

var b = md"""
This doesn't work"""

# This comment is now highlighted as if it was a markdown header

AlfishSoftware commented 3 days ago

This is extremely necessary. There should be a way to embed languages generically, without having to account for every possible comment/string/etc of that specific language breaking the container syntax; and having to workaround that by adding a lot of unrelated "hack" rules to fix it.

Syntax of embedded languages should be determined on a second "pass" without breaking the syntax of whatever is delimiting it in the parent language; while still allowing it to override parent escape sequences (e.g. in strings) over the embedded language.

RedCMD commented 1 day ago

The problem with adding this is that it will cause VSCode to be incompatible with TextMate2.0, Atom, Sublime etc Apple should have thought about this in the first place, but it is too late now

Atom had their own potential implementation alwaysMatchEndPattern, but it never went through https://github.com/atom/first-mate/pull/90

VSCode's while is a lot more strict that end preventing embedded languages from leaking however it is a VSCode bug https://github.com/microsoft/vscode-textmate/issues/241

microsoft / vscode-textmate

Option to match end before any patterns #139