Hello,

As a follow-up to #2029, and in order to fix the long-standing parsing bugs we have, such as #1049 or #1612, I would like to clean up Kakoune's command line parsing.

The current state

Our current command line parsing is very ad hoc, and works as follows:

- Non-quoted words support escaping of %, whitespace and ;, which must be preceded by a backslash to lose their special meaning. % only has a special meaning when it is the first character of the word (in which case we try to parse it as a %...{...} string), but escaping it anywhere in the word works. Backslashes themselves cannot be escaped, so a\b is a word composed of three characters: a, \ and b. That means it is not possible to end a word with a backslash, as it will be interpreted as escaping the whitespace that should end the word.
- Single-quoted strings support escaping of the ' delimiter. As with non-quoted words, we cannot end such a string with a backslash.
- Double-quoted strings support escaping of the " delimiter and cannot end with a backslash. They are then reparsed: backslash-escaped % are replaced with %, and non-escaped ones are treated as a %...{...} string and expanded. :echo "%{a}%{b}\%" therefore outputs ab%
- %...{...} strings support backslash-escaping their delimiter if it is not nestable (not {, [, < or (). Nestable delimiters cannot be escaped.

All those string types can only start at the beginning of a word; a %, " or ' appearing in the middle of a word is taken literally.
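To make the non-quoted word rule concrete, here is a rough Python model of it. It is only a sketch of the behaviour described above, not Kakoune's actual parser: the function name is made up, quoted and %-strings are ignored, and an unescaped ; is not treated as a command separator.

```python
# Characters that a backslash can escape in a non-quoted word, per the description above.
SPECIAL = set("%; \t\n")


def split_unquoted(line):
    """Split a line into non-quoted words (hypothetical model, no quoting support)."""
    words, current, in_word, i = [], "", False, 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line) and line[i + 1] in SPECIAL:
            current += line[i + 1]  # drop the backslash, keep the escaped character
            in_word = True
            i += 2
        elif c.isspace():
            if in_word:
                words.append(current)
            current, in_word = "", False
            i += 1
        else:
            current += c  # a backslash followed by anything else stays literal
            in_word = True
            i += 1
    if in_word:
        words.append(current)
    return words


print(split_unquoted(r"a\b c"))      # ['a\\b', 'c']
print(split_unquoted(r"dir\ next"))  # ['dir next']
```

The second call shows the limitation: there is no way to get a word that ends in a backslash, because the backslash always swallows the whitespace that follows it.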

The obvious solution to #1049 would be to support escaping backslashes. But this leads to escaping hell: many regexes in rc/ files already refer to a backslash using \\, which means we would have to write a literal backslash for a regex as \\\\.

What we would want

1. We want familiar, simple and easy to write interactive commands, so we want a whitespace-separated command syntax (:command <param> <param> <param>).
2. We want to be able to easily write regular expressions and Kakoune keys, as they are very common (:add-highlighter global regex \bword\b). This means we want a nice way to write backslash-heavy strings without having to escape them.
3. We want to be able to write arbitrary strings as single parameters: we want a simple algorithm to generate a string that we know is correctly escaped for any content.
   - This means in particular that we cannot rely on checking for delimiters that are not in the string, as we can always imagine a string containing all possible delimiters.
4. We want to easily write nested strings, as Kakoune follows tcl model of list of commands as strings. This is pretty much solved with %...{...} strings.
5. We would prefer not to have to change every existing Kakoune script to fix their strings. Keeping close to existing practice would avoid having to change too much.
6. We would prefer to be close to existing well-established practice in the Unix world.
7. We would prefer to have a consistent quoting behaviour across the different string types.

Possible directions

- Single quoted strings could be made to work as they do in the shell: no escaping supported, no way to have a single quote inside a single quoted string. This would solve 2. for any string not containing single quotes.
- We could make use of doubling up as a quoting mechanism. For example, single quoted strings could contain single quotes, with '' representing a single quote inside a single quoted string. This would solve 2. and 3. but would violate 6. and 7.
- For double quoted strings, we could use doubling up as well, but while it would work fine for the " character, ~escaping % by doubling up would interact badly with the explicit reparsing syntax %%. The explicit reparsing syntax is not yet in master, so it could be changed to something else (what ?).~
- For single words, we could not use doubling up to escape, as there are no delimiters, and doubling up whitespace would be confusing. We could remove support for escaping whitespace and require the use of quoting for words containing whitespace. We would still need to support escaping % and ;, which could be done by doubling up. We also need a way to escape the end of line, which cannot be done by doubling up.
- Nestable % strings need not change, I think they proved to be pretty robust.
- Non nestable % strings could use doubling up as well.

This means we would generally use doubling up as an escape mechanism in Kakoune command line. This would be on purpose inconsistent with the shell and regular expressions, as it would avoid bad quoting interactions between the two.
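As an illustration of why doubling up gives a simple algorithm for quoting any content, here is a minimal Python sketch. The names quote and unquote are made up for the example and are not Kakoune functions; unquote assumes its input was produced by quote.

```python
def quote(text, delimiter="'"):
    """Wrap text in delimiter, doubling every delimiter found inside."""
    return delimiter + text.replace(delimiter, delimiter * 2) + delimiter


def unquote(quoted):
    """Inverse of quote(): strip the outer delimiters and undouble the rest."""
    delimiter = quoted[0]
    if len(quoted) < 2 or quoted[-1] != delimiter:
        raise ValueError("unterminated string")
    return quoted[1:-1].replace(delimiter * 2, delimiter)


print(quote("it's"))          # 'it''s'
print(quote("ends with \\"))  # 'ends with \'
print(unquote(quote("a 'b' \\ % ;")) == "a 'b' \\ % ;")  # True
```

No character other than the delimiter needs any treatment, so backslashes (and therefore regexes) pass through untouched, and a string may end with a backslash.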

For end of line escaping (line continuation), we could make an exception and use a backslash there.

Additional questions

Should we try to follow the Shell and support non_quoted="Quoted Part" ? I think it is not necessary and breaks use of " and ' as keys.
Should we support escaping % in non quoted words elsewhere than at the first character ?
Should we mandate that ; as an end of command separator be separated as its own word (require a whitespace before it ?). This would avoid needing to escape it.
~What syntax should we use for explicit reparsing if we want to keep doubling up for escaping ?~

What are your opinions on that ?

@danr @lenormf @ekie @Screwtapello @alexherbo2 @occivink @Franciman @Delapouite @casimir please provide feedback, and please ping whoever else I forgot.

as Kakoune follows tcl model of list of commands as strings

I'll try to familiarize myself with tcl, this page seems a good intro: https://www.tcl.tk/man/tcl8.5/tutorial/tcltutorial.html with a few relevant chapters

I'm not a fan of doubling-as-escaping, because it's pretty alien to Unix convention. The most common examples I can think of are Microsoft Excel CSV files, Visual Basic, and SQL... none of which are really classic Unix.

If I could wave a magic wand and change how Kakoune's quoting works, I'd try:

- Initially, words are whitespace delimited (requirement 1)
- Outside of a quotation, backslash-escaping works like this:
  - \" is interpreted as "
  - \' is interpreted as '
  - \% is interpreted as %
  - \ followed by a whitespace character is interpreted as that whitespace character
  - \\ is interpreted as \
  - \ followed by any other character is interpreted as that character pair, so \b stays as \b
  - this means you still need \\\\ if you want a regex that matches a single backslash, but other regexes should work without fuss
  - in my experience, people tend to try a thing and then sprinkle backslashes until it works, so this will probably Do What You Want until somebody tries to determine the exact escaping behaviour through experimenting, in which case it might take them a while. If we stick the above list in the documentation, hopefully people will find it and understand it
  - This meets requirements 2 and 3, hopefully 5
- We can leave %{} quoting alone, it seems fine (requirement 4)
- Because Kakoune-script is so heavily based around rewriting strings, and the best-known language built on that idea is Bourne Shell (sorry, TCL), I say quoting characters in the middle of a word should follow shell precedent and start a quoted section, and the contents of the quote will be joined to the part outside the quote (requirement 6)
- single and double quotes can use the same quoting behaviour, which doesn't follow Shell or C precedent, but does follow, say, Python (requirement 7)
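The backslash rules in that list can be written down in a few lines of Python. This is only a sketch of the proposal as stated, applied to the text of a single unquoted word; the name unescape is invented for the example.

```python
import string

# Characters that lose their backslash under the proposed rules.
ESCAPABLE = set("\"'%\\") | set(string.whitespace)


def unescape(word):
    """Apply the proposed backslash rules to the text of one unquoted word."""
    out, i = [], 0
    while i < len(word):
        c = word[i]
        if c == "\\" and i + 1 < len(word) and word[i + 1] in ESCAPABLE:
            out.append(word[i + 1])  # \" \' \% \<space> and \\ drop the backslash
            i += 2
        else:
            out.append(c)  # \b and similar pairs are kept as two characters
            i += 1
    return "".join(out)


print(unescape(r"\bword\b"))  # \bword\b
print(unescape(r"\\\\"))      # \\
print(unescape(r"a\ b\%"))    # a b%
```

The second call is the case mentioned above: four backslashes in the command line become the two-backslash regex that matches a single backslash.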

A thought on reparsed expansions: shell only has reparsed expansions, but it also has quoting as an orthogonal feature, and if you combine them you can effectively get a non-reparsed expansion. If Kakoune explicitly has both parsed and reparsed expansions, then how do they interact with quoting?

if "foo%%{bar baz}qux" = "foobar bazqux" (i.e. quotes win like in shell) then % versus %% can only make a difference at the very top level, which seems bad considering how much Kakoune code is wrapped in deeply-nested quotations.
if "foo%%{bar baz}qux" = "foobar" "bazqux" (i.e. reparsing wins) then a deeply-nested reparsed expansion should break all the quotations that wrap it, which also seems bad.

What would %%{%{foo%%{bar baz}qux}} produce?

Another thought about quotation characters in unquoted words: what should we do if we're reading a word that didn't start with a quoting character, but we find a quoting character inside it? That is, an example like the ones above (foo"bar baz"qux):

C says "syntax error!", which is at least unambiguous and easy to learn, and could be given a helpful error message
Shell says "just part of the same word" (producing "foobar bazqux"), which means you can opt into quoting just the tricky bit of your string, but may be difficult to implement (reading the bash(1) docs about how expansion works gives me a headache)
Kakoune currently says "just ordinary characters" (producing "foo\"bar" "baz\"qux") which seems inconsistent, especially for people expecting shell semantics, and tends to produce errors that do not easily guide people to a correct understanding of what's going on
Another alternative is to say that a quoting character always begins a new word, even if not preceded by whitespace (producing "foo" "bar baz" "qux"), which should be easy to implement and give slightly better errors (failing with "too many parameters" rather than passing weird values), but does not have much precedent
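Three of these behaviours are easy to compare side by side in Python. The shell case uses the standard shlex module; the other two are rough stand-ins written for this example (plain whitespace splitting for the current behaviour, and a hypothetical regex tokenizer that only knows about double quotes for the "new word" alternative).

```python
import re
import shlex

line = 'foo"bar baz"qux'

# Shell precedent: the quoted part is joined to the rest of the word.
shell_style = shlex.split(line)

# Current behaviour as described above: the quotes are ordinary characters.
literal_style = line.split()

# "A quote always begins a new word" (hypothetical tokenizer, double quotes only).
new_word_style = [t.strip('"') for t in re.findall(r'"[^"]*"|[^\s"]+', line)]

print(shell_style)     # ['foobar bazqux']
print(literal_style)   # ['foo"bar', 'baz"qux']
print(new_word_style)  # ['foo', 'bar baz', 'qux']
```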

Hello,

Thanks a lot for your input.

I'm not a fan of doubling-as-escaping, because it's pretty alien to Unix convention. The most common examples I can think of are Microsoft Excel CSV files, Visual Basic, and SQL... none of which are really classic Unix.

That is not entirely true: there is a very Unix use of doubling as escaping in printf, where % is escaped as %%. I do agree this is not the most common style of escaping in the Unix world, but it is still used in a very well known utility, and likely for the same reason I am considering it for Kakoune: it avoids bad interactions with backslash-based quoting.
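For reference, the same convention is visible in Python's %-formatting, which follows the C printf rules for the percent sign. A small sketch (the variable names are only for the example):

```python
# %% stands for a literal percent sign; backslashes are not involved at all.
message = "%d%% done" % 50
path_line = "path: %s, 100%%" % r"C:\tmp"

print(message)    # 50% done
print(path_line)  # path: C:\tmp, 100%
```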

in my experience, people tend to try a thing and then sprinkle backslashes until it works, so this will probably Do What You Want until somebody tries to determine the exact escaping behaviour through experimenting, in which case it might take them a while. If we stick the above list in the documentation, hopefully people will find it and understand it

This still makes regular expressions unreadable: one would have to manually unescape the string to understand what it means, and know the details of Kakoune escaping to get it right. Consider the following common regex in rc/ files: (?<!\\)(?:\\\\)*" which means a " not preceded by an odd number of backslashes (yeah, this regex detects quoted strings). Now this is what it would look like if we needed to escape backslashes: (?<!\\\\)(?:\\\\\\\\)*" This is much harder to read IMHO; counting 4 backslashes seems to be the limit after which it's just many backslashes. Imagine we have a \b in there: wouldn't it be surprising that the \ in that \b is a literal backslash when the other ones need to be escaped?
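The transformation discussed here is mechanical, which a short Python sketch can show. The helper name escape_backslashes is made up; the regex is the one quoted above, and Python's re module is used only to check what it matches.

```python
import re


def escape_backslashes(text):
    """Double every backslash, as an escapable-backslash syntax would require."""
    return text.replace("\\", "\\\\")


regex = r'(?<!\\)(?:\\\\)*"'
escaped = escape_backslashes(regex)

print(regex)    # (?<!\\)(?:\\\\)*"
print(escaped)  # (?<!\\\\)(?:\\\\\\\\)*"

# The original regex finds a quote preceded by an even number of backslashes.
print(bool(re.search(regex, r'say \"hi')))   # False: the quote is escaped
print(bool(re.search(regex, r'say \\"hi')))  # True: the backslash itself is escaped
```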

Regarding quotes inside unquoted words, one difference between Kakoune and the shell is that we regularly write lists of keys in which the ' and " keys appear; requiring escaping there is not very convenient for interactive use.

Finally, regarding reparsed strings, reparsing is only available outside of double quoted strings: "%%{blah}" is never reparsed (well, it might be if it is an argument of :eval or :try, but it won't be reparsed immediately during command line evaluation).

Things got a tad simpler (provided I don't find an unsolvable bug while testing): explicit reparsing is now just done with evaluate-commands %sh{...} (it needed a few fixes in the evaluate-commands implementation), so there is no doubling-up conflict with %%sh{...} anymore.

I think users will benefit most from a change that takes the glue closer to what the shell provides already.

Single quoted strings could be made to work as they do in the shell: no escaping supported, no way to have a single quote inside a single quoted string. This would solve 2. for any string not containing single quotes.

Yes.

We could make use of doubling up as a quoting mechanism. For example, single quoted strings could contain single quotes, with '' representing a single quote inside a single quoted string. This would solve 2. and 3. but would violate 6. and 7.

Intuitively, I would read 'abc''def' as two strings concatenated (because that's what it would be in the shell), i.e. 'abcdef', so it's not very intuitive even if it fills requirements.

For double quoted strings, we could use doubling up as well, but while it would work fine for the " character, escaping % by doubling up would interact badly with the explicit reparsing syntax %%. The explicit reparsing syntax is not yet in master, so it could be changed to something else (what ?).

Do we even need explicit double-parsing (if I understand correctly: re-parse the entire string once the expansion has been expanded)? I don't think I ever need more than one iteration of parsing on strings, but maybe there's a case I'm missing.

For single words, we could not use doubling up to escape, as there are no delimiters, and doubling up whitespace would be confusing. We could remove support for escaping whitespace and require the use of quoting for words containing whitespace. We would still need to support escaping % and ;, which could be done by doubling up. We also need a way to escape the end of line, which cannot be done by doubling up.

C.f. reply below.

Nestable % strings need not change, I think they proved to be pretty robust.

Yes.

Non nestable % strings could use doubling up as well.

This could work.

This means we would generally use doubling up as an escape mechanism in Kakoune command line. This would be on purpose inconsistent with the shell and regular expressions, as it would avoid bad quoting interactions between the two.

As much as having a grammar compatible with both generic commands and regex would be nice, it seems that the two are conflicting, and the compromise that would eventually join up the two worlds might be a bit wonky. Maybe two sets of rules need to be set in the parser, as inelegant as that sounds?

For end of line escaping (line continuation), we could make an exception and use a backslash there.

So, just like the shell?

Should we try to follow the Shell and support non_quoted="Quoted Part" ? I think it is not necessary and breaks use of " and ' as keys.

Probably not.

Should we support escaping % in non quoted words elsewhere than at the first character ?

Probably not.

Should we mandate that ; as an end of command separator be separated as its own word (require a whitespace before it ?). This would avoid needing to escape it.

Maybe, but how would users just have a single character word with ; in it? Seems like the problem is the same.

What syntax should we use for explicit reparsing if we want to keep doubling up for escaping ?

No idea, I don't really know what kind of grammar would be set if all my answers above had to be taken into account, if it would make sense or not.

With regard to 3 I would like to point to sed. Most people know the s/x/y/z, but fewer people know that the / is not defined by sed. Instead it uses the first character following the initial s. That means you could just as well write s|x|y|z or s_x_y_z.

This effectively circumvents the problem of finding a one-size-fits-all delimiter, as the choice is simply delegated to the user. As an example, you could express the regex for a " not preceded by an odd number of backslashes as %re_(?<!\\)(?:\\\\)*"_ using _ as a delimiter, or any other character you see fit.

@drwilly We do support custom delimiters with % strings: %~blah~ uses ~ as delimiter. Unfortunately, as noted under the 3rd point, it does not fully solve the problem, as we still cannot use that for arbitrary strings (the string that contains all punctuation Unicode codepoints, for example, cannot be quoted with a custom delimiter, as they are all used in the string). It is still a very nice feature for interactive use/scripting, but not for containing arbitrary strings like a completion list generated by an external program.
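A small Python sketch of that limitation, restricted to ASCII for brevity. The function pick_delimiter and its candidate list are assumptions made for the example, not anything Kakoune provides.

```python
import string


def pick_delimiter(text, candidates="~|_/!,:@#"):
    """Return the first candidate delimiter absent from text, or None if all are used."""
    for c in candidates:
        if c not in text:
            return c
    return None


print(pick_delimiter(r'(?<!\\)(?:\\\\)*"'))  # ~
print(pick_delimiter("a~b|c"))               # _
print(pick_delimiter(string.punctuation))    # None
```

Picking a free delimiter works well for hand-written strings, but a generator that must quote whatever an external program outputs has to handle the None case somehow, which is where an escaping rule is still needed.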

@lenormf

Do we even need explicit double-parsing (if I understand correctly: re-parse the entire string once the expansion has been expanded)? I don't think I ever need more than one iteration of parsing on strings, but maybe there's a case I'm missing.

Not anymore, the new direction is to just explicitly call evaluate-commands %sh{...} to get reparsing. I did the change on the explicit-reparsing branch and it looks pretty clean in scripts.

As much as having a grammar compatible with both generic commands and regex would be nice, it seems that the two are conflicting, and the compromise that would eventually join up the two worlds might be a bit wonky. Maybe two sets of rules need to be set in the parser, as inelegant as that sounds?

A big part of existing scripts is defining regexes for highlighting: I estimate that (roughly) 1/6th of the lines of code in the bundled scripts deal with a regex. This is an important use case that we need to cater for.

I am not sure we really need to follow the shell with 'abc''def' == 'abcdef'. It makes sense for the shell, where my_var="my value with spaces" is a common pattern, but we have no such thing in Kakoune (except for ui_options, but that's a very specific case).

Now that %% reparsing is gone, I think doubling-up solves a lot of problems, and while it breaks constraint 6. (slightly, doubling up is not without precedents as with printf), it allows for consistent escaping in command line (but is inconsistent with escaping in regex, which is necessary to avoid escaping hell). I agree it is less familiar, but % strings are not familiar at all, and yet we rely a lot on them.

That said, my mind is not made up yet, I am happy to get feedback and/or alternate suggestions.

@drwilly made me think... What if the escape character was not \ but a different one? (E.g. , ) And it should be escapable too. This breaks the scripts, but allows us to keep writing unescaped \, which is so ubiquitous in shell and regex. This way we differentiate the two levels of escaping: at each level we use an escape character that does not mean anything special at the other levels, thus avoiding escape hell, I guess. Although, this is not so natural...

@mawww about your commit message. I don't get if %~...~ is (1) a special case of %-string that is considered a quoted string or if (2) any %-string can be either a quoted or a balanced string.

(1) -> why only this one and why quotes (simple and doubles) aren't enough?
(2) -> I find it hard to reason about the difference between the two, it's like two syntaxes for the same behavior. If so, why have two syntaxes?

Either way the change about nested delimiters in %-strings is very welcome.

Also about non quoted words, just to be sure I understand, is this the correct result: \%a\b\\c\\ a == %a\b\\c\ a?

A quoted string can be a % followed by a grouping character ((, [ or {), in which case the string lasts until the matching grouping character, and no escaping is possible.

A quoted string can also be a ", ', or a % followed by any non-grouping character (%~blah~ is the example @mawww usually uses), in which case the string lasts until the next occurrence of that character, but nested occurrences can be escaped by doubling them.

All the following examples mean the same thing:

%{foo"bar}
%Xfoo"barX
'foo"bar'
"foo""bar" (since " is the quote character here, it needs to be escaped in the string)

The reason that Kakoune supports so many kinds of quoting is because escaping is tedious, so Kakoune makes it easy to find a string delimiter that doesn't appear in the string you're quoting.
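The two rules described above (balanced strings with no escaping, and quoted strings with doubling) fit in a short Python model. This is a hypothetical sketch for checking that the four spellings listed earlier agree, not the real parser; parse_string is an invented name, it takes exactly one string token, and it also treats < as a grouping character.

```python
PAIRS = {"{": "}", "[": "]", "(": ")", "<": ">"}


def parse_string(src):
    """Return the content of one quoted or %-string (hypothetical model)."""
    if src[0] == "%":
        opener, body = src[1], src[2:]
    elif src[0] in "\"'":
        opener, body = src[0], src[1:]
    else:
        raise ValueError("not a quoted string")
    if src[0] == "%" and opener in PAIRS:
        # Balanced string: runs to the matching closer, no escaping possible.
        closer, depth = PAIRS[opener], 1
        for i, c in enumerate(body):
            depth += (c == opener) - (c == closer)
            if depth == 0:
                return body[:i]
        raise ValueError("unbalanced string")
    # Quoted string: a doubled delimiter stands for one, a single one ends it.
    out, i = [], 0
    while i < len(body):
        if body[i] == opener:
            if body[i + 1:i + 2] != opener:
                return "".join(out)
            i += 1  # skip the first half of a doubled delimiter
        out.append(body[i])
        i += 1
    raise ValueError("unterminated string")


for spelling in ['%{foo"bar}', '%Xfoo"barX', "'foo\"bar'", '"foo""bar"']:
    print(parse_string(spelling))  # foo"bar each time
```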

To paraphrase what @Screwtapello said, we support '...' strings, "..." strings and %{...} strings. %{...} strings have always supported different delimiters, such as %[...] or %<...>, but at some point it was decided to add support for non balanced delimiters as well. As we could not use the same closes-at-the-balanced-matching delimiter for those, those %~...~ strings (with ~ any punctuation character not in []{}<>()) have used the same escaping logic as '...' and "..." strings (which, up to now, has been some kind of ad-hoc escaping, with a backslash).

The goal of this change is to have well-defined, robust, and practical escaping. The previous system was generally practical (the existing kak scripts show that it worked well enough), but ill-defined (I bet almost nobody understands clearly how it works; did you know that %{abc}def parses as two separate words ('abc' and 'def')?), and not robust (we cannot express arbitrary strings, quoted strings cannot end with a backslash, and balanced strings require their content to be balanced on their delimiter).

Your understanding of non-quoted-words is correct.

Hopefully this clears up why we have those different syntaxes and how they relate to each other.

My understanding was that grouping characters and punctuation were the same thing.

@Screwtapello @mawww thanks for the explanations, it's not crystal clear yet but time will come.

mawww / kakoune

[RFC] command line parsing overhaul #2046
