mawww / kakoune

mawww's experiment for a better code editor
http://kakoune.org
The Unlicense

[RFC] command line parsing overhaul #2046

Closed mawww closed 6 years ago

mawww commented 6 years ago

Hello,

As a follow-up to #2029, and in order to fix the long-standing parsing bugs such as #1049 or #1612, I would like to clean up Kakoune's command line parsing.

The current state

Our current command line parsing is very ad hoc, and works as follows:

All those string types can only happen at the beginning of a word, a %, " or ' appearing in the middle of a word is considered literally.

The obvious solution to #1049 would be to support escaping backslashes, but this leads to escaping hell. We have many regexes in rc/ files that already refer to a backslash using \\, which means we would have to write a literal backslash in a regex as \\\\.

What we would want

  1. We want familiar, simple and easy to write interactive commands, so we want a whitespace separated command syntax (:command <param> <param> <param>)

  2. We want to be able to easily write regular expressions and Kakoune keys, as they are very common (:add-highlighter global regex \bword\b). This means we want a nice way to write backslash-heavy strings without having to escape them.

  3. We want to be able to write arbitrary strings as single parameters: we want a simple algorithm to generate a string that we know is correctly escaped for any content.

    • This means in particular that we cannot rely on checking for delimiters that are not in the string, as we can always imagine a string containing all possible delimiters.
  4. We want to easily write nested strings, as Kakoune follows the Tcl model of a list of commands as strings. This is pretty much solved with %...{...} strings.

  5. We would prefer not to have to change every existing Kakoune script to fix their strings. Keeping close to existing practice would avoid having to change too much.

  6. We would prefer to be close to existing well established practice in the unix world.

  7. We would prefer to have a consistent quoting behaviour with the different strings.

Possible directions

This means we would generally use doubling up as the escape mechanism in the Kakoune command line. This would be deliberately inconsistent with the shell and regular expressions, as it would avoid bad quoting interactions between the two.

For end of line escaping (line continuation), we could make an exception and use a backslash there.
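To make the doubling-up idea concrete, here is a minimal Python sketch of quoting and unquoting by doubling the delimiter. The function names `quote` and `unquote` are hypothetical and only illustrate the proposal; this is not Kakoune's implementation.

```python
def quote(text: str, delimiter: str = "'") -> str:
    """Wrap text in delimiter, doubling every delimiter found inside it."""
    return delimiter + text.replace(delimiter, delimiter * 2) + delimiter


def unquote(quoted: str) -> str:
    """Inverse of quote(): strip the outer delimiters, collapse doubled ones.

    Assumes the input was produced by quote(), so every delimiter inside
    the body appears as a doubled pair.
    """
    if len(quoted) < 2 or quoted[0] != quoted[-1]:
        raise ValueError("not a quoted string")
    delimiter = quoted[0]
    return quoted[1:-1].replace(delimiter * 2, delimiter)


samples = ["it's", "'", "", r"\bword\b", "a''b"]
round_trips = [unquote(quote(sample)) == sample for sample in samples]
```

Because the only character that needs special treatment is the delimiter itself, backslashes pass through untouched, which is the property point 2 asks for, and any content can be quoted, which is what point 3 asks for.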

Additional questions

What are your opinions on that?

mawww commented 6 years ago

@danr @lenormf @ekie @Screwtapello @alexherbo2 @occivink @Franciman @Delapouite @casimir please provide feedback, and please ping whoever else I forgot.

Delapouite commented 6 years ago

as Kakoune follows the Tcl model of a list of commands as strings

I'll try to familiarize myself with Tcl. This page seems to be a good intro, with a few relevant chapters: https://www.tcl.tk/man/tcl8.5/tutorial/tcltutorial.html

Screwtapello commented 6 years ago

I'm not a fan of doubling-as-escaping, because it's pretty alien to Unix convention. The most common examples I can think of are Microsoft Excel CSV files, Visual Basic, and SQL... none of which are really classic Unix.

If I could wave a magic wand and change how Kakoune's quoting works, I'd try:

A thought on reparsed expansions: shell only has reparsed expansions, but it also has quoting as an orthogonal feature, and if you combine them you can effectively get a non-reparsed expansion. If Kakoune explicitly has both parsed and reparsed expansions, then how do they interact with quoting?

What would %%{%{foo%%{bar baz}qux}} produce?

Another thought about quotation characters in unquoted words: what should we do if we're reading a word that didn't start with a quoting character, but we find a quoting character inside it? That is, an example like the ones above (foo"bar baz"qux):

mawww commented 6 years ago

Hello,

Thanks a lot for your input.

I'm not a fan of doubling-as-escaping, because it's pretty alien to Unix convention. The most common examples I can think of are Microsoft Excel CSV files, Visual Basic, and SQL... none of which are really classic Unix.

That is not entirely true: we have a very Unix use of doubling as escaping in printf, where % is escaped as %%. I do agree this is not the most common style of escaping in the Unix world, but it is still used in a very well-known utility, and likely for the same reason I am considering it for Kakoune: it avoids bad interactions with backslash-based quoting.
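As a small illustration of that precedent, Python's printf-style formatting follows the same convention as C's printf, so it can stand in for it here. The variable names are only for the example.

```python
percentage = 42

# In a printf-style format string, %d is a conversion and %% is a literal %.
message = "progress: %d%%" % percentage

# A backslash in the same format string needs no special treatment from the
# formatter, because % is the only character it interprets.
pattern = r"\b%s\b" % "word"
```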

in my experience, people tend to try a thing and then sprinkle backslashes until it works, so this will probably Do What You Want until somebody tries to determine the exact escaping behaviour through experimenting, in which case it might take them a while. If we stick the above list in the documentation, hopefully people will find it and understand it

This still makes regular expressions unreadable: one would have to manually unescape the string to understand what it means, and know the details of Kakoune escaping to get it right. Consider the following common regex in rc/ files: (?<!\\)(?:\\\\)*" which means a " not preceded by an odd number of backslashes (yeah, this regex detects quoted strings). Now this is what it would look like if we needed to escape backslashes: (?<!\\\\)(?:\\\\\\\\)*" This is much harder to read IMHO; counting 4 backslashes seems to be the limit after which it's just many backslashes. Imagine we have a \b in there: wouldn't it be surprising that the \ in that \b is a literal backslash when the other ones need to be escaped?
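The growth in backslashes can be checked mechanically. The sketch below applies a hypothetical "escape every backslash" rule (the helper name `escape_backslashes` is made up for the example) to the regex quoted above and counts the result.

```python
# The regex from rc/ files quoted above, written as a Python raw string.
REGEX = r'(?<!\\)(?:\\\\)*"'


def escape_backslashes(text: str) -> str:
    """Double every backslash, as a backslash-escaping command line would require."""
    return text.replace("\\", "\\\\")


escaped = escape_backslashes(REGEX)
backslashes_before = REGEX.count("\\")
backslashes_after = escaped.count("\\")
```

Each additional layer of backslash escaping doubles the count again, which is the "escaping hell" this proposal tries to avoid.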

Regarding quotes inside unquoted words, one difference between Kakoune and the shell is that we regularly write lists of keys in which the ' and " keys appear. Requiring escaping there is not very convenient for interactive use.

Finally, regarding reparsed strings, reparsing is only available outside of double quoted strings: "%%{blah}" is never reparsed (well, it might be if it is an argument of :eval or :try, but it won't be reparsed immediately during command line evaluation).

mawww commented 6 years ago

Things got a tad simpler (provided I don't find an unsolvable bug while testing): explicit reparsing is now done by just using evaluate-commands %sh{...} (this needed a few fixes in the evaluate-commands implementation), so there is no longer a doubling-up conflict with %%sh{...}.

lenormf commented 6 years ago

I think users will benefit most from a change that takes the glue closer to what the shell provides already.

Single quoted strings could be made to work as they do in the shell: no escaping supported, no way to have a single quote inside a single quoted string. This would solve 2. for any string not containing single quotes.

Yes.

We could make use of doubling up as a quoting mechanism. For example, single quoted strings could contain single quotes, with '' representing a single quote inside a single quoted string. This would solve 2. and 3. but would violate 6. and 7.

Intuitively, I would read 'abc''def' as two strings concatenated (because that's what it would be in the shell), i.e. 'abcdef', so it's not very intuitive even if it fills requirements.
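The two readings of 'abc''def' can be compared directly. Python's standard `shlex` module splits in POSIX mode by default, so it can stand in for the shell here; `doubling_reading` is a hypothetical helper that applies the proposed doubling-up rule.

```python
import shlex


def shell_reading(word: str) -> list:
    """Split the way a POSIX shell would: adjacent quoted parts are concatenated."""
    return shlex.split(word)


def doubling_reading(word: str) -> str:
    """Read a single-quoted word under the proposed rule: '' means one quote.

    Assumes word starts and ends with a single quote.
    """
    return word[1:-1].replace("''", "'")


as_shell = shell_reading("'abc''def'")
as_doubling = doubling_reading("'abc''def'")
```

The shell reading yields a single word with no quote in it, while the doubling-up reading keeps a literal quote between the two halves.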

For double quoted strings, we could use doubling up as well, but while it would work fine for the " character, escaping % by doubling up would interact badly with the explicit reparsing syntax %%. The explicit reparsing syntax is not yet in master, so it could be changed to something else (what?).

Do we even need explicit double-parsing (if I understand correctly: re-parse the entire string once the expansion has been expanded)? I don't think I ever need more than one iteration of parsing on strings, but maybe there's a case I'm missing.

For single words, we could not use doubling up to escape, as there are no delimiters, and doubling up whitespace would be confusing. We could remove support for escaping whitespace and require the use of quoting for words containing whitespace. We would still need to support escaping % and ;, which could be done by doubling up. We also need a way to escape end of line, which cannot.

C.f. reply below.

Nestable % strings need not change, I think they proved to be pretty robust.

Yes.

Non nestable % string could use doubling up as well.

This could work.

This means we would generally use doubling up as the escape mechanism in the Kakoune command line. This would be deliberately inconsistent with the shell and regular expressions, as it would avoid bad quoting interactions between the two.

As much as having a grammar compatible with both generic commands and regex would be nice, it seems that the two are conflicting, and the compromise that would eventually join up the two worlds might be a bit wonky. Maybe two sets of rules need to be set in the parser, as inelegant as that sounds?

For end of line escaping (line continuation), we could make an exception and use a backslash there.

So, just like the shell?

Should we try to follow the shell and support non_quoted="Quoted Part"? I think it is not necessary and breaks the use of " and ' as keys.

Probably not.

Should we support escaping % in non quoted words elsewhere than at the first character?

Probably not.

Should we mandate that ; as an end of command separator be its own word (i.e. require whitespace before it)? This would avoid needing to escape it.

Maybe, but how would users just have a single character word with ; in it? Seems like the problem is the same.

What syntax should we use for explicit reparsing if we want to keep doubling up for escaping?

No idea, I don't really know what kind of grammar would be set if all my answers above had to be taken into account, if it would make sense or not.

drwilly commented 6 years ago

With regard to 3, I would like to point to sed. Most people know the s/x/y/z form, but fewer people know that the / is not fixed by sed. Instead, sed uses the first character following the initial s. That means you could just as well write s|x|y|z or s_x_y_z.

This effectively circumvents the problem of finding a one-size-fits-all delimiter, as the choice is simply delegated to the user. As an example, you could express the "a " not preceded by an odd number of backslashes" regex as %re_(?<!\\)(?:\\\\)*"_ using _ as a delimiter, or any other character you see fit.

mawww commented 6 years ago

@drwilly We do support custom delimiters with % strings: %~blah~ uses ~ as its delimiter. Unfortunately, as added under the 3rd point, this does not fully solve the problem, as we still cannot use it for arbitrary strings (a string that contains all punctuation Unicode codepoints, for example, cannot be quoted with a custom delimiter, as they are all used in the string). It is still a very nice feature for interactive use and scripting, but not for holding arbitrary strings such as a completion list generated by an external program.
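The limitation can be shown with a short sketch. `pick_delimiter` and its default candidate set are hypothetical and chosen only for the example; they are not the set of delimiters Kakoune accepts.

```python
from typing import Optional


def pick_delimiter(text: str, candidates: str = "~|_/,:") -> Optional[str]:
    """Return the first candidate that does not occur in text, or None."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    return None


# Works for typical hand-written content such as the regex discussed above.
for_regex = pick_delimiter(r'(?<!\\)(?:\\\\)*"')

# Fails as soon as the content uses every candidate, which generated content
# (for example a completion list from an external program) is free to do.
for_worst_case = pick_delimiter("~|_/,:")
```

Enlarging the candidate set only moves the problem: there is always some string that contains every candidate, so delimiter selection alone cannot satisfy point 3.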

@lenormf

Do we even need explicit double-parsing (if I understand correctly: re-parse the entire string once the expansion has been expanded)? I don't think I ever need more than one iteration of parsing on strings, but maybe there's a case I'm missing.

Not anymore. The new direction is to just explicitly call evaluate-commands %sh{...} to get reparsing. I made the change on the explicit-reparsing branch and it looks pretty clean in scripts.

As much as having a grammar compatible with both generic commands and regex would be nice, it seems that the two are conflicting, and the compromise that would eventually join up the two worlds might be a bit wonky. Maybe two sets of rules need to be set in the parser, as inelegant as that sounds?

A big part of existing scripts is defining regexes for highlighting. I estimate (roughly) that 1/6th of the lines of code in the bundled scripts deal with a regex, so this is an important use case that we need to cater for.

I am not sure we really need to follow the shell with 'abc''def' == 'abcdef'. It makes sense for the shell, where my_var="my value with spaces" is a common pattern, but we have no such thing in Kakoune (except for ui_options, but that's a very specific case).

Now that %% reparsing is gone, I think doubling up solves a lot of problems, and while it breaks constraint 6. (slightly, as doubling up is not without precedent, as with printf), it allows for consistent escaping in the command line (but is inconsistent with escaping in regex, which is necessary to avoid escaping hell). I agree it is less familiar, but % strings are not familiar at all, and yet we rely a lot on them.

That said, my mind is not made up yet, I am happy to get feedback and/or alternate suggestions.

Franciman commented 6 years ago

@drwilly made me think... What if the escape character was not \ but a different one? (E.g. , ) And it should be escapable too. This breaks the scripts, but allows us to keep writing unescaped \, which is so ubiquitous in shell and regex. This way we differentiate the two levels of escaping: at each level we use an escape character that does not mean anything special at the other levels, thus avoiding escape hell, I guess. Although, this is not so natural...

casimir commented 6 years ago

@mawww about your commit message: I don't get whether %~...~ is (1) a special case of %-string that is considered a quoted string or (2) any %-string can be either a quoted or a balanced string.

Either way the change about nested delimiters in %-strings is very welcome.

Also about non quoted words, just to be sure I understand, is this the correct result: \%a\b\\c\\ a == %a\b\\c\ a?

Screwtapello commented 6 years ago

A quoted string can be a % followed by a grouping character ((, [ or {), in which case the string lasts until the matching grouping character, and no escaping is possible.

A quoted string can also be a ", ', or a % followed by any non-grouping character (%~blah~ is the example @mawww usually uses), in which case the string lasts until the next occurrence of that character, but nested occurrences can be escaped by doubling them.

All the following examples mean the same thing:

The reason Kakoune supports so many kinds of quoting is that escaping is tedious, so Kakoune makes it easy to find a string delimiter that doesn't appear in the string you're quoting.
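The two rules described above can be sketched as a small Python function. This is a simplified illustration, not Kakoune's parser: `parse_percent_string` is a hypothetical name, it handles only the three grouping characters listed above, it ignores any type name between the % and the delimiter (as in %sh{...}), and it ignores whatever follows the closing delimiter.

```python
# Grouping characters and their matching closers, as listed above.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def parse_percent_string(source: str) -> str:
    """Return the content of the %-string that starts at the beginning of source."""
    if len(source) < 2 or source[0] != "%":
        raise ValueError("not a %-string")
    opener = source[1]

    if opener in PAIRS:
        # Balanced string: track nesting depth, no escaping is possible.
        closer = PAIRS[opener]
        depth = 1
        for index in range(2, len(source)):
            if source[index] == opener:
                depth += 1
            elif source[index] == closer:
                depth -= 1
                if depth == 0:
                    return source[2:index]
        raise ValueError("unterminated balanced string")

    # Non-grouping delimiter: the string ends at the next delimiter,
    # unless that delimiter is doubled, which stands for one literal copy.
    content = []
    index = 2
    while index < len(source):
        char = source[index]
        if char == opener:
            if index + 1 < len(source) and source[index + 1] == opener:
                content.append(opener)
                index += 2
                continue
            return "".join(content)
        content.append(char)
        index += 1
    raise ValueError("unterminated quoted string")
```

Under these assumptions, %{a {b} c} yields a {b} c through nesting, and %~a~~b~ yields a~b through doubling.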

mawww commented 6 years ago

To paraphrase what @Screwtapello said, we support '...' strings, "..." strings and %{...} strings. %{...} strings have always supported different delimiters, such as %[...] or %<...>, but at some point it was decided to add support for non-balanced delimiters as well. As we could not use the same closes-at-the-balanced-matching-delimiter logic for those, these %~...~ strings (with ~ being any punctuation character not in []{}<>()) have used the same escaping logic as '...' and "..." strings (which, up to now, has been some kind of ad hoc escaping with a backslash).

The goal of this change is to have well-defined, robust, and practical escaping. The previous system was generally practical (the existing kak scripts show that it worked well enough), but ill-defined (I bet almost nobody clearly understands how it works; did you know that %{abc}def parses as two separate words, 'abc' and 'def'?), and not robust (we cannot express arbitrary strings, quoted strings cannot end with a backslash, and balanced strings require their content to be balanced on their delimiter).

Your understanding of non-quoted-words is correct.

Hopefully this clears up why we have these different syntaxes and how they relate to each other.
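For the non-quoted word example above (\%a\b\\c\\ a giving %a\b\\c\ a), one rule that reproduces the result is: a backslash is dropped only when it is directly followed by one of a few special characters, and is literal everywhere else. The sketch below encodes that reading. The set of escapable characters is an assumption inferred from this thread, not taken from the implementation, and `unescape_word` is a hypothetical name.

```python
# Assumed set of characters that a backslash can escape in a non-quoted word.
ESCAPABLE = {"%", ";", " ", "\t"}


def unescape_word(word: str) -> str:
    """Drop a backslash only when it precedes an escapable character."""
    result = []
    index = 0
    while index < len(word):
        char = word[index]
        if char == "\\" and index + 1 < len(word) and word[index + 1] in ESCAPABLE:
            # The backslash is consumed and the escaped character is kept.
            result.append(word[index + 1])
            index += 2
        else:
            # Any other backslash, including one before another backslash,
            # stays literal, so regex escapes such as \b survive untouched.
            result.append(char)
            index += 1
    return "".join(result)


example = unescape_word(r"\%a\b\\c\\ a")
```

Note that under this reading a backslash cannot itself be escaped, which is why \\c is left unchanged.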

casimir commented 6 years ago

My understanding was that grouping characters and punctuation were the same thing.

@Screwtapello @mawww thanks for the explanations. It's not crystal clear yet, but it will come with time.