mawww / kakoune

mawww's experiment for a better code editor
http://kakoune.org
The Unlicense
10.01k stars 714 forks source link

Add a kakquote function to the prelude of shell blocks #3340

Open Screwtapello opened 4 years ago

Screwtapello commented 4 years ago

Since Kakoune's quoting system was reworked, it's pretty easy to reliably quote Kakoune strings by just doubling apostrophes. In shell, it looks something like this:

kakquote() {
    printf "'"
    printf "$*" | sed "s/'/''/g"
    printf "'"
}

However, while working on #3336, I've copied and pasted this fragment into three or four shell blocks already; I expect it (or something like it) is already present in a bunch of other scripts too.

We already inject environment variables into shell blocks; how about we prepend this helpful and near-universally-useful function as well?

(it should be possible to implement this in pure POSIX shell without sed, but it would be too complex. If avoiding a fork turns out to be a performance improvement, it'd be nice to make the change once in Kakoune's shell prelude, and not have to copy/paste the change into thousands of existing Kakoune scripts)

lenormf commented 4 years ago

Related #1224 #1430

alexherbo2 commented 4 years ago

@Screwtapello How about quoting each argument, instead of splatting?

kak_escape() {
  for argument do
    printf "'"
    printf '%s' "$argument" | sed "s/'/''/g"
    printf "'"
    printf ' '
  done
}
alexherbo2 commented 4 years ago

@Screwtapello I’m using the following pattern for my scripts.

rc/prelude.sh

kak_escape() {
  for argument do
    printf "'"
    printf '%s' "$argument" | sed "s/'/''/g"
    printf "'"
    printf ' '
  done
}

rc/chronic.kak

declare-option -hidden str chronic_path %sh(dirname "$kak_source")

provide-module chronic %{
  define-command chronic -params .. -docstring 'Pipe to Chronic' %{
    evaluate-commands -save-regs '|' %{
      set-register | %sh{
        . "$kak_opt_chronic_path/prelude.sh"
        kak_quoted_arguments=$(kak_escape "$@")
        printf 'chronic %s' "$kak_quoted_arguments"
      }
      execute-keys '|<ret>'
    }
  }
}

require-module chronic
Screwtapello commented 4 years ago

How about quoting each argument, instead of splatting?

I like the modularity of having kakquote do exactly one thing, so I can use a printf format string to build up the command, and be sure I've quoted each argument properly:

printf "eval -try-client %s echo -markup {Information}%s" \
    "$(kakquote "$kak_client")"
    "$(kakquote "$msg")"

...but now that I think about it, I could still do that with a quoter that quoted each argument individually. In fact, if I quoted each argument individually, I could do this:

kakquote eval -try-client "$kak_client" echo -markup {Information}"$msg"

...which is a lot nicer, although it quotes more than strictly necessary.

My one fear with processing arguments individually is that it might be a bit more expensive than quoting the whole string at once. But rather than guess wildly, I decided to write a little benchmark:

log() { echo "$*" >&2; }

test_quoter() {
    actual=$("$1" "Don't Let's Start")
    expected="'Don''t Let''s Start' "

    if [ "$actual" != "$expected" ]; then
        log "Quoter $1 failed: got $actual expected $expected"
        return 1
    fi
}

benchmark_quoter() {
    quoter="$1"
    timeout=5
    (
        trap "printf '%s\\t%5d\\n' \$quoter \$i >&2; exit" TERM
        i=0
        while true; do
            "$1" "Don't" "Let's" "Start" >/dev/null
            i=$(( i + 1 ))
        done
    ) &
    sleep "$timeout"
    kill $!
    wait
}

single_sed_quoter() {
    printf "'"
    printf "%s" "$*" | sed "s/'/''/g"
    printf "' "
}

single_flat_sed_quoter() {
    printf "'%s' " "$(printf "%s" "$*" | sed "s/'/''/g")"
}

multi_sed_quoter() {
    for each; do
        printf "'"
        printf "%s" "$each" | sed "s/'/''/g"
        printf "' "
    done
}

single_builtin_quoter() {
    text="$*"
    printf "'"
    while [ -n "$text" ]; do
        case "$text" in
            *"'"*)
                printf "%s" "${text%%\'*}"
                printf "''"
                text=${text#*\'}
                ;;
            *)
                printf "%s" "$text"
                text=""
                ;;
        esac
    done
    printf "' "
}

multi_builtin_quoter() {
    for text; do
        printf "'"
        while [ -n "$text" ]; do
            case "$text" in
                *"'"*)
                    printf "%s" "${text%%\'*}"
                    printf "''"
                    text=${text#*\'}
                    ;;
                *)
                    printf "%s" "$text"
                    text=""
                    ;;
            esac
        done
        printf "' "
    done
}

quoters="
    single_sed_quoter
    single_flat_sed_quoter
    multi_sed_quoter
    single_builtin_quoter
    multi_builtin_quoter
"

for each in $quoters; do
    test_quoter "$each" && benchmark_quoter "$each"
done

single_sed_quoter is basically my original kakquote implementation above.

single_flat_sed_quoter is the same thing, but squished into a single line, as suggested by lenormf.

multi_sed_quoter is basically alexherbo2's quote-arguments-individually implementation.

single_builtin_quoter implements quoting using only shell-builtins instead of launching sed as an external process. I know that launching external processes was frowned on back in the day, but I don't know how much of an impact it has on a modern system, so I figured I'd try it.

multi_builtin_quoter is the builtin quoting algorithm running once-per-argument (like alexherbo's suggestion).

Note that test_quoter() only tests the single-argument case, so it should work properly for both $@ and $* quoters. benchmark_quoter() provides multiple arguments, so quoters that do per-argument work should be slower.

Here's the results for whatever version of dash Debian Testing is currently using (this is an edited screenshot; each result is the best of 3 runs):

single_sed_quoter    1696
single_flat_sed_quoter   1391
multi_sed_quoter      597
single_builtin_quoter   61700
multi_builtin_quoter    42153

busybox sh is similar (again, each result is the best of 3):

single_sed_quoter    1645
single_flat_sed_quoter   1350
multi_sed_quoter      573
single_builtin_quoter   35660
multi_builtin_quoter    23224

bash is quite a bit slower (again, each result in the best of 3):

single_sed_quoter    1424
single_flat_sed_quoter    978
multi_sed_quoter      470
single_builtin_quoter   15771
multi_builtin_quoter    10567

My original single_sed_quoter function is pretty similar everywhere, interestingly enough.

single_flat_sed_quoter is a bit slower, which surprises me, but I guess it's doing an additional string concatenation in RAM instead of just writing bytes to stdout?

multi_sed_quoter invokes sed three times instead of one, and runs at about a third the speed of the single quoter. That's reasonable, but it's probably not the implementation we want in the prelude.

single_builtin_quoter is an ugly implementation, but wow is it fast. Turns out avoiding process forks is still pretty important! It ranges from 11.0 times as fast as single_sed_quoter, up to 36.4 times as fast!

multi_builtin_quoter is about two thirds of the speed of single_builtin_quoter, although I bet that depends heavily on the number and size of the arguments. Still, it doesn't slow down as much as multi_sed_quoter slows down, and it's still 7.4 to 24.9 times as fast as the fastest sed-based quoter!

Overall, I think multi_builtin_quoter is the right choice for the prelude - it's pretty fast, and it allows the very pleasant kakquote echo -markup {Error}"$msg" style. It's too ugly for me to copy/paste into every %sh{} block, so it would really benefit from being in the prelude.

lenormf commented 4 years ago

Note that printf can be a binary on some systems (I think some BSD), so relying on it being a built-in with a zero-overhead cost can backfire.

If the numbers above are the amount of times a given function has run in 5s, then here is the time spent on one iteration (first dataset only):

single_sed_quoter       2.95ms
single_flat_sed_quoter  3.59ms
multi_sed_quoter        8.37ms
single_builtin_quoter   0.08ms
multi_builtin_quoter    0.12ms

multi_sed_quoter is not an implementation being considered because it spawns way too many processes, so we can discard it.

The other implementations all take less than 5ms to run, so knowing which one is N times faster than others is not relevant, because a single iteration of those implementations represents a negligible time loss, considering we're in the shell world.

single_flat_sed_quoter looks better to me because it can be copy-pasted easily in scripts.

occivink commented 4 years ago

I'd argue that ~5ms is much too expensive for something as trivial as escaping a string. Consider that it's not uncommon to generate option lists and to have to escape each element.

OpenBSD's sh is the only platform shell that I've seen which doesn't have printf as builtin, and it would be trivial (although not obvious) for an OpenBSD user to set KAKOUNE_POSIX_SHELL to dash or any other shell. In addition, forking on windows (and on mac to a lesser degree) is much more expensive than on linux.

Screwtapello commented 4 years ago

I have some new measurements. I was pointed to @occivink's quoting function in his kakoune-snippets package, which double-escapes apostrophes for some reason, but I edited it to match the other implementations:

occivink_builtin_quoter()
{
    rest="$*"
    printf "'"
    while :; do
        beforequote="${rest%%"'"*}"
        if [ "$rest" = "$beforequote" ]; then
            printf %s "$rest"
            break
        fi
        printf "%s''" "$beforequote"
        rest="${rest#*"'"}"
    done
    printf "' "
}

I also figured out how to escape a value entirely within sed... which is not as nice as doing it entirely outside of sed, but it can fit on a single line:

single_all_sed_quoter() {
    printf "%s" "$*" | sed -e "s/'/''/g; 1s/^/'/; \$s/\$/' /"
}

multi_all_sed_quoter() {
    for arg; do
        printf "%s" "$arg" | sed -e "s/'/''/g; 1s/^/'/; \$s/\$/' /"
    done
}

Results for these new implementations, on the same laptop as before, best of 3 runs:

single_all_sed_quoter   1700
multi_all_sed_quoter      585
occivink_builtin_quoter 61013

single_all_sed_quoter performs identically to single_sed_quoter, but fits into 80 chars if you use a shorter function name, which makes it nice for non-prelude use as lenormf points out.

multi_all_sed_quoter performs identically to multi_sed_quoter.

occivink_builtin_quoter performs just a little bit slower than single_builtin_quoter, which I found interesting. When you use builtins to slice a string up to the next apostrophe and there is no apostrophe, you get back the original string rather than an empty string, so you have to treat the last chunk of text specially. occivink tested to see if the slice matched the unsliced value, while I used a case to search for the string. I honestly wasn't sure which would be faster.

Screwtapello commented 4 years ago

Apparently once you have a benchmarking harness, you can't stop coming up with new variants to benchmark. I have some new variations based on single_builtin_quoter, with the following changes:

The new versions look like this:

single_builtin_quoter_all_backslashes_ntmp() {
    text="$*"
    printf "'"
    while true; do
        case "$text" in
            *\'*)
                printf "%s''" "${text%%\'*}"
                text=${text#*\'}
                ;;
            *)
                printf "%s' " "$text"
                break
                ;;
        esac
    done
}

multi_builtin_quoter_all_backslashes_ntmp() {
    for text; do
        printf "'"
        while true; do
            case "$text" in
                *\'*)
                    printf "%s''" "${text%%\'*}"
                    text=${text#*\'}
                    ;;
                *)
                    printf "%s' " "$text"
                    break
                    ;;
            esac
        done
    done
}

(the "ntmp" in the name is short for "no test, merged printfs")

Benchmarking them (as before):

single_builtin_quoter_all_backslashes_ntmp  82609
multi_builtin_quoter_all_backslashes_ntmp   61654

Converting to ms for comparison with lenormf's table, that's:

single_builtin_quoter_all_backslashes_ntmp  0.06ms
multi_builtin_quoter_all_backslashes_ntmp   0.08ms

Basically, the new multi-quoter is as fast as the old single-quoter, and the new single-quoter is even faster.

My current position:

alexherbo2 commented 4 years ago

@Screwtapello Does it slow it down to make the implementation a bit more readable?

# Prototype:
# kak_escape [text…]
#
# Description:
# Similar to `shell_escape` you may find in other programming languages,
# `kak_escape` escapes each argument so that it can be safely passed to Kakoune.
#
# Implementation:
# Single quotes each argument and doubles the single quotes inside.
#
# Note:
# The resulted text should be used unquoted and is not intended for use in double quotes, nor in single quotes.
#
# Example:
# kak_escape evaluate-commands -try-client "$KAKOUNE_CLIENT" 'echo Tchou' | kak -p "$KAKOUNE_SESSION"
#
kak_escape() {
  for text do
    printf "'"
    while true; do
      case "$text" in
        *"'"*)
          head=${text%%"'"*}
          tail=${text#*"'"}
          printf "%s''" "$head"
          text=$tail
          ;;
        *)
          printf "%s' " "$text"
          break
          ;;
      esac
    done
  done
  printf '\n'
}
Screwtapello commented 4 years ago

It does slow things down a bit, although you'll have to decide whether it's significant enough to worry about.

multi_builtin_quoter_all_backslashes_ntmp 61000
alexherbo2_readable_version               56514
alexherbo2_all_backslashes                57552

multi_builtin_quoter_all_backslashes_ntmp is the same fast implementation as before, although the score is a little different because it always varies from run to run.

alexherbo2_readable_version is the kak_escape() you pasted, unmodified.

alexherbo2_all_backslashes is the same thing, but with each instance of "'" replaced with \'. It's consistently about 1000 iterations per second faster in each of the three trials I ran. It's still 3000-4000 iterations slower than the fastest multi-quoter.

alexherbo2 commented 4 years ago

@Screwtapello Do you think a final newline character should be added to allow multiple invocations of the function?

Example:

kak_escape set-register / "$regex"
kak_escape echo -markup "$message"
Screwtapello commented 4 years ago

Printing a final newline makes that use-case much more convenient, and it wouldn't prevent people from saying $(kak_escape blah) (since $() trims the trailing newline). Is there some other situation here a trailing newline would cause problems? I can't think of one, but I also can't convince myself it's impossible.

I've been using multi_builtin_quoter_all_backslashes_ntmp() as kakquote() in my wiki plugin, I'll try adding a trailing newline and see how it goes.

voroskoi commented 4 years ago

This escaping is so common. I could imagine an helpers.kak file in rc dir, which could be loaded as a plugin. I do not think that would be much of an overhead. Is it? That would be much simpler than paste it to every single script.

alexherbo2 commented 4 years ago

3782

mralusw commented 3 years ago

Build up the string & only printf once

I think you can get even more performance by building up the string incrementally and then printf-ing at the end (multi_builtin_quoter_allbkslash_nt1p):

    out=
    for text; do
        out=$out"'"
        while :; do
            case "$text" in
                *\'*)
                    out=$out${text%%\'*}\'\'
                    text=${text#*\'}
                    ;;
                *)
                    out=$out$text"' "
                    break
                    ;;
            esac
        done
    done
    printf '%s' "$out"

Results

# dash
single_all_sed_quoter    1444
multi_builtin_quoter_all_backslashes_ntmp       119916
multi_builtin_quoter_allbkslash_nt1p    141666
# ash
single_all_sed_quoter    1447
multi_builtin_quoter_all_backslashes_ntmp       113750
multi_builtin_quoter_allbkslash_nt1p    135833
# busybox sh
single_all_sed_quoter    3225
multi_builtin_quoter_all_backslashes_ntmp       68463
multi_builtin_quoter_allbkslash_nt1p    81351
# bash
single_all_sed_quoter    1054
multi_builtin_quoter_all_backslashes_ntmp       28640
multi_builtin_quoter_allbkslash_nt1p    32585

Testing

I put a snippet up (which also accepts --timeout=, --quoters='q1 q2'). So (desktop i7, linux 5.10.45):

for s in dash ash 'busybox sh' bash; do
  echo "# $s"
  curl -s 'https://gitlab.com/kstr0k/mru-files.kak/-/snippets/2137556/raw/main/kak_quoter_benchmark.sh' | 
    ($s -s -- --timeout=3)
done
Screwtapello commented 3 years ago

Wow. I was pretty sure I'd squeezed all the juice out of that function already, but clearly not! I had assumed that printing was cheap (because stdout buffering) and string concatenation was expensive (because O(N²)) but apparently it's the other way around in this use-case. That's good to know!

mralusw commented 3 years ago

The devil with O(n) is in the hidden constant, as they say. It might be very different for longer strings (and there's also a 2x memory cost). In that case, however, it's probably a lot faster to call sed than to iterate in shell.

But actually it gets even more interesting. By pre-collapsing the strings a-la occivink, and avoiding variables completely (sh has $1, $2 etc which can be assigned via set --):

multi_builtin_quoter_allbkslash_nt1pset() {
    set -- "$*" ''  # $1=current $2=accumulator
    while :; do
        case "$1" in
            *\'*)
                set -- "${1#*\'}" "$2${1%%\'*}"\'\'
                ;;
            *)
                printf '%s' "'$2$1' "
                break
                ;;
        esac
    done
}

The results are some 50% faster compared to _ntmp:

# dash
single_all_sed_quoter    2810
multi_builtin_quoter_all_backslashes_ntmp       117752
multi_builtin_quoter_allbkslash_nt1p    137121
multi_builtin_quoter_allbkslash_nt1pset 176301
# ash
single_all_sed_quoter    2825
multi_builtin_quoter_all_backslashes_ntmp       119179
multi_builtin_quoter_allbkslash_nt1p    137391
multi_builtin_quoter_allbkslash_nt1pset 178956
# busybox sh
single_all_sed_quoter    5528
multi_builtin_quoter_all_backslashes_ntmp       65136
multi_builtin_quoter_allbkslash_nt1p    77243
multi_builtin_quoter_allbkslash_nt1pset 106483
# bash
single_all_sed_quoter    2074
multi_builtin_quoter_all_backslashes_ntmp       24680
multi_builtin_quoter_allbkslash_nt1p    28151
multi_builtin_quoter_allbkslash_nt1pset 37910
# zsh --emulate sh
single_all_sed_quoter    2213
multi_builtin_quoter_all_backslashes_ntmp       25533
multi_builtin_quoter_allbkslash_nt1p    29914
multi_builtin_quoter_allbkslash_nt1pset 37989
mralusw commented 3 years ago

... but of course, multi_*_nt1pset is actually a misnamed single-quoter, so I was comparing apples to coconuts. It's still some 10% faster than single_*_ntmp in most shell AFAICT. Sorry about that.