Open Screwtapello opened 4 years ago
Related #1224 #1430
@Screwtapello How about quoting each argument, instead of splatting?
kak_escape() {
for argument do
printf "'"
printf '%s' "$argument" | sed "s/'/''/g"
printf "'"
printf ' '
done
}
@Screwtapello I’m using the following pattern for my scripts.
rc/prelude.sh
kak_escape() {
for argument do
printf "'"
printf '%s' "$argument" | sed "s/'/''/g"
printf "'"
printf ' '
done
}
rc/chronic.kak
declare-option -hidden str chronic_path %sh(dirname "$kak_source")
provide-module chronic %{
define-command chronic -params .. -docstring 'Pipe to Chronic' %{
evaluate-commands -save-regs '|' %{
set-register | %sh{
. "$kak_opt_chronic_path/prelude.sh"
kak_quoted_arguments=$(kak_escape "$@")
printf 'chronic %s' "$kak_quoted_arguments"
}
execute-keys '|<ret>'
}
}
}
require-module chronic
How about quoting each argument, instead of splatting?
I like the modularity of having kakquote
do exactly one thing, so I can use a printf format string to build up the command, and be sure I've quoted each argument properly:
printf "eval -try-client %s echo -markup {Information}%s" \
"$(kakquote "$kak_client")"
"$(kakquote "$msg")"
...but now that I think about it, I could still do that with a quoter that quoted each argument individually. In fact, if I quoted each argument individually, I could do this:
kakquote eval -try-client "$kak_client" echo -markup {Information}"$msg"
...which is a lot nicer, although it quotes more than strictly necessary.
My one fear with processing arguments individually is that it might be a bit more expensive than quoting the whole string at once. But rather than guess wildly, I decided to write a little benchmark:
log() { echo "$*" >&2; }
test_quoter() {
actual=$("$1" "Don't Let's Start")
expected="'Don''t Let''s Start' "
if [ "$actual" != "$expected" ]; then
log "Quoter $1 failed: got $actual expected $expected"
return 1
fi
}
benchmark_quoter() {
quoter="$1"
timeout=5
(
trap "printf '%s\\t%5d\\n' \$quoter \$i >&2; exit" TERM
i=0
while true; do
"$1" "Don't" "Let's" "Start" >/dev/null
i=$(( i + 1 ))
done
) &
sleep "$timeout"
kill $!
wait
}
single_sed_quoter() {
printf "'"
printf "%s" "$*" | sed "s/'/''/g"
printf "' "
}
single_flat_sed_quoter() {
printf "'%s' " "$(printf "%s" "$*" | sed "s/'/''/g")"
}
multi_sed_quoter() {
for each; do
printf "'"
printf "%s" "$each" | sed "s/'/''/g"
printf "' "
done
}
single_builtin_quoter() {
text="$*"
printf "'"
while [ -n "$text" ]; do
case "$text" in
*"'"*)
printf "%s" "${text%%\'*}"
printf "''"
text=${text#*\'}
;;
*)
printf "%s" "$text"
text=""
;;
esac
done
printf "' "
}
multi_builtin_quoter() {
for text; do
printf "'"
while [ -n "$text" ]; do
case "$text" in
*"'"*)
printf "%s" "${text%%\'*}"
printf "''"
text=${text#*\'}
;;
*)
printf "%s" "$text"
text=""
;;
esac
done
printf "' "
done
}
quoters="
single_sed_quoter
single_flat_sed_quoter
multi_sed_quoter
single_builtin_quoter
multi_builtin_quoter
"
for each in $quoters; do
test_quoter "$each" && benchmark_quoter "$each"
done
single_sed_quoter
is basically my original kakquote
implementation above.
single_flat_sed_quoter
is the same thing, but squished into a single line, as suggested by lenormf.
multi_sed_quoter
is basically alexherbo2's quote-arguments-individually implementation.
single_builtin_quoter
implements quoting using only shell-builtins instead of launching sed as an external process. I know that launching external processes was frowned on back in the day, but I don't know how much of an impact it has on a modern system, so I figured I'd try it.
multi_builtin_quoter
is the builtin quoting algorithm running once-per-argument (like alexherbo's suggestion).
Note that test_quoter()
only tests the single-argument case, so it should work properly for both $@
and $*
quoters. benchmark_quoter()
provides multiple arguments, so quoters that do per-argument work should be slower.
Here's the results for whatever version of dash
Debian Testing is currently using (this is an edited screenshot; each result is the best of 3 runs):
single_sed_quoter 1696
single_flat_sed_quoter 1391
multi_sed_quoter 597
single_builtin_quoter 61700
multi_builtin_quoter 42153
busybox sh
is similar (again, each result is the best of 3):
single_sed_quoter 1645
single_flat_sed_quoter 1350
multi_sed_quoter 573
single_builtin_quoter 35660
multi_builtin_quoter 23224
bash
is quite a bit slower (again, each result in the best of 3):
single_sed_quoter 1424
single_flat_sed_quoter 978
multi_sed_quoter 470
single_builtin_quoter 15771
multi_builtin_quoter 10567
My original single_sed_quoter
function is pretty similar everywhere, interestingly enough.
single_flat_sed_quoter
is a bit slower, which surprises me, but I guess it's doing an additional string concatenation in RAM instead of just writing bytes to stdout?
multi_sed_quoter
invokes sed
three times instead of one, and runs at about a third the speed of the single quoter. That's reasonable, but it's probably not the implementation we want in the prelude.
single_builtin_quoter
is an ugly implementation, but wow is it fast. Turns out avoiding process forks is still pretty important! It ranges from 11.0 times as fast as single_sed_quoter
, up to 36.4 times as fast!
multi_builtin_quoter
is about two thirds of the speed of single_builtin_quoter
, although I bet that depends heavily on the number and size of the arguments. Still, it doesn't slow down as much as multi_sed_quoter
slows down, and it's still 7.4 to 24.9 times as fast as the fastest sed
-based quoter!
Overall, I think multi_builtin_quoter
is the right choice for the prelude - it's pretty fast, and it allows the very pleasant kakquote echo -markup {Error}"$msg"
style. It's too ugly for me to copy/paste into every %sh{}
block, so it would really benefit from being in the prelude.
Note that printf
can be a binary on some systems (I think some BSD), so relying on it being a built-in with a zero-overhead cost can backfire.
If the numbers above are the amount of times a given function has run in 5s, then here is the time spent on one iteration (first dataset only):
single_sed_quoter 2.95ms
single_flat_sed_quoter 3.59ms
multi_sed_quoter 8.37ms
single_builtin_quoter 0.08ms
multi_builtin_quoter 0.12ms
multi_sed_quoter
is not an implementation being considered because it spawns way too many processes, so we can discard it.
The other implementations all take less than 5ms to run, so knowing which one is N times faster than others is not relevant, because a single iteration of those implementations represents a negligible time loss, considering we're in the shell world.
single_flat_sed_quoter
looks better to me because it can be copy-pasted easily in scripts.
I'd argue that ~5ms is much too expensive for something as trivial as escaping a string. Consider that it's not uncommon to generate option lists and to have to escape each element.
OpenBSD's sh is the only platform shell that I've seen which doesn't have printf as builtin, and it would be trivial (although not obvious) for an OpenBSD user to set KAKOUNE_POSIX_SHELL to dash
or any other shell.
In addition, forking on windows (and on mac to a lesser degree) is much more expensive than on linux.
I have some new measurements. I was pointed to @occivink's quoting function in his kakoune-snippets package, which double-escapes apostrophes for some reason, but I edited it to match the other implementations:
occivink_builtin_quoter()
{
rest="$*"
printf "'"
while :; do
beforequote="${rest%%"'"*}"
if [ "$rest" = "$beforequote" ]; then
printf %s "$rest"
break
fi
printf "%s''" "$beforequote"
rest="${rest#*"'"}"
done
printf "' "
}
I also figured out how to escape a value entirely within sed
... which is not as nice as doing it entirely outside of sed, but it can fit on a single line:
single_all_sed_quoter() {
printf "%s" "$*" | sed -e "s/'/''/g; 1s/^/'/; \$s/\$/' /"
}
multi_all_sed_quoter() {
for arg; do
printf "%s" "$arg" | sed -e "s/'/''/g; 1s/^/'/; \$s/\$/' /"
done
}
Results for these new implementations, on the same laptop as before, best of 3 runs:
single_all_sed_quoter 1700
multi_all_sed_quoter 585
occivink_builtin_quoter 61013
single_all_sed_quoter
performs identically to single_sed_quoter
, but fits into 80 chars if you use a shorter function name, which makes it nice for non-prelude use as lenormf points out.
multi_all_sed_quoter
performs identically to multi_sed_quoter
.
occivink_builtin_quoter
performs just a little bit slower than single_builtin_quoter
, which I found interesting. When you use builtins to slice a string up to the next apostrophe and there is no apostrophe, you get back the original string rather than an empty string, so you have to treat the last chunk of text specially. occivink tested to see if the slice matched the unsliced value, while I used a case
to search for the string. I honestly wasn't sure which would be faster.
Apparently once you have a benchmarking harness, you can't stop coming up with new variants to benchmark. I have some new variations based on single_builtin_quoter
, with the following changes:
while [ -n "$text" ]
I have while true
and a break
, avoiding a testprintf "%s" blah; printf "''"
I have printf "%s''" blah
, saving a printf"'"
I match \'
, which seems to be very slightly fasterThe new versions look like this:
single_builtin_quoter_all_backslashes_ntmp() {
text="$*"
printf "'"
while true; do
case "$text" in
*\'*)
printf "%s''" "${text%%\'*}"
text=${text#*\'}
;;
*)
printf "%s' " "$text"
break
;;
esac
done
}
multi_builtin_quoter_all_backslashes_ntmp() {
for text; do
printf "'"
while true; do
case "$text" in
*\'*)
printf "%s''" "${text%%\'*}"
text=${text#*\'}
;;
*)
printf "%s' " "$text"
break
;;
esac
done
done
}
(the "ntmp" in the name is short for "no test, merged printfs")
Benchmarking them (as before):
single_builtin_quoter_all_backslashes_ntmp 82609
multi_builtin_quoter_all_backslashes_ntmp 61654
Converting to ms for comparison with lenormf's table, that's:
single_builtin_quoter_all_backslashes_ntmp 0.06ms
multi_builtin_quoter_all_backslashes_ntmp 0.08ms
Basically, the new multi-quoter is as fast as the old single-quoter, and the new single-quoter is even faster.
My current position:
multi_builtin_quoter_all_backslashes_ntmp
single_all_sed_quoter
is easy to copy/paste and the performance is decent.@Screwtapello Does it slow it down to make the implementation a bit more readable?
# Prototype:
# kak_escape [text…]
#
# Description:
# Similar to `shell_escape` you may find in other programming languages,
# `kak_escape` escapes each argument so that it can be safely passed to Kakoune.
#
# Implementation:
# Single quotes each argument and doubles the single quotes inside.
#
# Note:
# The resulted text should be used unquoted and is not intended for use in double quotes, nor in single quotes.
#
# Example:
# kak_escape evaluate-commands -try-client "$KAKOUNE_CLIENT" 'echo Tchou' | kak -p "$KAKOUNE_SESSION"
#
kak_escape() {
for text do
printf "'"
while true; do
case "$text" in
*"'"*)
head=${text%%"'"*}
tail=${text#*"'"}
printf "%s''" "$head"
text=$tail
;;
*)
printf "%s' " "$text"
break
;;
esac
done
done
printf '\n'
}
It does slow things down a bit, although you'll have to decide whether it's significant enough to worry about.
multi_builtin_quoter_all_backslashes_ntmp 61000
alexherbo2_readable_version 56514
alexherbo2_all_backslashes 57552
multi_builtin_quoter_all_backslashes_ntmp
is the same fast implementation as before, although the score is a little different because it always varies from run to run.
alexherbo2_readable_version
is the kak_escape()
you pasted, unmodified.
alexherbo2_all_backslashes
is the same thing, but with each instance of "'"
replaced with \'
. It's consistently about 1000 iterations per second faster in each of the three trials I ran. It's still 3000-4000 iterations slower than the fastest multi-quoter.
@Screwtapello Do you think a final newline character should be added to allow multiple invocations of the function?
Example:
kak_escape set-register / "$regex"
kak_escape echo -markup "$message"
Printing a final newline makes that use-case much more convenient, and it wouldn't prevent people from saying $(kak_escape blah)
(since $()
trims the trailing newline). Is there some other situation here a trailing newline would cause problems? I can't think of one, but I also can't convince myself it's impossible.
I've been using multi_builtin_quoter_all_backslashes_ntmp()
as kakquote()
in my wiki plugin, I'll try adding a trailing newline and see how it goes.
This escaping is so common. I could imagine an helpers.kak
file in rc dir, which could be loaded as a plugin. I do not think that would be much of an overhead. Is it?
That would be much simpler than paste it to every single script.
I think you can get even more performance by building up the string incrementally and then printf
-ing at the end (multi_builtin_quoter_allbkslash_nt1p
):
out=
for text; do
out=$out"'"
while :; do
case "$text" in
*\'*)
out=$out${text%%\'*}\'\'
text=${text#*\'}
;;
*)
out=$out$text"' "
break
;;
esac
done
done
printf '%s' "$out"
# dash
single_all_sed_quoter 1444
multi_builtin_quoter_all_backslashes_ntmp 119916
multi_builtin_quoter_allbkslash_nt1p 141666
# ash
single_all_sed_quoter 1447
multi_builtin_quoter_all_backslashes_ntmp 113750
multi_builtin_quoter_allbkslash_nt1p 135833
# busybox sh
single_all_sed_quoter 3225
multi_builtin_quoter_all_backslashes_ntmp 68463
multi_builtin_quoter_allbkslash_nt1p 81351
# bash
single_all_sed_quoter 1054
multi_builtin_quoter_all_backslashes_ntmp 28640
multi_builtin_quoter_allbkslash_nt1p 32585
I put a snippet up (which also accepts --timeout=
, --quoters='q1 q2'
). So (desktop i7, linux 5.10.45):
for s in dash ash 'busybox sh' bash; do
echo "# $s"
curl -s 'https://gitlab.com/kstr0k/mru-files.kak/-/snippets/2137556/raw/main/kak_quoter_benchmark.sh' |
($s -s -- --timeout=3)
done
Wow. I was pretty sure I'd squeezed all the juice out of that function already, but clearly not! I had assumed that printing was cheap (because stdout buffering) and string concatenation was expensive (because O(N²)) but apparently it's the other way around in this use-case. That's good to know!
The devil with O(n)
is in the hidden constant, as they say. It might be very different for longer strings (and there's also a 2x memory cost). In that case, however, it's probably a lot faster to call sed
than to iterate in shell.
But actually it gets even more interesting. By pre-collapsing the strings a-la occivink, and avoiding variables completely (sh
has $1
, $2
etc which can be assigned via set --
):
multi_builtin_quoter_allbkslash_nt1pset() {
set -- "$*" '' # $1=current $2=accumulator
while :; do
case "$1" in
*\'*)
set -- "${1#*\'}" "$2${1%%\'*}"\'\'
;;
*)
printf '%s' "'$2$1' "
break
;;
esac
done
}
The results are some 50% faster compared to _ntmp
:
# dash
single_all_sed_quoter 2810
multi_builtin_quoter_all_backslashes_ntmp 117752
multi_builtin_quoter_allbkslash_nt1p 137121
multi_builtin_quoter_allbkslash_nt1pset 176301
# ash
single_all_sed_quoter 2825
multi_builtin_quoter_all_backslashes_ntmp 119179
multi_builtin_quoter_allbkslash_nt1p 137391
multi_builtin_quoter_allbkslash_nt1pset 178956
# busybox sh
single_all_sed_quoter 5528
multi_builtin_quoter_all_backslashes_ntmp 65136
multi_builtin_quoter_allbkslash_nt1p 77243
multi_builtin_quoter_allbkslash_nt1pset 106483
# bash
single_all_sed_quoter 2074
multi_builtin_quoter_all_backslashes_ntmp 24680
multi_builtin_quoter_allbkslash_nt1p 28151
multi_builtin_quoter_allbkslash_nt1pset 37910
# zsh --emulate sh
single_all_sed_quoter 2213
multi_builtin_quoter_all_backslashes_ntmp 25533
multi_builtin_quoter_allbkslash_nt1p 29914
multi_builtin_quoter_allbkslash_nt1pset 37989
... but of course, multi_*_nt1pset
is actually a misnamed single-quoter, so I was comparing apples to coconuts. It's still some 10% faster than single_*_ntmp
in most shell AFAICT. Sorry about that.
Since Kakoune's quoting system was reworked, it's pretty easy to reliably quote Kakoune strings by just doubling apostrophes. In shell, it looks something like this:
However, while working on #3336, I've copied and pasted this fragment into three or four shell blocks already; I expect it (or something like it) is already present in a bunch of other scripts too.
We already inject environment variables into shell blocks; how about we prepend this helpful and near-universally-useful function as well?
(it should be possible to implement this in pure POSIX shell without
sed
, but it would be too complex. If avoiding a fork turns out to be a performance improvement, it'd be nice to make the change once in Kakoune's shell prelude, and not have to copy/paste the change into thousands of existing Kakoune scripts)