romkatv / zsh-bench

Benchmark for interactive Zsh
MIT License

Prompt times don't match the real times in a screen recording, and zsh4humans seems to be getting quite a good advantage #5

Closed. ericbn closed this issue 2 years ago.

ericbn commented 2 years ago

It's hard to be sure, just by looking at its code, how exactly zsh-bench is measuring what it claims to be measuring.

I'm going to trust a screen recording as the source of truth for the prompt times on my machine. I've recorded zim and zsh4humans, 3 times each. Each time the steps are:

  1. Sleep 1 second. In this time I type exit and ENTER, so it's used as input in the next steps.
  2. Print the framework name. I consider that counting starts when the output of this print appears in the recording.
  3. Start a new shell with HOME set to the specific framework installation dir. I consider that counting ends when the prompt appears.

This translates to: for f in zim zsh4humans; do repeat 3 do; sleep 1; print ${f}; HOME=${PWD:h}/${f} zsh -li; done; done

The steps above are executed in a dir with a git repo with 1,000 directories and 10,000 files, set up using the code here and here.

Each framework was installed with the setup script here and here, respectively for zim and zsh4humans.

This should be enough to guarantee that the recordings are using the same scenario used by zsh-bench.

This is the screen recording:

[Screen recording: Kapture 2021-10-24 at 16 54 36]

These are the times extracted from checking the recording frames:

| config | run | begin_frame | prompt_frame | frames | ms |
|---|---|---|---|---|---|
| zim | 1 | 161 | 169 | 9 | 300.000 |
| zim | 2 | 200 | 207 | 8 | 266.667 |
| zim | 3 | 238 | 246 | 9 | 300.000 |
| zsh4humans | 1 | 276 | 282 | 7 | 233.333 |
| zsh4humans | 2 | 316 | 321 | 6 | 200.000 |
| zsh4humans | 3 | 354 | 359 | 6 | 200.000 |

The recording has 30fps, so each frame represents 33.333ms. We can consider the error of each measurement above to be +/- 33.333ms.
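
As a sanity check, the frames-to-milliseconds conversion is easy to script (a quick zsh sketch, assuming a constant 30fps; the frame counts are the ones from the table above):

# convert frame counts to milliseconds at 30fps; e.g. 9 frames -> 300.000 ms
fps=30
for frames in 9 8 9 7 6 6; do
  printf '%d frames = %.3f ms\n' $frames $(( frames * 1000.0 / fps ))
done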

This is what I get when running zsh-bench on the same machine. I ran it 3 times for each of the mentioned frameworks:

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=115.790
first_command_lag_ms=149.910
command_lag_ms=62.535
input_lag_ms=28.830
exit_time_ms=47.530

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=196.126
first_command_lag_ms=229.765
command_lag_ms=142.110
input_lag_ms=29.402
exit_time_ms=48.146

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=196.878
first_command_lag_ms=231.291
command_lag_ms=140.756
input_lag_ms=27.556
exit_time_ms=48.560

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=30.361
first_command_lag_ms=100.323
command_lag_ms=4.997
input_lag_ms=14.536
exit_time_ms=10.895

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=30.458
first_command_lag_ms=101.612
command_lag_ms=5.198
input_lag_ms=12.339
exit_time_ms=11.051

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=29.819
first_command_lag_ms=108.122
command_lag_ms=5.123
input_lag_ms=13.146
exit_time_ms=11.05

What is odd

The zim first_prompt_lag_ms times in zsh-bench fluctuate quite a lot: an average of 169.598ms with a stdev of 46.601! In the recordings the average was 288.889ms with a stdev of 19.245, which was actually a stdev of 0.577 in terms of frames (it's just a one-frame difference in one of the recordings). zsh-bench should be more precise than a 30fps recording.
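
For reference, here is the arithmetic behind these numbers as a zsh sketch (sample stdev with an n-1 denominator):

zmodload zsh/mathfunc
vals=(115.790 196.126 196.878)  # zim first_prompt_lag_ms from the three runs above
(( mean = (vals[1] + vals[2] + vals[3]) / 3.0 ))
(( var  = ((vals[1] - mean) ** 2 + (vals[2] - mean) ** 2 + (vals[3] - mean) ** 2) / 2.0 ))
printf 'mean=%.3f stdev=%.3f\n' $mean $(( sqrt(var) ))  # mean=169.598 stdev=46.601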

Also, from zsh-bench's code it looks like this output is the min value. It would be good to know the stdev of the values inside a zsh-bench run, to make sure the values are not fluctuating.

I could compare the minimum recording time with the minimum zsh-bench time for each framework, but let's use the averages since the stdev of zsh-bench for zim was so high. So for completeness of information, the first_prompt_lag_ms average for zsh4humans is 30.213ms, with a stdev of 0.344. And in the recordings the average was 244.444ms (the stdev was 50.918, but again it was just one frame more in one of the recordings).

Then, it's odd that zim is 1.703x faster in zsh-bench than in the recordings (I was expecting much closer values), and zsh4humans is 8.091x faster in zsh-bench than in the recordings!

If anything, zsh-bench is favoring zsh4humans by a lot, both by giving it more stable measurements and by giving it times way faster than the real ones.

romkatv commented 2 years ago

If anything, zsh-bench is favoring zsh4humans by a lot

To my knowledge, it does not.

The procedure you've used for benchmarking is underspecified, so I cannot replicate your results. It also doesn't model real-life use of zsh. Starting a login shell and immediately closing it is quite unusual, and so is overriding HOME. It's cool that you want to validate zsh-bench results, and I'd love to see that happening, so let's figure out a way to do it right.

It's vital to perform all measurements in a container to allow everyone to create the same environment. I've created a script that spins up a docker container and sets up zsh in it. You can use it like so:

curl -fsSLO https://raw.githubusercontent.com/romkatv/zsh-bench/master/playground
chmod +x ./playground
./playground zim

Instead of zim you can pass zsh4humans or any other name of a predefined config.

After a few minutes you should have interactive zsh with ~/repo as the current working directory. This is a git repository with 1k directories and 10k files.

You can install additional software with sudo apt install <pkg>.

If you want to record a video of starting a login zsh, use this command:

clear; print start; zsh -li

Once "start" appears on the screen, start the timer. Once "repo" (the current directory) appears, stop it. That's first prompt lag.

If you decide to base your benchmark results on a video recording, please specify the exact command that everyone can use to record video the same way. Any non-reproducible step in your methodology will void the results, so be precise.

Instead of video you can record raw output of zsh to the TTY complete with timing of all writes. This is what zsh-bench does. Here's how you can start a login shell and capture its writes to the TTY:

clear; script -fqet$HOME/timing ~/typescript

Run a few commands and exit zsh with exit as usual. Afterwards you can look at the raw output to the TTY:

cat -v ~/typescript | less

And the timing data:

less ~/timing

The format of the timing data is documented in man script:

The log contains two fields, separated by a space. The first field indicates how much time elapsed since the previous output. The second field indicates how many characters were output this time.
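
For example, you can total the elapsed-time column to get the overall duration of the session (a sketch using the ~/timing file produced by the script command above):

awk '{ total += $1 } END { printf "total: %.3fs\n", total }' ~/timing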

You can replay a recorded session like this:

clear; scriptreplay -t ~/timing ~/typescript

Don't resize your TTY or it won't replay properly.

I would very much prefer if you share ~/typescript and ~/timing instead of a video. They are much easier to work with.

You can also try using asciinema instead of script to record and replay a session. I don't know what kind of latency it adds on its own but in terms of capabilities it should work just as well as script. The advantage of asciinema is its output format -- a single json file with data and timing. In addition, asciinema sessions can be easily uploaded and watched in a browser.
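
For example (assuming asciinema is installed in the container):

asciinema rec ~/demo.cast   # record a session; exit the shell to stop
asciinema play ~/demo.cast  # replay it in the terminal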

You can run zsh-bench in the container like this:

~/zsh-bench/zsh-bench --iters 16

All benchmark data that I've published was acquired with --iters 64 but it takes a long time to run. With --iters 16 results are also decent.

You can edit rc files if you need to test something (install nano or vim for this). Restart zsh with exec zsh afterwards. ~/zsh-bench/zsh-bench will be affected by these changes, too.

If you find anything unexpected, or if your measurements match that of zsh-bench, please share either way.

Also, from zsh-bench's code it looks like this output is the min value.

Correct.

It would be good to know the stdev of the values inside a zsh-bench run...

If you send a clean PR, I'll merge it.

... to make sure the values are not fluctuating.

If you take a minimum of several runs, these fluctuations won't matter.

let's use the averages

Minimum is better in benchmarks of this kind, especially when run on commodity hardware with CPU scaling, background services and whatnot. Minimum fails only when workloads differ across iterations, but in these benchmarks they are the same.
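
For example, taking the minimum of several runs in zsh (a sketch; the values are the zim first_prompt_lag_ms numbers reported above):

runs=(115.790 196.126 196.878)
print -r -- "min: ${${(on)runs}[1]}"  # sort numerically ascending, take the first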

romkatv commented 2 years ago

Here are sample results from my dev machine. I've abbreviated all prompts to %.

Machine info:

% grep 'model name' /proc/cpuinfo | head -1
model name      : AMD Ryzen Threadripper 3970X 32-Core Processor

% uname -sm
Linux x86_64

Setup:

% curl -fsSLO https://raw.githubusercontent.com/romkatv/zsh-bench/master/playground
% chmod +x ./playground

Zim:

% ./playground zim
==> starting ubuntu:impish in docker ...
==> setting up a container ...

% ~/zsh-bench/zsh-bench --iters 64
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=116.601
first_command_lag_ms=133.514
command_lag_ms=96.829
input_lag_ms=11.980
exit_time_ms=28.159

% clear; script -fqet$HOME/timing ~/typescript

% exit  # exit from `script`

zsh4humans:

% ./playground zsh4humans
==> starting ubuntu:impish in docker ...
==> setting up a container ...

% ~/zsh-bench/zsh-bench --iters 64
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=14.943
first_command_lag_ms=55.096
command_lag_ms=2.366
input_lag_ms=4.725
exit_time_ms=5.969

% clear; script -fqet$HOME/timing ~/typescript

% exit  # exit from `script`

Script logs:

script-logs.tar.gz

Note that zsh4humans starts tmux, as indicated by creates_tty in the output of zsh-bench. You can disable this with the following command:

sed -i -E '/start-tmux/ s/integrated/no/' ~/.zshrc

ericbn commented 2 years ago

The procedure you've used for benchmarking is underspecified, so I cannot replicate your results.

The only thing I haven't specified is what I've used to do the screen recording. I've used Kap, but any screen recording software will do.

It also doesn't model real-life use of zsh. Starting login shell and immediately closing it is quite unusual, and so is overriding HOME.

Starting a login shell and immediately closing it is enough to measure the time for the first prompt to appear. The prompt should not take a different time to appear because the first command after it is exit. I'm starting zsh with HOME=/path/to/new/home zsh -il to have a shell with a different home, which can be unusual because we usually don't need a different home, but is perfectly valid. You'll get a perfect "real-life" zsh shell with a new home with that, which is what is needed to benchmark the different frameworks.

It's cool that you want to validate zsh-bench results, and I'd love to see that happening, so let's figure out a way to do it right.

It does not look like you'd love to see that happening, as you didn't even bother to understand how to replicate my results, and already tried to minimize what I did in every sentence of your paragraph above.

Once "start" appears on the screen, start the timer. Once "repo" (the current directory) appears, stop it. That's first prompt lag.

That's exactly what I did. I just printed the framework name instead of "start", which I hope you agree should not make any difference.

Minimum is better in benchmarks of this kind, especially when ran on commodity hardware with CPU scaling, background services and whatnot.

This is just because you say so. You find it very easy to say anything just to keep your current position.

EDIT: Actually it's recommended to compare all data points instead of just the average. See Don't Compare Averages. And before you say the article title means that it's better to use minimum instead of averages, it does not. The article starts with "In business meetings, it's common to compare groups of numbers by comparing their averages", and then goes on from there.

ericbn commented 2 years ago

Note that zsh4humans starts tmux, as indicated by creates_tty in the output of zsh-bench.

You can start whatever you want in your framework, and whatever runs as part of the shell startup until the first prompt appears should be counted in the time from shell start to first prompt appearance.

romkatv commented 2 years ago

The only thing I haven't specified is what I've used to do the screen recording. I've used Kap, but any screen recording software will do.

In order to replicate your results I need the full list of commands that I can run on a freshly installed OS.

Starting a login shell and immediately closing it is enough to measure the time for the first prompt to appear. The prompt should not take a different time to appear because the first command after it is exit.

If a benchmark doesn't measure real-life performance, it's a problem with the benchmark. You've added sleep 1 to the zim benchmark for the same reason. Starting zim in a tight loop has terrible performance, but that doesn't matter because in real life no one starts a login shell in a tight loop. You don't need to "fix" zim in this regard -- it's not broken. So fixing the benchmark was the right call. We want to measure real-life performance, so we adapt the benchmark until it starts measuring real-life performance. We don't change real life to adapt to a benchmark.

I'm starting zsh with HOME=/path/to/new/home zsh -il to have a shell with a different home, which can be unusual because we usually don't need a different home, but is perfectly valid. You'll get a perfect "real-life" zsh shell with a new home with that, which is what is needed to benchmark the different frameworks.

The closer you get to real life usage, the less likely that you are measuring wrong things. In real life people don't override HOME, so it's better to not do that. If you can demonstrate that overriding HOME never affects benchmark results, then you can override HOME but it makes the benchmark more brittle. If my benchmark doesn't override HOME and yours does, and our results disagree, my results are more authoritative because my benchmark better models real life usage. Let's stay as close as possible to real life and let's specify benchmark conditions as fully and precisely as possible. It's much easier to argue about code when we both can run it than it is to argue about words. Easier to argue implies easier to sort out differences and come to an understanding of where the discrepancies in the measurements lie.

It does not look like you'd love to see that happening, as you didn't even bother to understand how to replicate my results, and already tried to minimize what I did in every sentence of your paragraph above.

For the sake of efficiency, let's stay on the subject matter and avoid speculations about each other's motivation, character, attitude, etc.

I'm willing to replicate your results if you give me instructions as specific as the ones I gave you. It should be literally "start with a freshly installed OS and run this series of commands". All commands must have their source available. I would greatly appreciate if the OS is also free and open source -- Linux, FreeBSD, etc. If I get different results when I run the same commands as you, we'll narrow it down to things we've failed to control (machine, cosmic rays, etc.).

Another way to do this is to start with the specific commands that we already have -- the ones I posted in my previous comment. I can run them and you can run them. You don't have to accept the numbers printed by zsh-bench as authoritative. Go ahead and record TTY on video, or via script, or via asciinema. If you find a mismatch in numbers by using a different methodology, post the instructions for me to replicate your measurements (ideally packaged as a script). It would save a lot of time if you start with the container I've created but if you want to create your own, that's also fine. Just post the code so that I can have the same container on my machine.

EDIT: Actually it's recommended to compare all data points instead of just the average. See Don't Compare Averages. And before you say the article title means that it's better to use minimum instead of averages, it does not. The article starts with "In business meetings, it's common to compare groups of numbers by comparing their averages", and then goes on from there.

Let's postpone this discussion. It's significantly less important but it can derail us.

You can start whatever you want in your framework, and whatever runs as part of the shell startup until the first prompt appears should be counted in the time from shell start to first prompt appearance.

This is obviously true. I'm probably missing a point you are trying to make here.

romkatv commented 2 years ago

After reading the issue description a few more times, here's my theory as to why you see the discrepancy between first prompt lag as reported by zsh-bench and as measured by yourself.

Firstly, when you run ./zsh-bench zim, it runs in docker, hence on a Linux kernel. Performance on Linux can be very different from performance on macOS (I'm assuming your host OS is macOS). In addition, zsh startup files often take different code paths on different operating systems. They are also affected by the availability of external commands installed on the machine. I think if you run this:

curl -fsSLO https://raw.githubusercontent.com/romkatv/zsh-bench/master/playground
chmod +x ./playground
./playground zim

... and then run clear; print start; zsh -li while recording your screen, you'll see agreement with the numbers from ~/zsh-bench/zsh-bench --iters 16 (you need to run the latter command also in the container). Those numbers will also be lower than what you've seen on your macOS host. I highly recommend avoiding buffering input before starting the shell. It doesn't improve the quality of the benchmark at all and only introduces complexity. Type exit after you see the prompt.

There is also the issue with initialization of zsh in the procedure you've used. I already mentioned it in https://github.com/zimfw/zsh-framework-benchmark/pull/8#issuecomment-945638544 (the second bullet point). In order to let zsh4humans initialize, you need to start it at least once and let zle enter idle state. In practice this happens naturally because people aren't sending buffered input with exit in it to an interactive shell so that zsh terminates before it has a chance to read all of the input.

The most reliable way to warmup zsh after installing a config is by the following sequence:

% HOME=${PWD:h}/zsh4humans zsh -li
% :
% exit  # run this 1s after the previous command completes

zsh-bench does this automatically. After this you can run your benchmark as before although I once again should mention that buffering "exit" as input isn't achieving anything useful.

Edit: When you are running zsh-bench with positional arguments -- as in zsh-bench zim -- there is no need to run it several times because this command already performs multiple iterations (you can specify how many with --iters N). You can also specify multiple configs at once: zsh-bench zim zsh4humans. It'll be faster because zsh-bench will set up the container only once. It'll still provide great isolation because zsh-bench deletes the test users and wipes /tmp before each benchmark. playground doesn't currently allow you to run both zim and zsh4humans in the same container, so you'll need to create a separate container for each.

ericbn commented 2 years ago

In order to replicate your results I need the full list of commands that I can run on a freshly installed OS.

Cool, let me work on that. I purposely want to keep the steps where the screen is recorded and the recording frames are verified as manual steps, as I want to make sure no automation is interfering with the results.

Performance on Linux can be very different from performance on macOS (I'm assuming your host OS is macOS).

Yes, I'm using macOS, and yes, there seems to be more going on in the /etc startup files on macOS than on Ubuntu, which might explain why zsh-bench reports a quicker time than the recording I did on my macOS. What do you think about running the benchmarks with zsh -o NO_GLOBAL_RCS -li in the future?

while recording your screen, you'll see agreement with the numbers from ~/zsh-bench/zsh-bench --iters 16 (you need to run the latter command also in the container).

Sure. Let me use your playground container, or another container as close to it as possible, so the recordings can be compared against the expected results from zsh-bench.

In order to let zsh4humans initialize, you need to start it at least once and let zle enter idle state.

Good to know. I'll make sure to do that as the first step before actually doing the frames counting in the recordings.

there is no need to run [zsh-bench] several times because it already performs multiple iterations

I know. I just wanted to show how the results fluctuate between subsequent zsh-bench zim runs. And in comparison, my subsequent recording times didn't fluctuate as much.

I highly recommend avoiding buffering input before starting the shell. It doesn't improve the quality of the benchmark at all and only introduces complexity. Type exit after you see the prompt.

I thought you were measuring the first command lag like this: Type the command before the prompt appears, and measure how long it takes for the command output to actually appear after the prompt appears. (If we type it after the prompt appears, there's already a lag between the time the prompt appeared and the time the command was typed, at least when typing manually.) If not, then how are you measuring it? And which command are you specifically using in zsh-bench?

romkatv commented 2 years ago

I purposely want to keep the steps where the screen is recorded and the recording frames are verified as manual steps, as I want to make sure no automation is interfering with the results.

I'd rather have both of these automated and then we can also manually perform the same steps to double-check that the automation produces the same results.

(Frankly, I don't think recording a video is a wise approach. Recording the raw writes to the TTY is both easier and more precise. But if you want to go with video, that's fine with me. It's more work but the numbers should match eventually after we iron out potential quirks with vsync, adaptive frame rate, codecs, buffering, etc.)

Performance on Linux can be very different from performance on macOS (I'm assuming your host OS is macOS).

Yes, I'm using macOS, and yes, there seems to be more going on in the /etc startup files on macOS than on Ubuntu, which might explain why zsh-bench reports a quicker time than the recording I did on my macOS.

You can also run zsh-bench without positional arguments to benchmark your own local login shell. You'll have to install script from util-linux though, and I don't know how easy or hard that is on macOS. If you manage to install it, create a local user, set up any kind of zsh environment you like and run zsh-bench to see how that shell performs on macOS. You can also use your regular user if you have a way to restore your rc files (e.g., from git).

What do you think about running the benchmarks with zsh -o NO_GLOBAL_RCS -li in the future?

The config called "zim" is for the default experience. It measures the performance of zim that a potential user would see if they installed zim and made no modifications. Similarly with "zsh4humans". The config called zim+ disables everything -- including global rcs -- and enables only the core functionality (what's considered core is in the docs of zsh-bench) with all possible optimizations. It shows the baseline latency for a config that one could build on top of zim. If zsh-bench ran all configs with no_global_rcs, it wouldn't be able to benchmark some configs that people actually use. E.g., it wouldn't be able to tell us what kind of performance people should expect if they installed zim or zsh4humans and used it without modifications.

I know. I just wanted to show how the results fluctuate between subsequent zsh-bench zim runs.

It looks like the first run was fast and the other two runs had nearly identical, higher first prompt lag. Perhaps something was running on your machine at the same time? Try using --iters 64, close all applications and don't do anything while the benchmark is running. As I mentioned, this is the value I used for all benchmarks for all published results.

I thought you were measuring the first command lag like this: Type the command before the prompt appears, and measure how long it takes for the command output to actually appear after the prompt appears. (If we type it after the prompt appears, there's already a lag between the time the prompt appeared and the time the command was typed, at least when typing manually.) If not, then how are you measuring it? And which command are you specifically using in zsh-bench?

Yes, that's how zsh-bench measures first command lag. But you aren't measuring first command lag, you are measuring first prompt lag. If you want to measure first command lag, you need to buffer print first""-""command and then track when "first-command" appears on the screen. Note that first command lag depends on the length of the command if syntax highlighting is enabled. Buffering "exit" doesn't allow you to measure first command lag and doesn't help with measuring first prompt lag.
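
Here is a rough sketch of the buffered-input idea using zsh/zpty (illustrative only; this is not zsh-bench's actual implementation):

zmodload zsh/zpty zsh/datetime
t0=$EPOCHREALTIME                      # treat this as shell start time
zpty z zsh -li                         # start a login zsh on a pseudo-terminal
zpty -w z 'print first""-""command'    # buffered: sent before the first prompt exists
zpty -r z out '*first-command*'        # block until the command's output appears
printf 'first command lag: ~%.0f ms\n' $(( (EPOCHREALTIME - t0) * 1000 ))
zpty -w z exit
zpty -d z

The quoting trick matters here: the pattern matches the output of print, not the echo of the typed command, because the echoed input still contains the quotes.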

I forgot to mention that you can replay the screen of the TTY that zsh-bench is acting on. Let's say you are in playground zim. Run this:

rm -rf -- '' /tmp/zsh-bench-*(D) && ~/zsh-bench/zsh-bench --iters 1 --keep

Edit: I've added the crucial --keep to the command above.

Once this command finishes, you can see what zsh-bench was doing:

clear; scriptreplay -t /tmp/zsh-bench-*/timing /tmp/zsh-bench-*/out

This works only if your TERM is compatible with xterm-256color. For a slower replay, you can multiply (or increase by a constant) the numbers in the first column of /tmp/zsh-bench-*/timing.
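
For example, to replay at one fifth of the speed (a sketch; assumes a single /tmp/zsh-bench-* directory):

awk '{ $1 = $1 * 5; print }' /tmp/zsh-bench-*/timing > /tmp/timing-slow
clear; scriptreplay -t /tmp/timing-slow /tmp/zsh-bench-*/out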

This method is very useful when debugging zsh configs. It allows you to see errors if there are any and to generally verify that things work as they are supposed to.

You'll see that at first zsh-bench issues a short command that sources something. That's the first command, and it gets sent to the TTY right away without waiting for the first prompt. The sourced file prints output, and the appearance of that output signifies execution of the first command.

Then zsh-bench starts measuring input latency. It prints a fairly long command composed of a bunch of escaped abc tokens (they are escaped to avoid triggering alias expansion, which can significantly slow down syntax highlighting), types one more character and waits for that character to be processed by zle. This is performed several times and then the command line is cleared without execution.

Then zsh-bench measures command lag by spamming ENTER and seeing how many prompts it would get.

Once you play around with script and zsh-bench a bit and verify some numbers with video recording, it shouldn't be hard to see how zsh-bench measures things and that the way it measures it corresponds to what we actually want to measure. You can even run clear; print start; script -fqet$HOME/timing ~/typescript in the container and record a video. This will allow you to verify that latencies extracted from the logs of script match latencies that you get from video.

Eriner commented 2 years ago

Sorry, just catching up on all this and discussion in the other benchmarking repo.

@ericbn My memory is hazy and I don't remember if there is any interaction with login shells when using zpty, which my original script used, and if there is/was it may not be documented.

Awesome that you were able to generate so many stats. Reading all of this was really the best kind of issue report :)

@romkatv this is a really cool tool, although I'll admit I haven't looked deeply into your zsh4humans project. If the claim to fame is that it is fast, awesome! Fixing the horrendously slow prompt lag on slower devices (first generation raspberry pis, old old intel chips, etc) that came with other frameworks was one of my goals when I started zim.

Part of the original design of zim was also to allow the "heavier" tasks to run at the login shell, with the idea that it usually only happens once per interactive session. That is to say, when you sit down and login or connect, you're only going to do it once per session/terminal. This included things like zcompile operations, initialization of the ssh/gpg agents, etc. My perception was, "if the user is forced to wait, make them wait one time and only one time".

In that same line referenced above, you can see that an explicit and seemingly superfluous zsh process is launched with the ZDOTDIR of the to-be-benchmarked framework rc's. This allowed first-time initialization of each framework (deliberately excluding first-run initialization from the measured first prompt lag). This isn't to say that it doesn't matter, it was just an intentional compromise.

The design obviously wasn't perfect, but it was a good compromise versus trying to make things asynchronous until the cows come home. By happenstance, working on my original zsh framework benchmarking tool is where I learned precisely how painful that is ;)

My only question: do you run login shells frequently? If you open a new terminal, is it always a login shell? I ask only because it doesn't match my personal experience.

I'd be interested to see these benchmarks run on a variety of devices. There is a huge difference in speed based on IO devices. This is why the original benchmarks for zim ran on a variety of devices. In some circumstances, a machine with a Celeron chip and an NVMe drive will perform better than a machine with a Ryzen Threadripper and an HDD. (Zim was knowingly designed with file IO in mind, hence the zcompile of user RC files that you disagree with.) You reference your CPU here, but not the disk: https://github.com/romkatv/zsh-bench#full-benchmark-data

cheers! 🍻

romkatv commented 2 years ago

My only question: do you run login shells frequently?

Not particularly. Why do you ask?

If you open a new terminal, is it always a login shell?

Yes. Note that this is the default in Apple Terminal and iTerm2. You can change it but then .zlogin and .zprofile will never get sourced, not even on boot.

I'd be interested to see these benchmarks run on a variety of devices.

Cool. If you run the benchmarks on other devices, please share the results.

You reference your CPU here, but not the disk: https://github.com/romkatv/zsh-bench#full-benchmark-data

It's NVMe M.2. I added this information to the doc.

romkatv commented 2 years ago

I'd be interested to see these benchmarks run on a variety of devices.

Cool. If you run the benchmarks on other devices, please share the results.

Out of curiosity I've benchmarked zim and zsh4humans on a Raspberry Pi 4 4G Model B with a micro SD card for storage.

==> benchmarking zim
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=315.364
first_command_lag_ms=387.934
command_lag_ms=229.512
input_lag_ms=69.796
exit_time_ms=65.726

==> benchmarking zsh4humans
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=61.945
first_command_lag_ms=233.866
command_lag_ms=13.231
input_lag_ms=56.386
exit_time_ms=19.132

Edit: See more accurate results below.

Eriner commented 2 years ago

Edit: the contents below were edited by @romkatv. I, @eriner, did not author any of the text below.

Re: the rpi4 benchmarks, I'd bet the results are even more pronounced for an rpi1

What do you mean by "pronounced"? Which conclusions did you draw from the benchmark results I posted above?

I reran the benchmarks today with proper isolation and --iters 64 to match the conditions of other published benchmark results. This time I also included no-rcs and allowed my pi to cool down between benchmarks to avoid thermal throttling.

Raw:

==> benchmarking no-rcs ...
creates_tty=0
has_compsys=0
has_syntax_highlighting=0
has_autosuggestions=0
has_git_prompt=0
first_prompt_lag_ms=0.028
first_command_lag_ms=0.876
command_lag_ms=0.212
input_lag_ms=0.993
exit_time_ms=10.249

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=315.910
first_command_lag_ms=386.317
command_lag_ms=227.338
input_lag_ms=69.142
exit_time_ms=65.413

==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=58.987
first_command_lag_ms=222.734
command_lag_ms=12.957
input_lag_ms=27.730
exit_time_ms=19.110

Normalized:

| config | tmux | compsys | syntax highlight | auto suggest | git prompt | first prompt lag | first cmd lag | cmd lag | input lag |
|---|---|---|---|---|---|---|---|---|---|
| no-rcs | | | | | | 0% 🟢 | 1% 🟢 | 2% 🟢 | 5% 🟢 |
| zim | | ✔️ | ✔️ | ✔️ | ✔️ | 632% 🔴 | 258% 🔴 | 2273% 🔴 | 346% 🔴 |
| zsh4humans | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 118% 🟠 | 148% 🟠 | 130% 🟠 | 139% 🟠 |

For comparison, here are the results from my desktop:

| config | tmux | compsys | syntax highlight | auto suggest | git prompt | first prompt lag | first cmd lag | cmd lag | input lag |
|---|---|---|---|---|---|---|---|---|---|
| no-rcs | | | | | | 0% 🟢 | 0% 🟢 | 1% 🟢 | 1% 🟢 |
| zim | | ✔️ | ✔️ | ✔️ | ✔️ | 252% 🔴 | 95% 🟡 | 974% 🔴 | 61% 🟡 |
| zsh4humans | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 20% 🟢 | 34% 🟢 | 24% 🟢 | 24% 🟢 |

On the pi all latencies are between 2.3 and 5.9 times higher. That's a fairly narrow range considering the differences between these machines on different dimensions: CPU architecture, single core performance, number of cores, I/O, etc. It's worth noting that the pi performs remarkably well.
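
For example, zim's pi/desktop ratios can be computed directly from the raw numbers (a zsh sketch; the desktop values are from the --iters 64 run earlier in this thread):

for pi desktop in 315.910 116.601 386.317 133.514 227.338 96.829 69.142 11.980; do
  printf '%.2fx\n' $(( pi / desktop ))  # 2.71x 2.89x 2.35x 5.77x
done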

romkatv commented 2 years ago

Cool, let me work on that.

Hey @ericbn, have you made any progress on that?

Eriner commented 2 years ago

@romkatv I meant that when things are slow, it is more obvious. Rather than just using benchmarks comparing against vanilla/no rcs, running in various real world conditions is more demonstrative of the problem.

That is to say, seeing that it takes 300ms to load a prompt on a Rpi means something to me as a reviewer, whereas just knowing that it takes XXX% longer to load vs vanilla doesn't convey as much practical information (to me, personally).

I've mentioned this to @ericbn in the past (I think, maybe I kept it to myself) but I've always wondered what load times would be like if all sourced files were written into a single file and sourced. Obviously there are practical reasons that make this tricky both in the short-term for the test and long-term for utilization (variable scoping, change management, etc). However, I'd bet that there is a massive speed up purely from concatenating all of the files - much less random read disk IO.

What do you mean by "pronounced"? Which conclusions did you draw from the benchmark results I posted above?

In terms of conclusions, not having experimented myself and going purely by intuition, I'd guess that disk IO is one of the symptoms that zim is manifesting.

Put another way (again knowing nothing about zsh4humans), I'd assume that zsh4humans reads fewer files on disk and has less read IO. And maybe not, please correct me if I'm wrong.

romkatv commented 2 years ago

That is to say, seeing that it takes 300ms to load a prompt on a Rpi means something to me as a reviewer

I guess it tells you exactly what it tells you and nothing more. If the benchmark told you that it takes 100ms on MacBook (or Acer, Dell, or whatever machine you are actually using), would you find this less useful than the data point from Rpi?

just knowing that it takes XXX% longer to load vs vanilla doesn't convey as much practical information (to me, personally).

I agree that presenting benchmark numbers as "XXX% slower than vanilla" is not useful (in zsh-bench "vanilla" is no-rcs). I'm not aware of any benchmark that actually does this.

I've always wondered what load times would be like if all sourced files were written into a single file and sourced.

Now we know.

I'd bet that there is a massive speed up purely from concatenating all of the files

Apparently not.

In terms of conclusions, not having experimented myself and going purely by intuition, I'd guess that disk IO is one of the symptoms that zim is manifesting.

Which symptoms do you mean?

Put another way (again knowing nothing about zsh4humans), I'd assume that zsh4humans reads fewer files on disk and has less read IO. And maybe not, please correct me if I'm wrong.

Under the benchmark conditions zsh4humans reads more files than zim.

Eriner commented 2 years ago

Put yet another way, I don't see a prompt taking a reasonable amount of time to load as a problem. Sub 1s? Acceptable. Sub 500ms? Good. Sub 200ms? Great. Sub 100ms? Amazing.

I had/have a problem with any prompt taking 1s+ to load, as it begins to actually impact productivity. This problem tends to rear its ugly head when using low-specc'ed devices, but not with modern machines.

romkatv commented 2 years ago

My thesis is that everyone prefers software where user actions have instantaneous -- or rather indistinguishable from instantaneous -- effect over software with detectable lag, everything else being equal.

The nice thing about this thesis is that it's very likely true and it doesn't require introducing subjective numbers and grades (500ms -- good, 200ms -- great, etc.).

Eriner commented 2 years ago

My thesis is that everyone prefers software where user actions have instantaneous

Absolutely. "don't make me wait" is something I think everyone agrees with :)

would you find this less useful than the data point from Rpi?

Yes.

Now we know.

What do we know? I was just theorizing out loud - actually building a system like that is completely out of scope for this discussion. Zim doesn't do this.

Which symptoms do you mean?

Latency, however it is measured in this benchmark.

Under the benchmark conditions zsh4humans reads more files than zim.

Really? Out of curiosity, what are the numbers? (I still haven't cloned this repo or run any benchmarks yet, again just catching up in this thread). I haven't had time to examine methodology or anything (yet).

romkatv commented 2 years ago

Now we know.

What do we know? I was just theorizing out loud - actually building a system like that is completely out of scope for this discussion. Zim doesn't do this.

Thanks for correcting my mistake. I thought zim did that but now I realize that I confused it with zpm and I don't have a config for it in zsh-bench.

Which symptoms do you mean?

Latency, however it is measured in this benchmark.

Zsh with all configs has latency. I don't think I understand your point.

Under the benchmark conditions zsh4humans reads more files than zim.

Really? Out of curiosity, what are the numbers? (I still haven't cloned this repo or run any benchmarks yet, again just catching up in this thread). I haven't had time to examine methodology or anything (yet).

You are commenting on an issue in zsh-bench. Might as well read the homepage.

Eriner commented 2 years ago

You are commenting on an issue in zsh-bench. Might as well read the homepage.

Ha, I've read the homepage and this issue (and skimmed the scripts, but only skimmed). I just didn't see any info about differences in the number of file accesses during benchmarking of each set of scripts. The number of file accesses will also (probably?) be different from just the number of files in the repo, and will also change between login vs non-login shells, zcompile vs no zcompile, autoloads, etc.

This is primarily to satisfy my own personal curiosity; I'm not positive there is a performance correlation. Just curious is all.

Zsh with all configs has latency. I don't think I understand your point.

Let me rephrase:

If random disk IO is the problem, having a large number of file accesses will exacerbate it. Measuring performance on a Raspberry Pi with a common microSD will give a good measurement of a "worst case scenario". If there is indeed a correlation between file accesses and performance, then it stands to reason that frameworks with a larger number of file accesses may be much less performant than those with fewer file accesses.

This may reveal optimization strategies that can be used to mitigate a common underlying problem: slow random disk read IO.

romkatv commented 2 years ago

I just didn't see any info about differences in the number of file accesses during benchmarking of each set of scripts.

zsh-bench measures only things users care about. That is, something that can be observed. It's a benchmark, not a profiler.

This is primarily to satisfy my own personal curiosity; I'm not positive there is a performance correlation. Just curious is all.

Shouldn't be too hard to find out.

romkatv commented 2 years ago

If random disk IO is the problem, having a large number of file accesses will exacerbate it. Measuring performance on a Raspberry Pi with a common microSD will give a good measurement of a "worst case scenario".

I posted the benchmark results from rpi4 in this thread together with my interpretation. Basically, all latencies are higher by a similar factor and this factor is small. So it stands to reason that we don't gain anything from benchmarking on the pi, and we might as well benchmark on hardware that is actually popular so that we can optimize for the larger number of users.

Did you interpret the benchmark results differently?

Eriner commented 2 years ago

Shouldn't be too hard to find out.

Eh, I think it's going to be tougher than one may think. It may be easier to sniff syscalls than to instrument zsh itself.
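
For example, on Linux one could count the files zsh opens during startup with strace (a sketch; assumes strace is installed and that the libc opens files via openat):

strace -f -e trace=openat -o /tmp/zsh-open.log zsh -lic exit
grep -c openat /tmp/zsh-open.log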

Did you interpret the benchmark results differently?

Yes. It is a reproducible environment. It is also a "worst case scenario"; any benefits measured on an RPi will carry over to any other machine.

Again, I suspect disk IO is the problem, so swapping various microSDs (all else equal) may tell more of a story. I'll get around to running the benchmarks here, but I'll have to review the code first. Pretty sure I have microsds of varying quality laying around...

Eriner commented 2 years ago

Run on a random dedicated KVM slice I lease:

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=93.866
first_command_lag_ms=117.745
command_lag_ms=48.662
input_lag_ms=17.192
exit_time_ms=29.266

==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=20.394
first_command_lag_ms=80.346
command_lag_ms=3.368
input_lag_ms=6.290
exit_time_ms=7.009

As I have stated earlier, I suspect random reads from a slow-performing disk are a major contributing factor to the slowness, so here are some tests that drop disk caches at varying intervals. Note that these aren't a great simulation of slow IO, just the easiest method I could come up with (without swapping physical disks).

while true; do echo 1 > /proc/sys/vm/drop_caches && sleep $_TIME; done

_TIME=3 (cache drop too slow to run for all invocations):

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=96.179 (+~3ms, insignificant)
first_command_lag_ms=124.296 (+~8ms, insignificant)
command_lag_ms=52.243 (+~2ms, insignificant)
input_lag_ms=16.392
exit_time_ms=65.989

_TIME=1 (cache drop still too slow, but getting there...):

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=150.638 (+~50ms)
first_command_lag_ms=177.658 (+~50ms)
command_lag_ms=54.173 (+~4ms, insignificant)
input_lag_ms=18.375
exit_time_ms=65.728

_TIME=.1 (should drop between most iterations):

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=164.780 (+~65ms)
first_command_lag_ms=202.402 (+~80ms)
command_lag_ms=66.711 (+~12ms, still not impactful, but statistically significant)
input_lag_ms=17.595
exit_time_ms=62.218

_TIME=0.05 (should be guaranteed to drop between iterations):

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=186.202 (+~85ms)
first_command_lag_ms=221.242 (+~100ms)
command_lag_ms=79.953 (+~25ms)
input_lag_ms=17.478
exit_time_ms=72.503

And running again, stopping the cache dropping just to validate the original results:

creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=99.652
first_command_lag_ms=125.540
command_lag_ms=51.603
input_lag_ms=16.540
exit_time_ms=31.163

I understand that this is a poor excuse for a reproducible test; it's just what I could do quickly.

Based on these results, I stand by my supposition that random disk read speed (and number of files read) play a significant role in measured speeds. I'll investigate more, but if this holds true then reducing the number of sourced files may offer a significant speed increase.

Eriner commented 2 years ago

And more to the ticket at hand, your results do seem inconsistent.

This is from your raw benchmark data:

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=125.768 (+30ms, +30%)
first_command_lag_ms=142.708 (+25ms, +17%)
command_lag_ms=97.411 (+50ms, +100%)
input_lag_ms=12.169
exit_time_ms=27.095
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=10.149 (-10ms, -50%)
first_command_lag_ms=51.369 (-30ms, -40%)
command_lag_ms=2.392
input_lag_ms=4.758
exit_time_ms=6.017

Where the differentials are from a random KVM slice I ran for the test above.

Keep in mind that these thresholds may have different values for different people, machines, terminals, etc. I believe the ballpark should be the same though.

That doesn't seem to be the case. It seems like there can be quite a lot of variation, and using a multiplier based on a subjective scale really takes the context out of these measurements.

romkatv commented 2 years ago

And more to the ticket at hand, your results do seem inconsistent.

Just to clarify, by inconsistent you mean that benchmark results from your VM look different from the results I've got on my desktop, right?

If so, it's because of https://github.com/romkatv/zsh-bench/commit/223b536272633ba2e144e577436aae581bfbc4b2. Basically, if you create a bunch of files and immediately add them to the git index, zim's prompt will be slow. But if you wait a bit between creating the files and adding them, zim's prompt will be fast. I added that sleep so that zim is benchmarked under more favorable conditions (even though in practice files are indeed often created and immediately added to the index; e.g., this happens when you clone a repo).

If you want to reproduce this on your machine, run this command and notice that prompt lags:

() {
  emulate -L zsh -o err_return -o pipe_fail
  local tmp
  tmp=$(mktemp -d)
  cd -- $tmp
  print -rC1 -- {1..10}/{1..10}/{1..10}         | xargs mkdir -p --
  git init
  print -rC1 -- {1..10}/{1..10}/{1..10}/{1..10} | xargs touch --
  (( sleep_before_add )) && sleep 1
  git add .
  git commit --quiet --allow-empty --allow-empty-message --no-gpg-sign -m ''
}

It doesn't always lag because it's inherently racy. Run the command a few times and you'll catch lag at some point.

If you now set sleep_before_add=1 and repeat the same command, prompt won't lag as much.

On a related note, if you run touch 1/1/1/1 (this file exists), zim's prompt will erroneously display [!]. If you run git status, [!] will disappear. This is a bug in zim as far as I can tell.

I haven't updated the benchmark results after adding that sleep 1. I'll probably do that today.

Edit: I've published new benchmark results.

That doesn't seem to be the case. It seems like there can be quite a lot of variation.

Can you share the reasoning behind this statement? It's possible but I wonder why you think it's true.

romkatv commented 2 years ago

I added --drop-caches <1|2|3> to zsh-bench. With --drop-caches N the following code is run before each iteration:

sync
>/proc/sys/vm/drop_caches <<<N
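# note: writing to /proc/sys/vm/drop_caches requires root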

Here are the results from my machine:

| config | drop caches | first prompt lag | first cmd lag | cmd lag | input lag |
|---|---|---|---|---|---|
| zim | no | 124% 🟠 | 53% 🟡 | 330% 🔴 | 62% 🟡 |
| zim | 1 | 164% 🟠 | 65% 🟡 | 334% 🔴 | 62% 🟡 |
| zim | 2 | 170% 🟠 | 67% 🟡 | 329% 🔴 | 62% 🟡 |
| zim | 3 | 219% 🔴 | 84% 🟡 | 330% 🔴 | 62% 🟡 |
| zsh4humans | no | 20% 🟢 | 34% 🟢 | 24% 🟢 | 24% 🟢 |
| zsh4humans | 1 | 52% 🟡 | 48% 🟢 | 24% 🟢 | 23% 🟢 |
| zsh4humans | 2 | 53% 🟡 | 47% 🟢 | 25% 🟢 | 24% 🟢 |
| zsh4humans | 3 | 83% 🟡 | 61% 🟡 | 24% 🟢 | 24% 🟢 |

The effect of --drop-caches 3 is noticeable but not dramatic. Zim's first prompt lag goes from orange to red (a relative increase under 100%) and zsh4humans' two startup latencies go from green to yellow.

Based on these results I predict that reducing the number of files read by zim on startup will decrease cold startup latency by less than 50% and will have negligible effect on hot startup.

romkatv commented 2 years ago

On a related note, if you run touch 1/1/1/1 (this file exists), zim's prompt will erroneously display [!]. If you run git status, [!] will disappear. This is a bug in zim as far as I can tell.

Opened https://github.com/zimfw/zimfw/issues/441.

Eriner commented 2 years ago

Based on these results I predict that reducing the number of files read by zim on startup will decrease cold startup latency by less than 50% and will have negligible effect on hot startup.

50% is huge. Also, you have an NVMe! It's going to be fast no matter what! ;D

It'll rear its ugly head more with lower-classed IO devices (HDD, microSD, etc).

It seems like there can be quite a lot of variation.

I am once again stating that the device configuration plays a heavy role in the outcome, and that "the ballpark should be the same" isn't an accurate statement, as demonstrated by all of the results above. It is clear that the performance of each varies based on hardware performance to a significant degree. This is, again, why I think benchmarking on an RPI is good - if the shell scripts run with adequate speed on an RPI, they'll run with adequate speed just about anywhere.

Based on the few comments above, I wonder if the synchronous calls to git bear a large part of the responsibility for the latency. 1k directories and 10k files is a shitload -- I don't think I have many (if any) repos on my disk that are that size. The fact that adding a sleep 1 improves performance is awfully suspect.

romkatv commented 2 years ago

The quote "the ballpark should be the same" is from https://github.com/romkatv/zsh-bench#how-fast-is-fast. It's talking about human perception of latency, which is independent of hardware. E.g., I can distinguish between 0ms and 10ms command lag but cannot distinguish between 0ms and <10ms. So for me the perception threshold of command lag is 10ms. For you this value might be different but "the ballpark should be the same".

I am once again stating that the device configuration plays a heavy role in the outcome. It is clear that the performance of each varies based on hardware performance to a significant degree.

Obviously.

if the shell scripts run with adequate speed on an RPI, they'll run with adequate speed just about anywhere.

Obviously.

Do I understand correctly that in your opinion it follows that it's a good use of one's time to benchmark on Rpi if one's goal is to make zim fast on popular hardware?

I think this does not follow. If one wanted to make zim fast on popular hardware, it would be more efficient to benchmark zim on popular hardware. This way optimizations would transfer to the goal better. It's not my time that's at stake though (I've no interest in optimizing zim), and I believe neither it is yours.

The fact that adding a sleep 1 improves performance is awfully suspect.

It really isn't. Diffing the git index against the workdir is faster if the index is newer than the files: when a file's mtime is not older than the index timestamp, git cannot trust the cached stat data and must re-read the file contents (the "racy git" problem).

romkatv commented 2 years ago

Also, you have an NVMe! It's going to be fast no matter what! ;D

If by "it" you mean zim then it's a matter of what you call fast. As you can see in the benchmark results, zim has two latencies above 100% on my machine in a "hot" benchmark -- first prompt lag and command lag. This means I can perceive lag when using zim. You can still call it fast but not in an objective way. "Fast enough for me" would be a more precise characterization.

It's different with input lag. Since zim is below 100% on this metric, I cannot differentiate it from zero input lag. Here zim is objectively fast, no asterisks required.

Eriner commented 2 years ago

The percentages (the scale) reflect your subjective timing, not mine. "Fast" is your own subjective opinion, just as mine is mine.

Do I understand correctly that in your opinion it follows that it's a good use of one's time to benchmark on Rpi if one's goal is to make zim fast on popular hardware?

If one wanted to make zim fast on popular hardware, it would be more efficient to benchmark zim on popular hardware.

Once again, yes. If you run these benchmarks and they're adequately performant on a slow, slow rpi, they will be adequately performant anywhere. There is zero reason to benchmark on fast machines if you're benchmarking on slow machines - fast machines will always be faster than slow machines. If there is a problem with speed, it will manifest on an rpi or slower hardware whereas it may not on high-spec machines manufactured in the year 2020.

Eriner commented 2 years ago

by "it" I meant disk IO. Specifically, dropping disk caches won't mean as much when testing with an NVMe; the disk IO is so fast that the caches (almost) don't matter. Comparing these results while running on a system with an microSD or HDD will yield vastly different results, I am nearly certain.

Eriner commented 2 years ago

the ballpark should be the same

I misunderstood this - I thought this was in reference to the results of the benchmark, not the subjective threshold values you have selected as the scale. This comment makes more sense in this context.

romkatv commented 2 years ago

The percentages (the scale) reflect your subjective timing, not mine. "Fast" is your own subjective opinion, just as mine is mine.

OK, let's use "indistinguishable from instant" instead of "objectively fast". Even though different people have different perception, these values can still be measured objectively for each person without resorting to judgement. It's easy to check -- using a formal study protocol -- whether you can distinguish between the current prompt lag of zim and zero prompt lag. I'm sure you can by the way.

As I mentioned earlier, I really like "indistinguishable from instant" as the goal of optimization because it doesn't require judgement and because all users prefer software that's indistinguishable from instant (all else being equal). Zim may be fast enough for you but if you made it indistinguishable from instant users would choose the new version over old.

Once again, yes. If you run these benchmarks and they're adequately performant on a slow, slow rpi, they will be adequately performant anywhere.

I wonder why you repeated this statement despite my replying "obviously" to it the first time. This statement is tautologically true. Did you think I might think it false?

There is zero reason to benchmark on fast machines if you're benchmarking on slow machines

Not true. Let's say you want to make zim indistinguishable from instant on a representative dev machine -- say, a laptop or a desktop less than a decade old. A rather useful goal considering the fraction of zim users with such machines. You run a benchmark on Rpi and optimize something to reduce latency by 50%. Are you done? No. You still don't know whether zim is indistinguishable from instant on a representative dev machine. So you have to run a benchmark on the machine that you actually care about. You might find that your optimizations had little effect because Rpi is very different from a laptop or a desktop. One might wonder why you benchmarked on Rpi in the first place.

fast machines will always be faster than slow machines.

When I see so many obviously true statements, I start to suspect that our communication isn't working.

Comparing these results while running on a system with a microSD or HDD will yield vastly different results, I am nearly certain.

Maybe. Maybe not. If I cared about the performance of zsh on a cacheless Rpi, I would run this benchmark.

the subjective threshold values you have selected as the scale.

LOL

Also, you have an NVMe! It's going to be fast no matter what! ;D [...] by "it" I meant disk IO

What do you mean by "fast"? Faster than microSD? Of course. Fast enough to make zim indistinguishable from instant? No. Hot or cold, my machine is not fast enough for that.

Proposal: Let's stop this conversation.

Eriner commented 2 years ago

One might wonder why you benchmarked on Rpi in the first place. [...] Proposal: Let's stop this conversation.

I think you're misunderstanding my underlying statement here. I know that you know that slow machines are slower than fast machines, lol. I use quite a few rpis in my homelab and all are using zim. I notice some latency, but it isn't what I would consider to be "slow" or "problematic". My point in benchmarking these devices is that I use them, and making zim (perceptibly) fast on these devices was one of my original drivers for the project.

You still don't know whether zim is indistinguishable from instant on a representative dev machine.

I can't think of a single modification that could be made that would improve speed on an rpi that wouldn't improve performance on more performant hardware, thus I believe measuring the "lowest common denominator" of at least my personal usage is reasonable and correct. I'm not suggesting that this be benchmarked on ESP8266 microcontrollers or anything crazy here. It seems that you're concerned about performance that is representative of a "dev machine", which happens to have an NVMe drive. I am concerned about performance on less modern hardware.

Rpis are also much more 'reproducible' hardware than whatever hardware is in your personal dev machine.

If you still don't see the rationale in benchmarking on an rpi, feel free to ignore this comment.

romkatv commented 2 years ago

I think you're misunderstanding my underlying statement here. I know that you know that slow machines are slower than fast machines, lol. I use quite a few rpis in my homelab and all are using zim. I notice some latency, but it isn't what I would consider to be "slow" or "problematic". My point in benchmarking these devices is that I use them, and making zim (perceptibly) fast on these devices was one of my original drivers for the project. [...] If you still don't see the rationale in benchmarking on an rpi, feel free to ignore this comment.

We are in agreement. You should benchmark on the machines on which you want to improve performance. Since your goal is to make your code fast on Rpi, it's logical to benchmark on Rpi.

I can't think of a single modification that could be made that would improve speed on an rpi that wouldn't improve performance on more performant hardware

I can. Rpi has different relative costs of sequential I/O, random-access I/O and execution of instructions, so an optimization that reduces one and increases another can reduce latency on Rpi while increasing it on a desktop machine. For example, caching precomputed data on disk trades CPU for I/O: a win where I/O is relatively cheap and a loss where it's expensive. Only optimizations that strictly reduce something without increasing anything else have the property you've described, and then the implication goes both ways.

It seems that you're concerned about performance that is representative of a "dev machine", which happens to have an NVMe drive.

I want my code to be fast on the machines my users run. I invest my time optimizing code proportionally to the impact it'll have. If 1% of my users are on Rpi, I'll invest 1% of my time optimizing for it. zsh4humans on Rpi has latencies slightly above my perception threshold -- the highest is 148%. It's pretty good, so I don't feel inclined to spend my time optimizing for Rpi. For comparison, the highest latency of zim on my desktop machine is 330%.

Rpis are also much more 'reproducible' hardware than whatever hardware is in your personal dev machine.

Why does it matter? My goal is to make my software fast on the real hardware that my users have. That their hardware isn't readily available for me to benchmark on doesn't imply that I should benchmark on Rpi instead.

Proposal: Let's drop this discussion. I'm not gaining anything from it and I suspect neither are you.

Eriner commented 2 years ago

Good luck, and neat project. Cheers

romkatv commented 2 years ago

Thanks! Have a great day 👍

romkatv commented 2 years ago

I'm closing the issue as resolved. Summary:

@ericbn reported that, using a different methodology to measure first prompt lag, he obtained results that disagree with zsh-bench.

See https://github.com/romkatv/zsh-bench/issues/5#issuecomment-951094574 for the causes of this disagreement.

I've added a new section to the documentation to aid in debugging and validation of zsh-bench results.

If in the future you find that latencies reported by zsh-bench are incorrect, please open a new issue. Make sure to provide instructions that would allow me to reproduce your results.

romkatv commented 2 years ago

FYI: I ported zsh-bench to BSD and ran a couple of benchmarks on a MacBook Air (M1, 2020).

Methodology:

  1. Open Apple Terminal and create a test user.
    () {
    builtin emulate -L zsh -o err_return
    command sudo dscl . -create /Users/$1
    command sudo dscl . -create /Users/$1 UserShell /bin/zsh
    command sudo dscl . -create /Users/$1 RealName "Zsh Bench User"
    command sudo dscl . -create /Users/$1 UniqueID 1042
    command sudo dscl . -create /Users/$1 PrimaryGroupID 20
    command sudo dscl . -create /Users/$1 NFSHomeDirectory /Users/$1
    command sudo mkdir /Users/$1
    command sudo chown $1:staff /Users/$1
    } zsh-bench-user
  2. Switch to the test user.
    sudo -u zsh-bench-user -i zsh -f
  3. Clone zsh-bench.
    git clone https://github.com/romkatv/zsh-bench.git ~/zsh-bench
  4. Install a predefined zsh config.
    () {
    emulate -L zsh -o extended_glob -o err_return
    rm -rf -- '' ~/^zsh-bench(ND)
    ~/zsh-bench/configs/$1/setup
    } zim
  5. Benchmark interactive zsh.
    ~/zsh-bench/zsh-bench --iters 2
  6. Repeat the last two steps if necessary. Note that step 4 wipes the home directory.
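
For completeness, here's how the test user could be removed afterwards (a cleanup sketch, not part of the methodology above; it assumes the user was created exactly as in step 1):

() {
builtin emulate -L zsh -o err_return
command sudo dscl . -delete /Users/$1   # remove the directory-services record
command sudo rm -rf -- /Users/$1        # remove the home directory
} zsh-bench-user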

Raw results:

==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=60.195
first_command_lag_ms=71.779
command_lag_ms=38.418
input_lag_ms=17.777
exit_time_ms=20.824

==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=27.160
first_command_lag_ms=62.485
command_lag_ms=2.383
input_lag_ms=7.728
exit_time_ms=19.181

Normalized:

| config | tmux | compsys | syntax highlight | auto suggest | git prompt | first prompt lag | first cmd lag | cmd lag | input lag |
|-|:-:|:-:|:-:|:-:|:-:|-:|-:|-:|-:|
| zim | | ✔️ | ✔️ | ✔️ | ✔️ | 120% 🟠 | 48% 🟢 | 384% 🔴 | 89% 🟡 |
| zsh4humans | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 54% 🟡 | 42% 🟢 | 24% 🟢 | 39% 🟢 |

These results are surprisingly similar to what I've obtained on my dev machine running Linux. M1 is really fast! The only significant difference is first prompt lag in zsh4humans. It's 54% on the laptop compared to 20% on the desktop. I haven't looked into why that might be, but I've confirmed that zsh-bench reports the right number.

romkatv commented 2 years ago

@ericbn Hey, Eric. I saw that you've unarchived zimfw/zsh-framework-benchmark and made significant changes in https://github.com/zimfw/zsh-framework-benchmark/pull/9. This looks great! I also saw that you've implemented several optimizations in zim, and I can confirm that they have a user-visible effect.

This might be too early but I couldn't resist cross-validating benchmark results from your PR against zsh-bench. Hopefully you'll find this data useful.

I ran zimfw/zsh-framework-benchmark and zsh-bench on the same zsh configs with this command:

docker run -w /root -e TERM=xterm-256color -i --rm alpine sh -ues &>validation.log <<-\END
    apk add git curl zsh perl ncurses expect util-linux

    sed -i.bak -E '/^root:/ s|[^:]*$|/bin/zsh|' /etc/passwd
    rm /etc/passwd.bak
    export SHELL=/bin/zsh

    git clone https://github.com/romkatv/zsh-bench.git
    git clone -b time_to_prompt https://github.com/zimfw/zsh-framework-benchmark.git

    cd zsh-framework-benchmark
    rm frameworks/zinit-turbo.zsh frameworks/zplug.zsh

    git apply <<-\PATCH
        diff --git a/frameworks/zsh4humans.zsh b/frameworks/zsh4humans.zsh
        index 6b61795..c0fcf6b 100644
        --- a/frameworks/zsh4humans.zsh
        +++ b/frameworks/zsh4humans.zsh
        @@ -13,6 +13,9 @@ cp zsh4humans-${v}/{.zshenv,.zshrc} ./
         # install and load zimfw/git
         sed -i.bak -E 's|^(z4h install [^|]*)|\1 zimfw/git|' .zshrc
         sed -i.bak -E 's|^(z4h load [^|]*)|\1 $Z4H/zimfw/git|' .zshrc
        +sed -i.bak -E '1i\
        +zstyle :z4h: start-tmux no
        +' .zshrc
         rm -r v${v}.tar.gz zsh4humans-${v} .zshrc.bak
         # initialize zsh4humans
         HOME=${PWD} zsh -ic 'exit' </dev/null
    PATCH

    zsh -fc 'SHLVL=1 source ./run.zsh -k'

    zsh -fueco extended_glob '
        for cfg in vanilla ./results/^vanilla(/:t); do
            print -r -- "==> using $PWD/results/$cfg as home directory"
            HOME=$PWD/results/$cfg ../zsh-bench/zsh-bench --standalone --git no --iters 100
        done'
END

(Note that this command has tabs in it; they cannot be replaced with spaces.)

zsh-framework-benchmark was failing on the installation of zplug and on benchmarking of zinit-turbo, so I removed them. I didn't try to debug them.

zsh-framework-benchmark/frameworks/zsh4humans.zsh is the only config that starts tmux. This didn't use to matter when zsh-framework-benchmark was measuring zsh -lic exit, but it does now. I've applied a patch to disable auto-start of tmux. I'd appreciate it if you could apply this patch to zimfw/zsh-framework-benchmark. Edit: sent https://github.com/zimfw/zsh-framework-benchmark/pull/11.

zsh-bench is executed with --standalone, --git no and --iters 100 to match zsh-framework-benchmark. --standalone instructs zsh-bench to measure zsh startup while not having any other zsh running. --git no means to start zsh in a directory not under a git repo (there is also --git empty to start in a git repo with zero files). This basically models opening a terminal rather than opening/splitting a tab (the latter is what zsh-bench measures by default).

Here is the full output: validation.log

I've processed the raw output with this command:

() {
    emulate -L zsh -o err_return -o no_unset

    # Skip ahead to the CSV header printed by zsh-framework-benchmark.
    local line
    while true; do
        IFS= read -r line
        [[ $line == framework,mean,stddev,min,max ]] && break
    done

    # Collect framework names and the "min" column (in seconds).
    local -a frameworks zfb_results
    while true; do
        IFS=, read -rA line
        (( $#line == 5 )) || break
        frameworks+=($line[1])
        zfb_results+=($line[4])
    done

    # Collect the matching first_prompt_lag_ms values printed by zsh-bench.
    local -a zb_results
    while true; do
        IFS= read -r line
        [[ $line == first_prompt_lag_ms=* ]] || continue
        zb_results+=(${line#*=})
        (( $#zb_results == $#frameworks )) && break
    done

    # Print a markdown table; zsh-framework-benchmark results are converted
    # from seconds to milliseconds.
    local -i i
    print -r -- '| framework | zsh-framework-benchmark (ms) | zsh-bench (ms) | diff (ms) |'
    print -r -- '|-|-:|-:|-:|'
    for i in {1..$#frameworks}; do
        local fw=$frameworks[i]
        local -F x='1e3 * zfb_results[i]'
        local -F y='zb_results[i]'
        local -F d='y - x'
        printf '| %s | %.3f | %.3f | %+.3f |\n' $fw $x $y $d
    done
} <validation.log >validation.md

Here's the resulting table:

| framework | zsh-framework-benchmark (ms) | zsh-bench (ms) | diff (ms) |
|-|-:|-:|-:|
| vanilla | 1.866 | 1.082 | -0.784 |
| antibody-static | 46.745 | 46.048 | -0.697 |
| antigen | 46.556 | 46.278 | -0.278 |
| ohmyzsh | 59.881 | 78.899 | +19.018 |
| prezto | 68.663 | 67.396 | -1.267 |
| zcomet | 41.094 | 41.378 | +0.284 |
| zgen | 49.020 | 49.407 | +0.387 |
| zim | 31.898 | 32.185 | +0.287 |
| zinit-light | 156.822 | 162.334 | +5.512 |
| zsh4humans | 5.999 | 5.330 | -0.669 |

Latency reported by zsh-framework-benchmark matches first_prompt_lag from zsh-bench within 1ms. There are two exceptions where zsh-bench reports higher latency: zinit-light with +5.5ms (+3.5%) and ohmyzsh with +19ms (+32%). These deviations are large enough to require an explanation. I haven't looked at them yet.

P.S.

I would've opened an issue at zimfw/zsh-framework-benchmark but issues are disabled there.

ericbn commented 2 years ago

Hi @romkatv. Thanks for cross-checking zsh-framework-benchmark with zsh-bench!

As part of the patch for frameworks/zsh4humans.zsh, can we also install zsh-users/zsh-history-substring-search with it, so the framework configuration is closer to the others used in zsh-framework-benchmark?

romkatv commented 2 years ago

zsh-users/zsh-history-substring-search is enabled by default in zsh4humans.

romkatv commented 2 years ago

Latency reported by zsh-framework-benchmark matches first_prompt_lag from zsh-bench within 1ms. There are two exceptions where zsh-bench reports higher latency: zinit-light with +5.5ms (+3.5%) and ohmyzsh with +19ms (+32%). These deviations are large enough to require an explanation. I haven't looked at them yet.

@ericbn Both exceptions have disappeared with your fix: https://github.com/zimfw/zsh-framework-benchmark/pull/10

| framework | zsh-framework-benchmark (ms) | zsh-bench (ms) | diff (ms) |
|-|-:|-:|-:|
| ohmyzsh | 77.810 | 77.402 | -0.408 |
| zinit-light | 161.342 | 159.638 | -0.704 |

Now all zsh-framework-benchmark results match first_prompt_lag from zsh-bench --standalone --git no.

romkatv commented 2 years ago

Now all zsh-framework-benchmark results match first_prompt_lag from zsh-bench --standalone --git no.

@ericbn I spoke too soon. first_prompt_lag reported by zsh-bench appears to be about 0.5 ms lower. To figure out whether this effect is real, I created zsh-framework-benchmark/frameworks/no-rcs.zsh with the following content:

>>! $1/.zshenv <<\END
setopt no_rcs
PS1="%~%# "
END

With this config the difference in benchmark results is clear. The effect is real.

To decide which results are correct I created the same .zshenv in my home directory, started script -fqet$HOME/typescript.timing ~/typescript.out and ran the following command a few times:

clear && print st""art && exec /bin/zsh -l

I then looked at the typescript to see how long it takes between "start" and "~#" appearing on the screen. The numbers match what zsh-bench reports. In other words, zsh-framework-benchmark reports slightly higher startup latency than it should. Perhaps there is unaccounted overhead somewhere?
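
(The st""art trick ensures that the echoed command line itself never contains the literal string "start", so the marker first appears in the output of print.) For reference, here's a rough sketch of how the elapsed time between the two markers could be extracted from the timing file (my own illustration, not part of zsh-bench; it assumes the classic util-linux timing format of one "<delay> <byte-count>" pair per line, a GNU-compatible grep, and that both markers occur in the output):

() {
  emulate -L zsh -o err_return
  local out=$1 timing=$2 from=$3 to=$4
  # Byte offset of the first occurrence of each marker in the raw output.
  local -i from_off=$(command grep -a -b -o -m1 -e $from -- $out | cut -d: -f1)
  local -i to_off=$(command grep -a -b -o -m1 -e $to -- $out | cut -d: -f1)
  # Walk the timing file, accumulating elapsed time and output position.
  local -F t=0 t_from=-1 t_to=-1 delay
  local -i pos=0 bytes
  while read -r delay bytes; do
    (( t += delay, pos += bytes ))
    if (( t_from < 0 && pos > from_off )); then t_from=$t; fi
    if (( t_to < 0 && pos > to_off )); then t_to=$t; break; fi
  done <$timing
  printf '%.1f ms between %s and %s\n' $(( 1e3 * (t_to - t_from) )) $from $to
} ~/typescript.out ~/typescript.timing start '~#'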

ericbn commented 2 years ago

To summarize what happened since this issue was opened:

Thanks to @romkatv's input and to his thoughts in zsh-bench, this is what we've improved in Zim:

2023-02-26 EDIT: We were still insisting on delaying the completion dumpfile check, but we're finally doing the same improved dumpfile check that zsh4humans uses during startup.

romkatv commented 2 years ago

Thanks for the update!