Measure rendering performance in terminals

lilyball commented 5 years ago

We've been benchmarking the performance of the tool without considering the rendering performance of the terminal. Specifically, I'm thinking about how we turn colors on and off again for every single hex pair and textual character. Ideally we wouldn't turn colors off if the next printed hex/char uses the same color.

I'm not really sure how to programmatically measure the terminal performance (and of course performance would change for different terminals), but it's worth at least trying to measure. Optimizing our color usages would be more overhead on our side and therefore slow down our benchmark (though perhaps not significantly) but if it produces faster rendering it might be worth it.

At the very least, we could investigate not printing the style suffix for each character, under the assumption that the style prefix for the next character will suffice (and then just printing the suffix prior to printing a frame character).
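A minimal sketch of that idea, assuming made-up color classes and escape codes rather than hexyl's actual ones: the color prefix is written only when the color changes, and a single reset is written before the closing frame character. The names `Color`, `RESET` and `write_row` are hypothetical and exist only for illustration.

```rust
use std::io::{self, Write};

/// Hypothetical color classes for a byte; not hexyl's actual categories.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Color {
    Null,
    Printable,
    Other,
}

impl Color {
    fn of(byte: u8) -> Color {
        match byte {
            0 => Color::Null,
            0x20..=0x7e => Color::Printable,
            _ => Color::Other,
        }
    }

    /// SGR prefix selecting a foreground color (colors chosen arbitrarily).
    fn prefix(self) -> &'static str {
        match self {
            Color::Null => "\x1b[90m",      // bright black
            Color::Printable => "\x1b[36m", // cyan
            Color::Other => "\x1b[33m",     // yellow
        }
    }
}

/// SGR sequence that resets all attributes.
const RESET: &str = "\x1b[0m";

/// Writes one row of hex pairs between two frame characters.
/// A color prefix is emitted only when the color differs from the
/// previous byte's color; one reset is emitted before the closing frame.
fn write_row<W: Write>(out: &mut W, row: &[u8]) -> io::Result<()> {
    let mut current: Option<Color> = None;
    out.write_all("│".as_bytes())?;
    for &byte in row {
        let color = Color::of(byte);
        if current != Some(color) {
            out.write_all(color.prefix().as_bytes())?;
            current = Some(color);
        }
        write!(out, "{:02x} ", byte)?;
    }
    if current.is_some() {
        out.write_all(RESET.as_bytes())?;
    }
    out.write_all("│\n".as_bytes())?;
    Ok(())
}
```

For the three bytes `0x41 0x42 0x00` this writes three escape sequences (two prefixes and one reset) instead of the six that a prefix and suffix around every hex pair would need.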

kilobyte commented 5 years ago

For this task you may want my termrec — it has tools to record and replay a terminal stream with timing information. There's a library with two parts: for storing/rewinding/etc the raw or frame-by-frame stream, and libtty to keep the state of the terminal. You could instrument the latter to get performance data.

sharkdp commented 5 years ago

Another option is to use hyperfine's --show-output flag. It simply passes through all the output of the benchmarked commands instead of piping it to /dev/null.

Without a TTY:

| Command | Mean [ms] | Min…Max [ms] |
|:---|---:|---:|
| `hexyl $(which hexyl)` | 173.2 ± 5.4 | 166.9…186.1 |
| `hexdump -C $(which hexyl)` | 194.3 ± 6.4 | 188.0…215.4 |
| `xxd $(which hexyl)` | 74.0 ± 2.1 | 70.9…83.4 |

With alacritty:

| Command | Mean [ms] | Min…Max [ms] |
|:---|---:|---:|
| `hexyl $(which hexyl)` | 510.9 ± 16.3 | 493.5…544.9 |
| `hexdump -C $(which hexyl)` | 374.7 ± 25.0 | 347.7…431.4 |
| `xxd $(which hexyl)` | 227.5 ± 14.1 | 205.8…244.8 |

With terminator:

| Command | Mean [s] | Min…Max [s] |
|:---|---:|---:|
| `hexyl $(which hexyl)` | 1.730 ± 0.047 | 1.659…1.807 |
| `hexdump -C $(which hexyl)` | 0.632 ± 0.019 | 0.598…0.661 |
| `xxd $(which hexyl)` | 0.465 ± 0.024 | 0.427…0.502 |

We can observe:

- Yes, the terminal emulator rendering time is definitely significant.
- Both terminals take much more time to render hexyl's output (relatively speaking), as expected, due to the colors.
- alacritty is freaking fast :smile:

> Specifically, I'm thinking about how we turn colors on and off again for every single hex pair and textual character. Ideally we wouldn't turn colors off if the next printed hex/char uses the same color.

> Optimizing our color usages would be more overhead on our side and therefore slow down our benchmark (though perhaps not significantly) but if it produces faster rendering it might be worth it.

I wouldn't really bother doing this. I don't think that the current performance is problematic in any way. A more interesting benchmark could be to measure the execution time for rather small files. There might be a startup latency for hexyl due to its larger binary size, as is typical for Rust programs.

> At the very least, we could investigate not printing the style suffix for each character, under the assumption that the style prefix for the next character will suffice (and then just printing the suffix prior to printing a frame character).

Yes, we could probably do this to also save on the bandwidth.

sharkdp commented 5 years ago

Don't get me wrong. I love fast programs. I just don't think that hexyl really has a performance problem, especially after your recent PR which made it several times faster. I'm never going to output binary blobs of 1 MB or larger to the terminal. And if I am, I don't really care if it takes hexyl 500 ms to print the 60,000 lines of output to the terminal.

kilobyte commented 5 years ago

I on the other hand very often run hd|less on very large files (seeking to an interesting part, of course). With hexyl, this would be less -R goodness. So this request isn't completely without point.

sharkdp commented 5 years ago

Valid point, but the usage of a pager will help you with the rendering speed because only the current page has to be printed.

lilyball commented 5 years ago

What might be interesting to measure is something like hexyl | less -R and then immediately trying to view the final page.

sharkdp commented 5 years ago

> What might be interesting to measure is something like hexyl | less -R and then immediately trying to view the final page.

My hope would be that this would be pretty much the time that we get without a TTY.

hexyl $(which hexyl) | less -R followed by Shift+G is definitely much faster for me than waiting for hexyl $(which hexyl) to finish.

Is there anything more we want to do here or can this be closed?

lilyball commented 5 years ago

Personally, I am still interested in the performance when just printing directly to Terminal.app. I'd like to do some investigation of this on my own and see if there are some easy wins, so if you don't mind I want to keep the ticket open for at least a little while.

lilyball commented 5 years ago

In a quick test, removing the suffixes and inserting reset sequences before the frame chars results in an approximately 14% slowdown on the benchmark, but a 30% speedup when actually rendering to Terminal.app.

kitlith commented 5 years ago

> What might be interesting to measure is something like hexyl | less -R and then immediately trying to view the final page.

Wouldn't that be equivalent to something like hexyl | tail -n <height of terminal> if you don't want to take overhead from less into account?

lilyball commented 5 years ago

@kitlith tail -n 30 would give you the last 30 unwrapped lines of output. less performs wrapping. That said, less seems to be pretty smart about jumping to the end given how fast it can do it, so it's clearly not calculating wrap points for any non-displayed lines.

kilobyte commented 5 years ago

Well, less skips to the end then explicitly says "Calculating line numbers..." while you already see the final screen.

kitlith commented 5 years ago

Point is, less calculating line numbers or doing line-wrapping isn't the focus of this issue? It's the rendering performance. less (shift-G) is not as benchmarkable as just showing the last few lines of a dump with something like tail, or copying the output and displaying it to the screen directly.

lilyball commented 5 years ago

Rendering performance is mostly about how fast the terminal emulator state machine can process the escape codes and text. less isn't a great measure here because its line number calculation works on hard (newline-delimited) lines, and therefore just needs to scan for newlines rather than running the full state machine for all non-displayed lines. But given that piping to less is expected to be a common use case, it's possibly more important than the time it takes for the terminal to render the actual full output of hexyl.
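To illustrate the difference in work, here is a rough sketch comparing the two kinds of scan. It is not how less or any particular terminal emulator is implemented: `count_lines` only looks for newline bytes, while `count_csi_sequences` has to track state across bytes to recognize `ESC [ … final-byte` sequences. A real terminal does far more than this (cursor movement, attribute tracking, glyph rendering); both function names are made up for the example.

```rust
/// Counts hard line breaks; a plain scan with no state.
fn count_lines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

/// Parser states for a deliberately tiny escape-sequence recognizer.
#[derive(Clone, Copy)]
enum State {
    Ground,
    Escape,
    Csi,
}

/// Counts complete CSI sequences (ESC '[' parameters final-byte).
/// Every byte has to pass through the state machine.
fn count_csi_sequences(data: &[u8]) -> usize {
    let mut state = State::Ground;
    let mut count = 0;
    for &b in data {
        state = match (state, b) {
            (State::Ground, 0x1b) => State::Escape,
            (State::Ground, _) => State::Ground,
            (State::Escape, b'[') => State::Csi,
            (State::Escape, 0x1b) => State::Escape,
            (State::Escape, _) => State::Ground,
            // A byte in 0x40..=0x7e terminates a CSI sequence.
            (State::Csi, 0x40..=0x7e) => {
                count += 1;
                State::Ground
            }
            (State::Csi, _) => State::Csi,
        };
    }
    count
}
```

Fewer escape sequences in the output means fewer state transitions for the second kind of scan, which is the part the color coalescing discussed above would reduce.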

mqudsi commented 5 years ago

Is the question just about the trade-off between optimizing for one case at the cost of the other?

@lilyball I don't know how involved the changes you made were (in terms of LOC) but perhaps you can just gate them based off of whether or not hexyl is outputting directly to a tty?
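A sketch of what such gating could look like, assuming Rust 1.70 or later for `std::io::IsTerminal`. The `StyleStrategy` enum and both function names are hypothetical and not part of hexyl; the mapping from TTY to strategy simply follows the numbers reported above (coalescing was slower in the no-TTY benchmark but faster when rendering to Terminal.app).

```rust
use std::io::{self, IsTerminal};

/// Hypothetical switch between the two ways of emitting color codes.
#[derive(Debug, PartialEq, Eq)]
enum StyleStrategy {
    /// Prefix and suffix around every cell (the existing behavior).
    PerCell,
    /// Prefix only on color change, reset before frame characters.
    Coalesced,
}

/// Pure decision function, kept separate so it can be tested without a TTY.
fn choose_strategy(stdout_is_tty: bool) -> StyleStrategy {
    if stdout_is_tty {
        StyleStrategy::Coalesced
    } else {
        StyleStrategy::PerCell
    }
}

/// Picks the strategy based on whether stdout is attached to a terminal.
fn strategy_for_stdout() -> StyleStrategy {
    choose_strategy(io::stdout().is_terminal())
}
```

One caveat with this design: when the output is piped to less -R, stdout is not a TTY even though the bytes eventually reach a terminal, so a check like this would pick the per-cell path in that case.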

kilobyte commented 5 years ago

No, it's about adding some complexity to optimize a case that some dismiss as unimportant. More code = maintenance cost.

On the other hand, the performance cost of comparing a few variables is so small that I'd guess shaving some work from the printf-equivalent and sending less data through the pipe would already be a win, even before counting the rendering in the terminal.

sharkdp commented 5 years ago

I'm going to close this. If anybody feels that hexyl is (still) too slow when writing to a terminal, please let me know.

sharkdp / hexyl

Measure rendering performance in terminals #29