Closed by sharifhsn 1 year ago
As an interesting footnote, I looked over #66 when writing this for reference on profiling. With the new optimizations, `hexyl` is only about 1.5x slower than `xxd` on pure random data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd data` | 116.9 ± 2.7 | 112.9 | 122.8 | 1.00 |
| `target/release/hexyl data` | 183.8 ± 5.2 | 179.5 | 199.3 | 1.57 ± 0.06 |
but beats it by 2.5x on pure squeezed data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd data` | 113.5 ± 2.8 | 110.6 | 122.1 | 2.63 ± 0.13 |
| `target/release/hexyl data` | 43.2 ± 1.8 | 41.8 | 54.7 | 1.00 |
For `hexdump`, the opposite is true. `hexyl` is faster by about 5.5x on pure random data:
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hexdump -C data` | 1.031 ± 0.012 | 1.016 | 1.050 | 5.67 ± 0.09 |
| `target/release/hexyl data` | 0.182 ± 0.002 | 0.179 | 0.187 | 1.00 |
but is about 4.25x slower on pure squeezed data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C data` | 10.2 ± 0.3 | 9.2 | 11.5 | 1.00 |
| `target/release/hexyl data` | 43.2 ± 1.4 | 41.8 | 48.4 | 4.25 ± 0.20 |
Awesome work and fantastic documentation - Thank you very much! It might be a few days before I get to a full review.
> but is about 4.25x slower on pure squeezed data:
I don't really know how `hexdump` does this. To be completely honest, I don't even know how we do it currently. But I could imagine that the fastest way would be to use something similarly optimized to `memchr` (Rust wrapper: https://docs.rs/memchr/latest/memchr/), but in reverse (search for anything but `\0`).
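For illustration, the "reverse `memchr`" search (find the first byte that is *not* `\0`) can be written with plain iterators. This is just a sketch of the idea, not SIMD-optimized the way `memchr` is internally:

```rust
/// Return the index of the first non-zero byte in `haystack`, i.e. the
/// "search for anything but \0" mentioned above. A naive linear scan;
/// memchr achieves the same with SIMD under the hood.
fn first_nonzero(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b != 0)
}
```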
I've added a few micro-optimizations, but there's a limit to how much more I can optimize.
The fundamental problem is that `hexyl` processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because `hexyl` doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.
Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it, but I think this could be another opportunity for optimization.
However, that would essentially constitute an entire rewrite of `Printer` and `Squeezer`. I'm down for the challenge, but I think the optimizations I've already written stand on their own. We should merge this PR, then I'll do some experimental rewriting to see if it's worth it.
Also, implementing a version of `memchr` would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. `memchr` is optimized for throughput, not latency. It was worth investigating, though, and I might use it in some other future program.
> Also, implementing a version of `memchr` would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. `memchr` is optimized for throughput, not latency. It was worth investigating, though, and I might use it in some other future program.
Hm, why would the haystack be limited to a single line? I was thinking of files with large blocks of zeros. Couldn't we "scan ahead" and only then print the required output? Something like this:

- If we discover a full line of zeros with our usual (byte-based/line-based) method, don't print anything yet.
- Start a fast lookahead search for the next nonzero byte to figure out the size of the zero block.
- Print the appropriate "squeezed lines" output.
- Continue normally after the zero block.
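The lookahead step could be as simple as measuring the zero run starting at the current position. A sketch, assuming the data is already in memory; `zero_run_len` is a hypothetical helper, not hexyl code:

```rust
/// Measure the size of the zero block starting at `pos`, i.e. the
/// "fast lookahead search for the next nonzero byte" described above.
/// The caller can then emit the squeezed-lines output for the whole
/// block at once and resume normal printing afterwards.
fn zero_run_len(data: &[u8], pos: usize) -> usize {
    data[pos..].iter().take_while(|&&b| b == 0).count()
}
```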
> The fundamental problem is that `hexyl` processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because `hexyl` doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.
>
> Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it, but I think this could be another opportunity for optimization.
Sounds great. To be completely honest with you, I'm not at all happy with the current architecture either. hexyl grew way too fast from a very simple tool to something more feature-complete, and I haven't paid enough attention to software architecture, mostly because I wasn't too interested in extending the project anymore. Many different contributors wanted to add new features, and I failed as a maintainer in terms of keeping a certain quality.
So I'd be perfectly fine with a major redesign if it also improves the structure of the code. If it's only for the sake of squeezing (heh) even more performance out, I'm not so sure. I am a big fan of fast tools, but I also believe that optimizing `hexyl` is more of an intellectual exercise. As I have mentioned elsewhere, it's already faster than most terminal emulators can handle... and I haven't really heard of any real-world use cases where `hexyl` is too slow.
I've completed the architectural rewrite, which can be tested for speed. On performance, it is slightly faster than `xxd` on pure random data, and 1.7x faster than `hexdump` on squeezed data. This rewrite significantly simplifies the code structure, making the entire squeezer module unnecessary. By processing one line at a time, squeezing can be done at the read step instead of after the fact.
It still has a major roadblock in terms of passing tests, because it currently doesn't print out the last line when it is incomplete. I'm working on fixing this.
The fix for the line ending led to a performance regression, so `hexyl` is no longer faster than `xxd` on pure random data, and the gap between it and `hexdump` has decreased.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd randomdata` | 118.8 ± 1.9 | 115.8 | 122.0 | 1.00 |
| `target/release/hexyl randomdata` | 142.5 ± 3.0 | 139.0 | 151.7 | 1.20 ± 0.03 |
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C zerodata` | 990.0 ± 14.2 | 974.8 | 1018.3 | 1.16 ± 0.03 |
| `target/release/hexyl zerodata` | 849.8 ± 14.0 | 833.4 | 872.8 | 1.00 |
I fixed a few corner cases where it printed incorrectly.
I also added a big optimization when squeezing zero bytes, since that is by far the most common usage of squeezing. Instead of checking that every single byte in the line is the same, the line is divided into `usize`-sized chunks, and each chunk is checked to equal zero. Checking that a `usize` equals zero optimizes extremely well.
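The chunked zero check can be sketched as a simplified standalone function (`line_is_all_zero` is a hypothetical name, not hexyl's actual code):

```rust
/// Check whether a line is all zero bytes by comparing usize-sized
/// chunks against 0, with a byte-wise check for any remainder. The
/// whole-word comparison lets the compiler emit wide loads instead of
/// 16 separate byte compares.
fn line_is_all_zero(line: &[u8]) -> bool {
    const WORD: usize = std::mem::size_of::<usize>();
    let mut chunks = line.chunks_exact(WORD);
    let all_words_zero = chunks
        .by_ref()
        .all(|c| usize::from_ne_bytes(c.try_into().unwrap()) == 0);
    // chunks.remainder() covers any trailing bytes shorter than a word.
    all_words_zero && chunks.remainder().iter().all(|&b| b == 0)
}
```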
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C zerodata` | 997.0 ± 19.6 | 960.7 | 1030.9 | 2.88 ± 0.09 |
| `target/release/hexyl zerodata` | 346.0 ± 8.6 | 338.6 | 367.4 | 1.00 |
For what it's worth, the improvement of the squeezer over the original is extreme, and that is the only performance benchmark that really matters. As you've noted, the performance of printing out random data is limited by the terminal emulator. However, the squeezing of data is now faster than any hex viewer I could find in my system package repositories. For squeezing zero data, it is now 67 times faster.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hexyl zerodata` | 11.968 ± 0.070 | 11.883 | 12.072 | 67.33 ± 0.85 |
| `target/release/hexyl zerodata` | 0.178 ± 0.002 | 0.173 | 0.180 | 1.00 |
Thank you very much. That sounds absolutely fantastic!
I would like to see some (unit and/or integration) tests for the squeezing behavior before we merge this. I know those tests should already be there (see also my comment above). But it's better to introduce them now than never. Performance is great, but I would like to ensure that we're (still) doing everything correctly first. And the (new) code is complex enough to warrant some tests, I believe.
Now this does not have to be done by you, of course. We can also write those tests in another PR and then come back to this later.
Great idea! I've implemented three tests that should cover the squeeze behavior.
Unnecessary information beyond this point...
Interestingly, I ran into some unexplainable buggy behavior while creating the tests. When I call `read` on the reader, it should pull as many bytes as possible into the buffer. If the end of file is reached, it will pull fewer bytes into the buffer than the buffer's size. This can be checked for, which is the "leftover" case in the cleanup code. Curiously, when an odd number of panels are used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, then `read` will suddenly terminate, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the `hello_world_elf64` test file, so I was able to find the issue and solve it by applying a second read check to ensure that end of file has been reached. The question for me, then, is why this happens. I dug through the standard library code and it seems to just be using `libc`'s `read` function. Not sure if there's some bizarre optimization going on or some weird aspect of Linux file reading. Either way, the issue is fixed.
Are there any other blockers to merging?
Thank you so much for the updates. And for discovering and fixing that buggy behavior. Sorry for all the delays, but I'd like to understand this better before merging.
> When I call `read` on the reader, it should pull as many bytes as possible into the buffer.
According to the documentation, this is not necessarily the case:
> It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
> Curiously, when an odd number of panels are used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, then `read` will suddenly terminate, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the `hello_world_elf64` test file, so I was able to find the issue and solve it by applying a second read check to ensure that end of file has been reached.
Can I reproduce this on `master` somehow?
> According to the documentation, this is not necessarily the case:
Hmm, that's good to know. The workaround I made checks again to see if `read` returns `Ok(0)`, which is guaranteed to be EOF as per the documentation, so that checks out.
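The double-check can be folded into a retry loop. This is a sketch of the general pattern (similar to what `Read::read_exact` does internally), not the exact code in the PR:

```rust
use std::io::Read;

/// Fill `buf` as far as possible, retrying short reads. A return value
/// smaller than buf.len() then reliably means EOF, because we only stop
/// early once read() returns Ok(0), which the documentation guarantees
/// is end of file.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF
            Ok(n) => filled += n,
            // A read interrupted by a signal is not EOF; just retry.
            Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```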
> Can I reproduce this on `master` somehow?
It was a bug introduced by my architectural rewrite, so no. Apparently the former method of reading one byte at a time doesn't cause this issue, which makes sense.
Thank you very much for your patience and this great work. Not only does it make `hexyl` faster, but also better structured and tested :+1:.
More of a note to self: I have to admit that I did not fully review the code in all detail, something that I usually try to do. The reasons are twofold: (1) from the whole conversation, the concept, and the superficial code review, this seems like a high quality contribution. (2) Due to time restrictions, I need to prioritize my open source projects a bit. Over the years, `hexyl` has dropped down a bit on the list of projects I care about deeply.
I totally get it, compared to your other projects it's somewhat less popular. I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️
> I totally get it, compared to your other projects it's somewhat less popular.
To be clear, that's not the deciding factor :smile:.
> I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️
I'm glad you did!
I did some profiling of `hexyl` using `cargo flamegraph` and identified some key performance optimizations that could be made. The three methods that take up the most time are:

- `write_all` into the buffer and then `stdout`
- the `format!` call in `print_position_panel`
- the `write!` and `writeln!` macros in the `print_*` functions

The three commits in the pull request address these three issues.
The first replaces the manual buffers used by `hexyl` for reading and writing with `BufReader` and `BufWriter`, respectively. This required some major refactoring of code because of the way that `buffer_line` and `raw_line` were used. I feel that the refactoring was appropriate and made the code much easier to read, so it was worth it even without performance optimizations. With this addition, here is the benchmark:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
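In isolation, the buffered-writer change looks like this (a generic sketch, not hexyl's actual printer): `BufWriter` batches many small writes into fewer syscalls, which suits a hex dump's per-line output pattern well.

```rust
use std::io::{BufWriter, Write};

/// Write several small lines through a BufWriter so they reach the
/// underlying writer in large batches rather than one syscall each.
fn write_lines(out: impl Write) -> std::io::Result<()> {
    let mut writer = BufWriter::new(out);
    for i in 0..4 {
        writeln!(writer, "line {i}")?;
    }
    writer.flush() // push any remaining buffered bytes out
}
```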
The second commit replaces the `format!` call in `print_position_panel` with a workaround. Previously, the function would take the position, format it as hexadecimal, and then paint it if color was enabled. This was expensive! To avoid this, I used the same strategy as with `byte_hex_panel` and `byte_char_panel` and precomputed the values in the `new` function so they would only be created once, through `byte_hex_panel_g` and `byte_char_panel_g`. This adds a tiny amount of overhead to startup, but it's invisible in profiling. Here is the benchmark with both the first and second commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
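The precomputation strategy can be illustrated with a standalone sketch (the real fields like `byte_hex_panel_g` live in hexyl's `Printer`; the names below are simplified stand-ins): build all 256 two-digit hex strings once at construction, so printing a byte becomes a table lookup instead of a `format!` call.

```rust
/// Simplified stand-in for hexyl's Printer: precompute the hex string
/// for every possible byte value at construction time.
struct Printer {
    byte_hex_table: Vec<String>,
}

impl Printer {
    fn new() -> Self {
        Printer {
            // 256 entries, "00" through "ff", formatted exactly once.
            byte_hex_table: (0u16..=255).map(|b| format!("{:02x}", b)).collect(),
        }
    }

    /// Printing a byte is now a table lookup, no formatting machinery.
    fn hex(&self, byte: u8) -> &str {
        &self.byte_hex_table[byte as usize]
    }
}
```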
The last and most dramatic optimization was the elimination of the `write!` macro sprinkled throughout the three `print` methods that are repeated every line, replacing it with the more direct `write_all` method on `self.writer`. This wasn't actually too difficult, because none of those `write!` calls used any formatting machinery. I didn't remove every use of `write!`, because profiling shows that it wasn't important outside of the loop. Here is the benchmark with all of these commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
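The `write!`-to-`write_all` replacement, in isolation: when the output is a fixed byte string with no formatting, writing it directly skips the `fmt` machinery entirely (a generic sketch; the separator string is illustrative, not necessarily hexyl's).

```rust
use std::io::Write;

/// Write a fixed column separator without going through write!'s
/// formatting machinery.
fn print_separator<W: Write>(writer: &mut W) -> std::io::Result<()> {
    // before: write!(writer, "│ ")?  -- invokes core::fmt
    // after: a direct byte copy into the writer
    writer.write_all("│ ".as_bytes())
}
```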
And here is the final flamegraph: