Closed by sharifhsn 1 year ago
As an interesting footnote, I looked over #66 when writing this for reference on profiling. With the new optimizations, `hexyl` is only about 1.5x slower than `xxd` on pure random data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd data` | 116.9 ± 2.7 | 112.9 | 122.8 | 1.00 |
| `target/release/hexyl data` | 183.8 ± 5.2 | 179.5 | 199.3 | 1.57 ± 0.06 |
but beats it by 2.5x on pure squeezed data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd data` | 113.5 ± 2.8 | 110.6 | 122.1 | 2.63 ± 0.13 |
| `target/release/hexyl data` | 43.2 ± 1.8 | 41.8 | 54.7 | 1.00 |
For `hexdump`, the opposite is true. `hexyl` is faster by about 5.5x on pure random data:
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hexdump -C data` | 1.031 ± 0.012 | 1.016 | 1.050 | 5.67 ± 0.09 |
| `target/release/hexyl data` | 0.182 ± 0.002 | 0.179 | 0.187 | 1.00 |
but is about 4.25x slower on pure squeezed data:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C data` | 10.2 ± 0.3 | 9.2 | 11.5 | 1.00 |
| `target/release/hexyl data` | 43.2 ± 1.4 | 41.8 | 48.4 | 4.25 ± 0.20 |
Awesome work and fantastic documentation - Thank you very much! It might be a few days before I get to a full review.
> but is about 4.25x slower on pure squeezed data:
I don't really know how `hexdump` does this. To be completely honest, I don't even know how we do it currently. But I could imagine that the fastest way would be to use something similarly optimized to `memchr` (Rust wrapper: https://docs.rs/memchr/latest/memchr/), but in reverse (search for anything but `\0`).
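For illustration, the "reverse `memchr`" search (find the first byte that is *not* `\0`) can be written with plain iterators. This is just a sketch of the idea, not SIMD-optimized the way `memchr` is internally:

```rust
/// Return the index of the first non-zero byte in `haystack`, i.e. the
/// "search for anything but \0" mentioned above. A naive linear scan;
/// memchr achieves the same with SIMD under the hood.
fn first_nonzero(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b != 0)
}
```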
I've added a few micro-optimizations, but there's a limit to how much more I can optimize.
The fundamental problem is that `hexyl` processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because `hexyl` doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.
Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it, but I think this could be another opportunity for optimization.
However, that would essentially constitute an entire rewrite of `Printer` and `Squeezer`. I'm down for the challenge, but I think the optimizations I've already written stand on their own. We should merge this PR, then I'll do some experimental rewriting to see if it's worth it.
Also, implementing a version of `memchr` would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. `memchr` is optimized for throughput, not latency. It was worth investigating, though, and I might use it in some other future program.
> Also, implementing a version of `memchr` would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. `memchr` is optimized for throughput, not latency. It was worth investigating, though, and I might use it in some other future program.
Hm, why would the haystack be limited to a single line? I was thinking of files with large blocks of zeros. Couldn't we "scan ahead" and only then print the required output? Something like this:

- If we discover a full line of zeros with our usual (byte-based/line-based) method, don't print anything yet.
- Start a fast lookahead search for the next nonzero byte to figure out the size of the zero block.
- Print the appropriate "squeezed lines" output.
- Continue normally after the zero block.
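The lookahead step could be as simple as measuring the zero run starting at the current position. A sketch, assuming the data is already in memory; `zero_run_len` is a hypothetical helper, not hexyl code:

```rust
/// Measure the size of the zero block starting at `pos`, i.e. the
/// "fast lookahead search for the next nonzero byte" described above.
/// The caller can then emit the squeezed-lines output for the whole
/// block at once and resume normal printing afterwards.
fn zero_run_len(data: &[u8], pos: usize) -> usize {
    data[pos..].iter().take_while(|&&b| b == 0).count()
}
```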
> The fundamental problem is that `hexyl` processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because `hexyl` doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.
>
> Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it, but I think this could be another opportunity for optimization.
Sounds great. To be completely honest with you, I'm not at all happy with the current architecture either. hexyl grew way too fast from a very simple tool to something more feature-complete, and I haven't paid enough attention to software architecture, mostly because I wasn't too interested in extending the project anymore. Many different contributors wanted to add new features, and I failed as a maintainer in terms of keeping a certain quality.
So I'd be perfectly fine with a major redesign if it also improves the structure of the code. If it's only for the sake of squeezing (heh) even more performance out, I'm not so sure. I am a big fan of fast tools, but I also believe that optimizing `hexyl` is more of an intellectual exercise. As I have mentioned elsewhere, it's already faster than most terminal emulators can handle... and I haven't really heard of any real-world use cases where `hexyl` is too slow.
I've completed the architectural rewrite, which can be tested for speed. On performance, it is slightly faster than `xxd` on pure random data, and 1.7x faster than `hexdump` on squeezed data. This rewrite significantly simplifies the code structure, making the entire squeezer module unnecessary. By processing one line at a time, squeezing can be done at the read step instead of after the fact.
It still has a major roadblock in terms of passing tests, because it currently doesn't print out the last line when it is incomplete. I'm working on fixing this.
The fix for the line ending led to a performance regression, so `hexyl` is no longer faster than `xxd` on pure random data, and the gap between it and `hexdump` has decreased.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `xxd randomdata` | 118.8 ± 1.9 | 115.8 | 122.0 | 1.00 |
| `target/release/hexyl randomdata` | 142.5 ± 3.0 | 139.0 | 151.7 | 1.20 ± 0.03 |
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C zerodata` | 990.0 ± 14.2 | 974.8 | 1018.3 | 1.16 ± 0.03 |
| `target/release/hexyl zerodata` | 849.8 ± 14.0 | 833.4 | 872.8 | 1.00 |
I fixed a few corner cases where it printed incorrectly.
I also added a big optimization when squeezing zero bytes, since that is by far the most common usage of squeezing. Instead of checking that every single byte in the line is the same, the line is divided into `usize`-sized chunks, and each chunk is checked to equal zero. Checking that a `usize` equals zero optimizes extremely well.
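The chunked zero check can be sketched as a simplified standalone function (`line_is_all_zero` is a hypothetical name, not hexyl's actual code):

```rust
/// Check whether a line is all zero bytes by comparing usize-sized
/// chunks against 0, with a byte-wise check for any remainder. The
/// whole-word comparison lets the compiler emit wide loads instead of
/// 16 separate byte compares.
fn line_is_all_zero(line: &[u8]) -> bool {
    const WORD: usize = std::mem::size_of::<usize>();
    let mut chunks = line.chunks_exact(WORD);
    let all_words_zero = chunks
        .by_ref()
        .all(|c| usize::from_ne_bytes(c.try_into().unwrap()) == 0);
    // chunks.remainder() covers any trailing bytes shorter than a word.
    all_words_zero && chunks.remainder().iter().all(|&b| b == 0)
}
```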
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `hexdump -C zerodata` | 997.0 ± 19.6 | 960.7 | 1030.9 | 2.88 ± 0.09 |
| `target/release/hexyl zerodata` | 346.0 ± 8.6 | 338.6 | 367.4 | 1.00 |
For what it's worth, the improvement of the squeezer over the original is extreme, and that is the only performance benchmark that really matters. As you've noted, the performance of printing out random data is limited by the terminal emulator. However, the squeezing of data is now faster than any hex viewer I could find in my system package repositories. For squeezing zero data, it is now 67 times faster.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hexyl zerodata` | 11.968 ± 0.070 | 11.883 | 12.072 | 67.33 ± 0.85 |
| `target/release/hexyl zerodata` | 0.178 ± 0.002 | 0.173 | 0.180 | 1.00 |
Thank you very much. That sounds absolutely fantastic!
I would like to see some (unit and/or integration) tests for the squeezing behavior before we merge this. I know those tests should already be there (see also my comment above). But it's better to introduce them now than never. Performance is great, but I would like to ensure that we're (still) doing everything correctly first. And the (new) code is complex enough to warrant some tests, I believe.
Now this does not have to be done by you, of course. We can also write those tests in another PR and then come back to this later.
Great idea! I've implemented three tests that should cover the squeeze behavior.
Unnecessary information beyond this point...
Interestingly, I ran into some unexplainable buggy behavior while creating the tests. When I call `read` on the reader, it should pull as many bytes as possible into the buffer. If the end of file is reached, it will pull fewer bytes into the buffer than the buffer's size. This can be checked for, which is the "leftover" case in the cleanup code. Curiously, when an odd number of panels are used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, then `read` will suddenly terminate, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the `hello_world_elf64` test file, so I was able to find the issue and solve it by applying a second read check to ensure that end of file has been reached. The question for me, then, is why this happens. I dug through the standard library code and it seems to just be using `libc`'s `read` function. Not sure if there's some bizarre optimization going on or some weird aspect of Linux file reading. Either way, the issue is fixed.
Are there any other blockers to merging?
Thank you so much for the updates. And for discovering and fixing that buggy behavior. Sorry for all the delays, but I'd like to understand this better before merging.
> When I call `read` on the reader, it should pull as many bytes as possible into the buffer.
According to the documentation, this is not necessarily the case:
> It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
> Curiously, when an odd number of panels are used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, then `read` will suddenly terminate, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the `hello_world_elf64` test file, so I was able to find the issue and solve it by applying a second read check to ensure that end of file has been reached.
Can I reproduce this on `master` somehow?
> According to the documentation, this is not necessarily the case:
Hmm, that's good to know. The workaround I made checks again to see if `read` returns `Ok(0)`, which is guaranteed to be EOF as per the documentation, so that checks out.
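The double-check can be folded into a retry loop. This is a sketch of the general pattern (similar to what `Read::read_exact` does internally), not the exact code in the PR:

```rust
use std::io::Read;

/// Fill `buf` as far as possible, retrying short reads. A return value
/// smaller than buf.len() then reliably means EOF, because we only stop
/// early once read() returns Ok(0), which the documentation guarantees
/// is end of file.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF
            Ok(n) => filled += n,
            // A read interrupted by a signal is not EOF; just retry.
            Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}
```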
> Can I reproduce this on `master` somehow?
It was a bug introduced by my architectural rewrite, so no. Apparently the former method of reading one byte at a time doesn't cause this issue, which makes sense.
Thank you very much for your patience and this great work. Not only does it make `hexyl` faster, but also better structured and tested :+1:.
More of a note to self: I have to admit that I did not fully review the code in all detail, something that I usually try to do. The reasons are twofold: (1) from the whole conversation, the concept, and the superficial code review, this seems like a high quality contribution. (2) Due to time restrictions, I need to prioritize my open source projects a bit. Over the years, `hexyl` has dropped down a bit on the list of projects I care about deeply.
I totally get it, compared to your other projects it's somewhat less popular. I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️
> I totally get it, compared to your other projects it's somewhat less popular.
To be clear, that's not the deciding factor :smile:.
> I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️
I'm glad you did!
I did some profiling of `hexyl` using `cargo flamegraph` and identified some key performance optimizations that could be made. The three methods that take up the most time are:

- `write_all` into the buffer and then `stdout`
- the `format!` call in `print_position_panel`
- the `write!` and `writeln!` macros in the `print_*` functions

The three commits in the pull request address these three issues.
The first replaces the manual buffers used by `hexyl` for reading and writing with `BufReader` and `BufWriter`, respectively. This required some major refactoring of code because of the way that `buffer_line` and `raw_line` were used. I feel that the refactoring was appropriate and made the code much easier to read, so it was worth it even without performance optimizations. With this addition, here is the benchmark:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
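In isolation, the buffered-writer change looks like this (a generic sketch, not hexyl's actual printer): `BufWriter` batches many small writes into fewer syscalls, which suits a hex dump's per-line output pattern well.

```rust
use std::io::{BufWriter, Write};

/// Write several small lines through a BufWriter so they reach the
/// underlying writer in large batches rather than one syscall each.
fn write_lines(out: impl Write) -> std::io::Result<()> {
    let mut writer = BufWriter::new(out);
    for i in 0..4 {
        writeln!(writer, "line {i}")?;
    }
    writer.flush() // push any remaining buffered bytes out
}
```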
The second commit replaces the `format!` call in `print_position_panel` with a workaround. Previously, the function would take the position, format it as hexadecimal, and then paint it if color was enabled. This was expensive! To avoid this, I used the same strategy as with `byte_hex_panel` and `byte_char_panel` and precomputed the values in the `new` function so they would only be created once, through `byte_hex_panel_g` and `byte_char_panel_g`. This adds a tiny amount of overhead to startup, but it's invisible in profiling. Here is the benchmark with both the first and second commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
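The precomputation strategy can be illustrated with a standalone sketch (the real fields like `byte_hex_panel_g` live in hexyl's `Printer`; the names below are simplified stand-ins): build all 256 two-digit hex strings once at construction, so printing a byte becomes a table lookup instead of a `format!` call.

```rust
/// Simplified stand-in for hexyl's Printer: precompute the hex string
/// for every possible byte value at construction time.
struct Printer {
    byte_hex_table: Vec<String>,
}

impl Printer {
    fn new() -> Self {
        Printer {
            // 256 entries, "00" through "ff", formatted exactly once.
            byte_hex_table: (0u16..=255).map(|b| format!("{:02x}", b)).collect(),
        }
    }

    /// Printing a byte is now a table lookup, no formatting machinery.
    fn hex(&self, byte: u8) -> &str {
        &self.byte_hex_table[byte as usize]
    }
}
```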
The last and most dramatic optimization was the elimination of the `write!` macro sprinkled throughout the three `print` methods that are repeated every line, replacing it with the more direct `write_all` method on `self.writer`. This wasn't actually too difficult, because none of those `write!` calls used any formatting machinery. I didn't remove every use of `write!`, because profiling shows that it wasn't important outside of the loop. Here is the benchmark with all of these commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
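The `write!`-to-`write_all` replacement, in isolation: when the output is a fixed byte string with no formatting, writing it directly skips the `fmt` machinery entirely (a generic sketch; the separator string is illustrative, not necessarily hexyl's).

```rust
use std::io::Write;

/// Write a fixed column separator without going through write!'s
/// formatting machinery.
fn print_separator<W: Write>(writer: &mut W) -> std::io::Result<()> {
    // before: write!(writer, "│ ")?  -- invokes core::fmt
    // after: a direct byte copy into the writer
    writer.write_all("│ ".as_bytes())
}
```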
And here is the final flamegraph: