sharkdp / hexyl

A command-line hex viewer
Apache License 2.0
8.92k stars 227 forks source link

Implement character table control and codepage 437 option #195

Closed sharifhsn closed 6 months ago

sharifhsn commented 1 year ago

Implements #194

Additional comments:

sharifhsn commented 1 year ago

Here's a visual image of the difference. image

sharkdp commented 1 year ago

Wow! That was fast :smile:

Thank you so much.

I've directly copy-pasted the codepage table from the repository in the blog post you linked. I'm not sure how to properly give credit for it.

I think I saw a MIT license in the repository earlier today? If so, I would just copy the license text verbatim and paste it into a comment preceding the table. And add a reference to the original source file + author, if not present in the license text. Maybe let's also describe the manual modifications we did in that comment.

sharkdp commented 1 year ago

Look how cool a .gif file looks with this feature enabled :smile:

image

delan commented 1 year ago

Nice! I’m glad this idea is catching on.

Note that ␀ is already a printable character, since it’s U+2400 SYMBOL FOR NULL rather than U+0000 NULL. In fact, the whole range U+2400 through U+241F are printable representations of the C0 controls U+0000 through U+001F, though I chose to use the cp437 dingbats for the others. ⋄ probably makes more sense for hexyl though :)

FFh is also visually indistinguishable from 20h in this table, so for the Rewind path interpreter, we used U+FB00 LATIN SMALL LIGATURE FF (ff) for FFh.

sharkdp commented 1 year ago

Nice! I’m glad this idea is catching on.

Thank you very much for commenting here and for the original idea! A while ago, I built binocle, which uses a similar idea in a graphical application (distinctly colored pixels for each byte), but I never thought about translating this idea to the character panel of hexyl.

I'm now thinking that it might also be cool to have a mode in hexyl where we have distinct colors for every byte (not just for different categories). We could then print █ (U+2588) for every byte in the character table, but colorize it differently using ANSI escape sequences. As for this PR, we could maybe already add a --character-table block (or similar) variant where we simply print █ for each byte. With the category-colorization, it would look like this for the GIF example above:

image

Note that ␀ is already a printable character, since it’s U+2400 SYMBOL FOR NULL rather than U+0000 NULL. In fact, the whole range U+2400 through U+241F are printable representations of the C0 controls U+0000 through U+001F, though I chose to use the cp437 dingbats for the others.

We use (some of) those U+2400 … U+241F symbols in bat --show-all. I like the idea. But at least with my terminal emulator / font, they are usually really hard to read. Even at large font sizes, I can barely make out the "NUL" characters. This is why we chose for hexyl. @sharifhsn but we should update the comment for consistency: "modified to use the '⋄' character instead of '␀'"

FFh is also visually indistinguishable from 20h in this table, so for the Rewind path interpreter, we used U+FB00 LATIN SMALL LIGATURE FF (ff) for FFh.

@sharifhsn I think we should integrate the same adaptation here, if you agree?

sharifhsn commented 1 year ago

@sharifhsn but we should update the comment for consistency: "modified to use the '⋄' character instead of '␀'"

Makes sense.

@sharifhsn I think we should integrate the same adaptation here, if you agree?

Ditto.

I'm now thinking that it might also be cool to have a mode in hexyl where we have distinct colors for every byte (not just for different categories). We could then print █ (U+2588) for every byte in the character table, but colorize it differently using ANSI escape sequences. As for this PR, we could maybe already add a --character-table block (or similar) variant where we simply print █ for each byte.

Great idea! I've implemented this using the xterm colors as there are 256 distinct colors, which maps perfectly to the number of distinct bytes. I'm open to a change in color scheme. Here's how it looks now:

image

sharkdp commented 1 year ago

This is so cool.

Especially when combined with --border none, --no-squeezing, potentially --group-size 8 and --panels 8 on a wide screen:

image

But I would appreciated if we could focus this PR on --character-table and finish that first. Add a few basic tests. Make sure that it plays nicely with other features like e.g. squeezing. For example, I noticed that the squeezing character gets replaced by a block when --character-table=block. And maybe we should quickly check if this doesn't introduce any performance regression for the default case.

We can then explore the colorization part in a separate PR.

sharifhsn commented 1 year ago

I added a test for code page 437, and I've fixed the issue with the asterisk, which was a leftover bug from my original implementation anyway 😅

I've also refactored the color components into a new module, since they're taking up a lot of visual space and have separated concerns anyway.

With regards to performance, preliminary benchmarks show that performance is within noise for the default case and for code page 437, and significantly slower with block coloring (due to the fact that the color changes every single byte).

Command Mean [s] Min [s] Max [s] Relative
target/release/hexyl --character-table codepage-437 data 2.246 ± 0.087 2.169 2.442 1.00
target/release/hexyl --character-table block data 4.412 ± 0.120 4.229 4.567 1.96 ± 0.09
target/release/hexyl data 2.490 ± 0.357 2.195 3.234 1.11 ± 0.16
hexyl data 2.274 ± 0.142 2.071 2.579 1.01 ± 0.07

With regards to colorization and the block table, it's all the same machinery. I would prefer to push this all in one PR unless there's some outstanding issue with the block table that can't be resolved quickly.

sharkdp commented 11 months ago

I've also refactored the color components into a new module, since they're taking up a lot of visual space and have separated concerns anyway.

Thank you for the updates. I would still prefer to move the colorization part to a new PR.

sharifhsn commented 6 months ago

@sharkdp The colorization of characters has been factored out. This PR should be ready to merge.