hpjansson / chafa

📺🗿 Terminal graphics for the 21st century.
https://hpjansson.org/chafa/

sixel support #27

Closed cdluminate closed 4 years ago

cdluminate commented 5 years ago

You mentioned sixel in the TODO. However, I didn't realize how well sixel performs in the terminal until I saw this: https://github.com/hackerb9/lsix . It's a simple bash script, and it might provide some inspiration?

hpjansson commented 5 years ago

Yes, that's useful. Thanks. Another interesting project is libsixel: https://github.com/saitoha/libsixel

Sixels aren't that hard to do -- the hard part is the design/API decisions that need to be made to make them fit as seamlessly as possible alongside character-cell-based graphics. There is some research to be done too, like how feasible it would be to mix sixels and character cells, e.g. what do the various terminals do with existing sixel graphics when you change the font size or overdraw them with text?

I'll need to look at it more carefully, but I think the best way to implement sixel support in Chafa may be to do it directly in the frontend and leave libchafa to deal with character-cell graphics only, since it's already built around a character-cell API. Maybe.

srlehn commented 5 years ago

Hi, I am currently implementing sixel support for a TUI https://github.com/gizak/termui/pull/233.

I first check the terminal's response to \033[0c (Primary Device Attributes) for a 4, which means the terminal supports sixel.

The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.
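
A minimal sketch of the TIOCGWINSZ query (standard POSIX-ish ioctl; reading back the \033[14t / \033[18t replies requires raw-mode input handling and is omitted here):

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main (void)
{
    struct winsize ws;

    if (ioctl (STDOUT_FILENO, TIOCGWINSZ, &ws) != 0)
        return 1;

    printf ("%u cols x %u rows\n", (unsigned) ws.ws_col, (unsigned) ws.ws_row);

    /* many terminals leave the pixel fields at zero */
    if (ws.ws_xpixel > 0 && ws.ws_ypixel > 0)
        printf ("cell size: %u x %u px\n",
                (unsigned) (ws.ws_xpixel / ws.ws_col),
                (unsigned) (ws.ws_ypixel / ws.ws_row));
    return 0;
}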

Escape codes can be passed through the terminal multiplexers tmux and screen - please check my PR on termui for that. Handling scrolling output for those is probably difficult.

hpjansson commented 5 years ago

Thanks, @srlehn. Chafa currently uses TIOCGWINSZ to get the size in character cells, and it works over ssh, in screen and in tmux, but not over a simple serial device. I don't think we can reliably get pixel dimensions that way, though (winsize.ws_xpixel and winsize.ws_ypixel come back as zero), so we'd have to use escape codes.

srlehn commented 5 years ago

Hi @hpjansson, thanks for letting me know that it works over ssh. Yes, you probably need to send the escape codes \033[14t and \033[18t wrapped (so they pass through to the outer terminal) to figure out the character box size, and the unwrapped \033[18t or TIOCGWINSZ to tmux/screen to get the correct cell size for the current pane/window of the multiplexer. I don't remember which works for tmux/screen.
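
A minimal sketch of the tmux passthrough wrapping -- the inner sequence goes inside a \033Ptmux; ... \033\\ DCS with every ESC doubled. Detecting tmux/screen and screen's slightly different wrapper are left out, and newer tmux additionally needs the allow-passthrough option enabled:

#include <stdio.h>

/* Wrap an escape sequence for tmux passthrough: \033Ptmux; ... \033\\
   with every inner ESC doubled. */
static void emit_wrapped (const char *seq, int in_tmux)
{
    if (!in_tmux) {
        fputs (seq, stdout);
        return;
    }
    fputs ("\033Ptmux;", stdout);
    for (const char *p = seq; *p; p++) {
        if (*p == '\033')
            putchar ('\033');  /* double the ESC */
        putchar (*p);
    }
    fputs ("\033\\", stdout);
}

int main (void)
{
    emit_wrapped ("\033[14t", 1);  /* query text-area size in pixels */
    emit_wrapped ("\033[18t", 1);  /* query size in character cells */
    fflush (stdout);
    return 0;
}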

hpjansson commented 5 years ago

I spoke too quickly. .ws_xpixel and .ws_ypixel properly reflect the terminal's pixel dimensions for mlterm, which supports sixels, but not for VTE (gnome-terminal), which does not. Experimentally this works over ssh and in screen too, modulo potential issues with resizing. So maybe we can just rely on that for simplicity.

cdluminate commented 5 years ago

Another reference implementation for sixel support: https://github.com/jart/hiptext (this tool is quite similar to chafa)

hpjansson commented 4 years ago

Sixel support has been committed to master. Use chafa -f sixel. It will be used automatically in mlterm.

I wrote it from scratch, and it's the fastest implementation I'm aware of by far. There is still some refactoring, optimization and general cleanup to do, but it's working now, so I'll close this issue.

cdluminate commented 4 years ago

Good job!

ghost commented 4 years ago

FYI, more terminals have been adding sixel support recently.

More terminals and widgets with sixel support are reported at: https://gitlab.com/klamonte/jexer/wikis/terminals .

I wrote it from scratch, and it's the fastest implementation I'm aware of by far. There is still some refactoring, optimization and general cleanup to do, but it's working now, so I'll close this issue.

I haven't tried it yet, but peeking at the C it sure looks like it would be fast. I had to do quite a bit to get my Java encoder fast enough for the real world. Some of the tricks I resorted to are outlined at: https://jexer.sourceforge.io/evolution.html#sixel .

hpjansson commented 4 years ago

@klamonte I see you did some work on speeding up palette mapping. That's where I focused my efforts too. After some experimentation I settled on an approach using principal component analysis and a hash table.

When palette colors are sorted by their first principal component, you can do a binary search followed by a neighborhood probe for the closest match. This yields colors that are identical to a linear 3-component euclidean search, but instead of 256*3=768 multiplications (for a 256-color palette), it gets by with <16 multiplications typically. The mapping is stored in a flat 16384-entry hash table, so if the exact same color is seen in the near future, it will be satisfied by an inlined O(1) lookup.
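
A minimal sketch of that lookup -- not Chafa's actual code (see chafa-color-table.c for the real thing); the hash function and layout here are illustrative:

#include <stdint.h>

#define PAL_SIZE  256
#define HASH_SIZE 16384

typedef struct { uint8_t r, g, b; } Color;

static Color    palette [PAL_SIZE];   /* sorted by pc1 projection */
static float    proj [PAL_SIZE];      /* cached projections, ascending */
static float    pc1 [3];              /* first principal component (unit vector) */
static uint32_t hash_key [HASH_SIZE]; /* memset to 0xff at init; 0xffffffff = empty */
static uint8_t  hash_val [HASH_SIZE];

static int sqdist (Color a, Color b)
{
    int dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

int palette_lookup (Color c)
{
    uint32_t key = (uint32_t) c.r << 16 | c.g << 8 | c.b;
    uint32_t slot = (key * 2654435761u) >> 18;  /* fold to 14 bits */

    if (hash_key [slot] == key)
        return hash_val [slot];  /* O(1) hit for a recently seen color */

    /* binary search for the insertion point of c's projection */
    float p = pc1 [0] * c.r + pc1 [1] * c.g + pc1 [2] * c.b;
    int lo = 0, hi = PAL_SIZE - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (proj [mid] < p) lo = mid + 1; else hi = mid;
    }

    /* probe outward from the insertion point; once the projection gap
       alone exceeds the best distance found, no farther entry can win
       (projection onto a unit vector never exceeds euclidean distance),
       so the result matches an exhaustive search */
    int best = lo, best_d = sqdist (c, palette [lo]);
    for (int i = lo + 1; i < PAL_SIZE; i++) {
        float dp = proj [i] - p;
        if (dp * dp >= (float) best_d) break;
        int d = sqdist (c, palette [i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    for (int i = lo - 1; i >= 0; i--) {
        float dp = p - proj [i];
        if (dp * dp >= (float) best_d) break;
        int d = sqdist (c, palette [i]);
        if (d < best_d) { best_d = d; best = i; }
    }

    hash_key [slot] = key;
    hash_val [slot] = (uint8_t) best;
    return best;
}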

The palette itself is just generated with sparse median cut. Nothing interesting there :) It could probably be sped up with parallel merge sort or maybe a fancier quantization algorithm.

The quantization/palette mapping code is spread across chafa-palette.c, chafa-color-table.c, chafa-color-hash.c, chafa-pca.c and the respective .h files.

Since the encoder has to do many passes per six-height row, it first converts the row into an efficient internal format, so the data needed for each column can be fetched from memory in one load instead of six. I also took pains to make the column-to-character converter branch-free and divided the row into multiple "banks" with a bloom filter so if a palette color does not appear in a run of say, 64 columns, those can be skipped over quickly.

See:

https://github.com/hpjansson/chafa/blob/f4305b7e1b76c68e88b65b1625503a90028cf4df/chafa/internal/chafa-sixel-canvas.c#L106-L191
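
For illustration, a minimal sketch of the row-packing and branch-free matching ideas described above (not the real code linked here -- the bank/bloom-filter machinery is omitted and all names are made up). Each column of a six-pixel-tall band becomes one 64-bit word of palette indices, and a SWAR byte-compare extracts the bit pattern for a given color without per-pixel branches:

#include <stdint.h>

#define LANES 0x0101010101010101ull
#define LOW7  0x7f7f7f7f7f7f7f7full

/* Pack each pixel column of a six-row band into one 64-bit word of
   palette indices (row y in byte y), so later passes fetch a column
   with one load instead of six. */
void pack_band (const uint8_t *indexed, int width, int row0, uint64_t *cols)
{
    for (int x = 0; x < width; x++) {
        uint64_t col = 0;
        for (int y = 0; y < 6; y++)
            col |= (uint64_t) indexed [(row0 + y) * width + x] << (y * 8);
        cols [x] = col;
    }
}

/* Exact SWAR zero-byte mask: the high bit of each byte of the result
   is set iff the corresponding byte of v is zero. */
static inline uint64_t zero_bytes (uint64_t v)
{
    return ~(((v & LOW7) + LOW7) | v | LOW7);
}

/* Branch-free sixel pattern for one palette color: bytes equal to
   pal_index become set bits 0..5 of the result. */
static inline unsigned sixel_bits (uint64_t col, uint8_t pal_index)
{
    uint64_t eq = zero_bytes (col ^ (LANES * pal_index));
    unsigned bits = 0;
    for (int y = 0; y < 6; y++)  /* fixed-count loop; unrolls cleanly */
        bits |= (unsigned) ((eq >> (y * 8 + 7)) & 1) << y;
    return bits;
}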

If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.

Feel free to e-mail me if you have any questions.

ghost commented 2 years ago

@hpjansson

I disappeared for a while, but obviously I am swinging back around. I got my encoder a bit better, inspired by notcurses' new octree encoder.

I wanted to point out @dankamongmen 's work here, as you both have very interesting optimizations for speed. When I find time to do better than my mix of directly mapping colors (when the palette is smaller than the number of registers) and boring median-cut, I want to play with both of your approaches. :-)

hpjansson commented 2 years ago

Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.

I'd like to do a side-by-side comparison of the existing sixel encoders at some point when I'm less swamped with work :)

dankamongmen commented 2 years ago

Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.

to be precise, i use an octree-inspired algorithm. my first attempt at using octrees directly gave less-than-awesome results. when wondering why, my key insight was that octree's advantages come from a power-of-two-based spectrum (0..255 per channel). sixel, of course, is not a 2^n space, but rather 101 values. i then moved to a "decatree" configuration of my own, and got far better results. this is only viable if you're converting colors from 256->101 at the beginning rather than at the end (which is what i do, since i'm not doing any of the averaging commonly employed by mincut algorithms, and thus needn't preserve original precision through the calculation).
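
(For reference, a sixel color register is defined with channel values as percentages 0..100 -- the #Pc;2;R;G;B introducer -- so the 8-bit conversion mentioned above is a scale-and-round over 101 levels. A minimal version:)

#include <stdio.h>

/* 8-bit channel (0..255) -> sixel percentage (0..100), rounding to the
   nearest of the 101 representable levels */
static inline int to_sixel_level (int v8)
{
    return (v8 * 100 + 127) / 255;
}

int main (void)
{
    /* e.g. define color register 1 as orange (255, 165, 0) */
    printf ("#1;2;%d;%d;%d\n",
            to_sixel_level (255), to_sixel_level (165), to_sixel_level (0));
    return 0;
}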

I'd like to do a side-by-side comparison of the existing sixel encoders at some point when I'm less swamped with work :)

https://github.com/dankamongmen/notcurses/discussions/1857 shows notcurses vs timg from last June. i've become much faster since then. you might want to take a look at quantanal.c from src/poc in a Notcurses checkout. it takes an arbitrary image file, decodes it to RGBA, encodes it to sixel, decodes that sixel (verifying correctness of the encoding) to RGBA, and compares the resulting RGBA to the original RGBA:

analyzing ../data/notcurses.png...
 Image too large, scaling to display
 source pixels: 320x1280 rendered: 528x880 1858560B
 control sequence: 418852 bytes (22.54%)
 00464640 pixels analyzed Δr 8680605 Δg 8884951 Δb 8094059
 130838px Δr 8680605 (66.3) Δg 8884951 (67.9) Δb 8094059 (61.9)
 avg diff per pixel: 196
done with ../data/notcurses.png in 149.454ms.
dankamongmen commented 2 years ago

If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.

whoa whoa whoa, isn't this kinda picking your benchmarks? if you go with a slow decoder, that ought to be represented in your timings. i use FFmpeg as my sole decoding backend, in part because ImageMagick was slower with no discernible improvement in image quality (indeed, in my testing, with no difference in image quality whatsoever). unless you're strictly comparing sixel implementations, of course, but an end user only cares about time-to-display from start time.

dankamongmen commented 2 years ago

btw, this is why quantanal above shows both time-to-run and average error. i was unwilling to include any time optimization that lowered output image quality.

dankamongmen commented 2 years ago

The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.

the ioctl works over remote ssh connections just fine. there's a specific SSH protocol data unit for resolution change notification.

ghost commented 2 years ago

The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.

the ioctl works over remote ssh connections just fine. there's a specific SSH protocol data unit for resolution change notification.

It's in the protocol but not all ssh implementations honor it or expose it to client code. I had to manually patch cryptlib for it once upon a time, and it sucked because there were no good SSH alternatives for Windows that worked with Borland C++ or Visual Studio 6. Today I would use mbedtls.

dankamongmen commented 2 years ago

I wrote it from scratch, and it's the fastest implementation I'm aware of by far.

shots fired! =]

so far as i can tell, chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal). i do note that ncplayer helpfully places the cursor at the end of the output, whereas chafa places it immediately below the invoking line (i.e. in the middle of the output).

....drum roll please...

[schwarzgerat](0) $ for i in `seq 1 5` ; do time chafa -f sixel ../data/notcurses.png  > /dev/null ; done

real    0m0.044s
user    0m0.025s
sys     0m0.098s

real    0m0.047s
user    0m0.062s
sys     0m0.063s

real    0m0.049s
user    0m0.046s
sys     0m0.084s

real    0m0.067s
user    0m0.070s
sys     0m0.178s

real    0m0.051s
user    0m0.129s
sys     0m0.053s
[schwarzgerat](0) $ for i in `seq 1 5` ; do time ./ncplayer ../data/notcurses.png -k -sscale > /dev/null ; done

real    0m0.047s
user    0m0.040s
sys     0m0.008s

real    0m0.049s
user    0m0.041s
sys     0m0.008s

real    0m0.051s
user    0m0.036s
sys     0m0.016s

real    0m0.053s
user    0m0.041s
sys     0m0.012s

real    0m0.055s
user    0m0.042s
sys     0m0.012s
[schwarzgerat](0) $ 

given that you have several user/sys times that exceed your real time, i'm guessing you've got a multithreaded solution? i do not yet employ any threading in my decodes, but am considering adding some. in any case, we've got chafa:

.044, .047, .049, .067, .051

and ncplayer:

.047, .049, .051, .053, .055

your min beats mine by .003, my max beats yours by .012, and my average beats yours by .0006, whew photo finish! i'd ascribe all these deltas to noise, but "significantly faster than all other implementations" no longer seems a valid claim IMHO =].


dankamongmen commented 2 years ago

given that you have several user/sys times that exceed your real time, i'm guessing you've got a multithreaded solution? i do not yet employ any threading in my decodes, but am considering adding some. in any case, we've got chafa:

if you're doing a blind thread-per-core implementation, know that i tested this on an AMD 3970X (32 physical/64 logical cores), and you'd probably have less overhead on a smaller machine. questions like this are why i haven't yet moved to a threaded solution; were i to do so, i'd likely have a thread-per-some-unit-area-that-is-a-multiple-of-page-size, capped by the number of logical cores, using a fractal spawnout to parallelize initial spawning overheads.

ghost commented 2 years ago

We have two incredible winners here!

I feel like I'm watching one of the great US Open matches with Andre Agassi vs Pete Sampras.

This is awesome. :)

dankamongmen commented 2 years ago

Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.

two comments here:

hpjansson commented 2 years ago

@dankamongmen The reason I specified GIF is that the decoder is simple and bypasses ImageMagick, so you get closer to benchmarking the sixel encoder and not some random image loader (e.g. libpng). ImageMagick also adds tons of overhead. I bet your comparison above is mostly benchmarking the loader. Of course, that doesn't mean Chafa is faster, just that the benchmark is inconclusive :)

Ideally you'd pass in raw memory buffers, but you'd have to write a small wrapper that loads the image once and measures around a loop of encoder calls.
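
Something like this minimal harness sketch, where encode_sixel() is a hypothetical stand-in for whichever encoder entry point is under test (neither Chafa's nor Notcurses' actual API):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* hypothetical encoder under test */
extern void encode_sixel (const uint8_t *rgba, int width, int height);

void bench (const uint8_t *rgba, int width, int height, int reps)
{
    struct timespec t0, t1;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        encode_sixel (rgba, width, height);  /* only the encoder is timed */
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf ("%.3f ms/frame\n", 1000.0 * secs / reps);
}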

Re. threading: I've pipelined the threads so a single row of image data goes through the various passes from start to finish where possible, but there's still some low-hanging fruit there. It typically keeps working sets smaller than the L1 cache, though I've only checked the scaling code thoroughly for memory bottlenecks (it's the smolscale entrant in this writeup). The threads work on independent buffers, so there isn't any locking except to synchronize between full passes/to wait for the last batch to finish. (Edit: As for quantization, color mapping is threaded, but the palette is generated in a single-threaded first pass.)

i do note that ncplayer helpfully places the cursor at the end of the output, whereas chafa places it immediately below the invoking line (i.e. in the middle of the output).

That shouldn't be happening. Which Chafa version and terminal are you using?

chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal)

When invoked like that, Chafa should be stretching the image also, as far as it can while preserving aspect. Though if you're running an older distro version it might not when stdout is being redirected.

ghost commented 2 years ago

Hey guys, I just had a terrible idea. ;-)

I just found out that sequences exist to query the sixel palette, in both HSL and RGB space. This isn't useful for single images, but for videos: what might it look like, and how fast could it be (average FPS, after the initial setup is ready) if one grabbed the palette that was there already and blindly mapped to it? (For reference, the default VT340 16-color palette is this: https://github.com/hackerb9/vt340test/blob/main/colormap/showcolortable.png)

No initial color setup, just using the best-fit of 16-256 colors. I wonder if that would lead to a cool fuzzy-analog-TV type effect?

🤔

j4james commented 2 years ago

NB: As I pointed out in wez/wezterm#217, technically DECCTR reports the text color table, and not the sixel color table (unless they're the same thing). So unless you're writing software specifically for the VT340, I don't think it's going to be of much use to you.

ghost commented 2 years ago

Thank you for the clarification!

dankamongmen commented 2 years ago

chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal)

When invoked like that, Chafa should be stretching the image also, as far as it can while preserving aspect. Though if you're running an older distro version it might not when stdout is being redirected.

i'm using chafa 1.8.0-1 as packaged in Debian Unstable, and i ran these tests in an XTerm (patch 370).

i should correct my claim: chafa does not leave the cursor in the middle of the output, necessarily, but rather always prints the sixel at the top of the terminal, and places the cursor one line below where chafa was launched. it definitely is not stretching in the sense that ncplayer does by default.

[Screenshot: 2022-01-23-145410_884x556_scrot]

dankamongmen commented 2 years ago

@dankamongmen The reason I specified GIF is that the decoder is simple and bypasses ImageMagick, so you get closer to benchmarking the sixel encoder and not some random image loader (e.g. libpng). ImageMagick also adds tons of overhead. I bet your comparison above is mostly benchmarking the loader. Of course, that doesn't mean Chafa is faster, just that the benchmark is inconclusive :)

i welcome your counterexample, good sir.

dankamongmen commented 2 years ago

I just found out that sequences exist to query the sixel palette, in both HSL and RGB space. This isn't useful for single images, but for videos: what might it look like, and how fast could it be (average FPS, after the initial setup is ready) if one grabbed the palette that was there already and blindly mapped to it? (For reference, the default VT340 16-color palette is this: https://github.com/hackerb9/vt340test/blob/main/colormap/showcolortable.png)

i definitely refresh the palette for each frame of a multiframe visual (this is something that's trivially amenable to multithreading, and i do exactly that in e.g. the [xray] demo of notcurses-demo). i can't imagine anything else being safe. how is that going to work on, say, The Wizard of Oz?

in any case, if you've already handled one frame, you already have the palette available to you without needing to request it from the terminal, no? assuming this is all happening in one context.

ghost commented 2 years ago

The idea (which won't work anyway except on a VT340 it seems) is to never generate a per-image palette. So it will definitely look different/bad, but might be a bit faster if the mapping/dithering can still work quickly.

As to why: maybe faster multihead. :-) But my performance awfulness there could be more from thread contention.

But then reading through you two brought to mind an embarrassingly parallel part of Jexer that is not actually parallel.

dankamongmen commented 2 years ago

@hpjansson i have found a class of images on which you outperform me for sure, so be pleased with that. on the other hand, i have found a class of images on which you outperform me for sure, so don't expect that to remain the case too long =D

hpjansson commented 2 years ago

@hpjansson i have found a class of images on which you outperform me for sure, so be pleased with that. on the other hand, i have found a class of images on which you outperform me for sure, so don't expect that to remain the case too long =D

Hey, I paid good money for access to those papers, it'd better be better at something :)

On big images, a lot of it will come down to multithreading. With many colors it'll likely be the PCA color mapping pulling ahead. With sparse or repeating colors, the bloom filters and branch-free/SWAR code help.

Guessing you already have the image scaling figured out, but if you're interested, smolscale is written to be easily copyable and usable verbatim. It's as fast as I could make it with separable convolution. If you need license concessions we can look at that.

I'm still interested in getting a performance delta without the libpng overhead. Since it adds a fixed overhead to both implementations, the difference in the remaining slice is likely greater than 2x. Do you have a small code snippet that shows how to use the notcurses API to convert an RGB buffer to sixels in direct mode?

hpjansson commented 2 years ago

@dankamongmen You're right that Chafa's sixel output behaves oddly in XTerm, by the way. It used to work in older versions of XTerm, but when I updated it, it started acting weird. XTerm's decoder also seems awfully slow. However, it works fine in mlterm and VTE (from master with sixels enabled). Those are also quite fast.

Thinking there's either a bug in XTerm or a disagreement on standards interpretation (or both). If I do e.g. chafa *.jpg, forcing it to scroll, I get the images nicely laid out one below the other. But if I try to print a single image I get nothing visible at all, not even in the top left corner.

ghost commented 2 years ago

@hpjansson I'm looking at an xterm sixel bug too: https://gitlab.com/klamonte/jexer/-/issues/89

My root cause might be related to the slowish sixel decoder. 🤷‍♀️

hpjansson commented 2 years ago

@dankamongmen Chafa seems to work correctly with XTerm 370 here. Images appear inline and scroll as you'd expect. Since I'm not emitting DECSDM (I think it's good practice for a command-line tool to mess as little as possible with terminal state; going by the recent discussions, this seems to be analogous to how lsix and convert do it), is it possible that scrolling had been disabled in your terminal prior to running Chafa? Maybe Notcurses left it disabled?

hpjansson commented 2 years ago

@hpjansson I'm looking at an xterm sixel bug too: https://gitlab.com/klamonte/jexer/-/issues/89

My root cause might be related to the slowish sixel decoder. 🤷‍♀️

Just looking at it from afar, so maybe I'm wrong, but it feels like a typical event loop problem where the highest-priority event source is preventing other sources from getting any time slices. Confirmed it here with big sixel animations -- stuck parsing, never gets around to drawing anything.

Edit: This seems unique to XTerm. For instance, mlterm also bottlenecks on 4k video, but it seems to split its time ok between parsing and updating.

dankamongmen commented 2 years ago

@dankamongmen Chafa seems to work correctly with XTerm 370 here. Images appear inline and scroll as you'd expect. Since I'm not emitting DECSDM (I think it's good practice for a command-line tool to mess as little as possible with terminal state; going by the recent discussions, this seems to be analogous to how lsix and convert do it), is it possible that scrolling had been disabled in your terminal prior to running Chafa? Maybe Notcurses left it disabled?

completely possible. this can also be controlled by your XTerm configuration using the sixelScrolling resource.

hpjansson commented 2 years ago

By the way, you can make the loader take up less of the benchmarking time by just generating really big output images. After maximizing the terminal on a 4k screen, I get this:

$ for i in $(seq 1 5); do time ~/work/vcs/chafa.git/tools/chafa/chafa -f sixel data/notcurses.png >/dev/null; done

real    0m0.073s
user    0m0.201s
sys 0m0.014s

real    0m0.074s
user    0m0.210s
sys 0m0.008s

real    0m0.077s
user    0m0.213s
sys 0m0.005s

real    0m0.074s
user    0m0.167s
sys 0m0.041s

real    0m0.075s
user    0m0.177s
sys 0m0.033s

$ for i in $(seq 1 5); do time ./ncplayer -b pixel -k -sscale data/notcurses.png  >/dev/null; done

real    0m0.463s
user    0m0.423s
sys 0m0.028s

real    0m0.467s
user    0m0.405s
sys 0m0.052s

real    0m0.465s
user    0m0.399s
sys 0m0.056s

real    0m0.466s
user    0m0.415s
sys 0m0.039s

real    0m0.458s
user    0m0.404s
sys 0m0.044s

So Chafa's sixel encoder is slightly north of 6x (six times) faster than Notcurses' when the actual sixel code dominates the benchmark. Single-thread performance is about 2x. Intel Haswell 4/8 core.

The PNG loader is still adding overhead, so if I set up a test harness with raw image buffers, it should increase the delta a bit more.

dankamongmen commented 2 years ago

yep, the AllRGB images i'm testing with are 4096x4096 with each RGB color used exactly once. i'm adding some threading now, and expect to do significantly better soon on such large images. btw, how many cores on the machine you ran those tests on?

hpjansson commented 2 years ago

4 cores physical, 8 logical. Note that the AllRGB images are huge, so you're still mostly benchmarking the loader. To get a good idea of the encoder's performance, the output image must be (much) bigger than the input image. Since AllRGB images must be 16MP to contain all the colors, I think the only way to get a true benchmark would be to load the image first and put a timer around the encoder calls.

hpjansson commented 2 years ago

Note the output image must be big. When using AllRGB 4096x4096 images as input you're doing the inverse and spending even more time in libpng/ImageMagick.

ghost commented 2 years ago

@hpjansson

I have gotten a lot of gains, primarily in quality but also just in general structure and performance, from your outline of chafa's approach. Thank you so much for detailing it here!

When palette colors are sorted by their first principal component, you can do a binary search followed by a neighborhood probe for the closest match. This yields colors that are identical to a linear 3-component euclidean search, but instead of 256*3=768 multiplications (for a 256-color palette), it gets by with <16 multiplications typically.

This was such a fun trip to do! :-)

Putting in the eigenvalue solver (I used this one, which is just a C repackaging of JAMA's Java repackaging of EISPACK, all of which are public domain) really warmed the cockles of my black computational chemist heart. Pop in the covariance matrix calc, get the eigenvalues/vectors, and go to town with the binary search.
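
A minimal sketch of that pipeline, with power iteration standing in for the full EISPACK-style solver (it only recovers the dominant eigenvector, which is all the pc1 sort needs; names are illustrative):

#include <math.h>

typedef struct { unsigned char r, g, b; } Color;

/* Build the 3x3 covariance matrix of the palette colors and power-
   iterate to its dominant eigenvector. Scaling the matrix by 1/n is
   skipped, since it doesn't change the eigenvectors. */
void first_pc (const Color *pal, int n, float pc1 [3])
{
    double mean [3] = { 0 }, cov [3][3] = { { 0 } };

    for (int i = 0; i < n; i++) {
        mean [0] += pal [i].r;
        mean [1] += pal [i].g;
        mean [2] += pal [i].b;
    }
    for (int k = 0; k < 3; k++)
        mean [k] /= n;

    for (int i = 0; i < n; i++) {
        double d [3] = { pal [i].r - mean [0],
                         pal [i].g - mean [1],
                         pal [i].b - mean [2] };
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                cov [j][k] += d [j] * d [k];
    }

    /* v <- normalize (C v) converges to the dominant eigenvector for
       almost any starting vector */
    double v [3] = { 1.0, 1.0, 1.0 };
    for (int it = 0; it < 50; it++) {
        double w [3] = { 0 };
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                w [j] += cov [j][k] * v [k];
        double norm = sqrt (w [0] * w [0] + w [1] * w [1] + w [2] * w [2]);
        if (norm == 0.0)
            break;
        for (int j = 0; j < 3; j++)
            v [j] = w [j] / norm;
    }
    for (int j = 0; j < 3; j++)
        pc1 [j] = (float) v [j];
}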

Thank you also for the pointer on the neighborhood search. I was close-but-not-quite there, and when I put just the two other minor checks in it was like "Woah! That is SO pretty!" :)

The mapping is stored in a flat 16384-entry hash table, so if the exact same color is seen in the near future, it will be satisfied by an inlined O(1) lookup.

I cut about 40% of the color-matching time by hanging onto the previous binary search answer and checking the difference in pca1 on the next try: if it was within roughly 8 indices, just start from there. Not quite O(1), but still a boost.

Then I added a front-end 8192-entry hash table, and got another boost.

The palette itself is just generated with sparse median cut. Nothing interesting there :) It could probably be sped up with parallel merge sort or maybe a fancier quantization algorithm.

Palette generation is still about 5% of the total time, but I'm not unhappy. I just sample num_colors pixels in uniform 16-pixel runs, and was really surprised how effective that is.
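
A minimal sketch of that sampling (assuming n_out is a multiple of 16 and n_px >= n_out; names are illustrative):

typedef struct { unsigned char r, g, b; } Color;

/* Take n_out sample pixels as uniform 16-pixel runs spread evenly
   across the image. The assumptions above keep the runs in bounds
   and non-overlapping. */
void sample_pixels (const Color *px, long n_px, Color *out, int n_out)
{
    int  runs   = n_out / 16;
    long stride = n_px / runs;
    int  k = 0;

    for (int r = 0; r < runs; r++)
        for (int i = 0; i < 16; i++)
            out [k++] = px [r * stride + i];
}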

Since the encoder has to do many passes per six-height row, it first converts the row into an efficient internal format, so the data needed for each column can be fetched from memory in one load instead of six. I also took pains to make the column-to-character converter branch-free and divided the row into multiple "banks" with a bloom filter so if a palette color does not appear in a run of say, 64 columns, those can be skipped over quickly.

This is the next frontier for the single-thread case. Having cleaned up a lot of crap around it, those 6 memory loads stick out so much and uuugggh lol.

If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.

Funny enough, that's exactly how I got here. :-) I want gaming and videos to work, so gotta get a lot faster for it.

Performance gains in the large so far have come from the following:

Thanks to your pointers, after a week or so I think I'm about 4x better on image quality, and 2-3x faster now. Still really far behind both of you here: on my puny i3-7100U @ 2.40GHz it's about 20 seconds for the AllRGB balloon.png:

Act. scan 0.9101s      map 0.0843s     dither 13.5866s emit 5.8524s
Act. scan 1.0277s      map 0.0711s     dither 13.4412s emit 5.5109s
Act. scan 0.8645s      map 0.0661s     dither 13.4089s emit 6.8736s
Act. scan 0.9389s      map 0.0603s     dither 13.4553s emit 6.8573s
Act. scan 0.9412s      map 0.0561s     dither 13.4151s emit 6.7967s
Act. scan 0.8742s      map 0.0650s     dither 13.4132s emit 6.8243s

That appears to be after the JIT has done what it can. On smaller images, you can see it stepping in and the speedup:

 Act. scan 0.1206s      map 0.0368s     dither 0.7361s  emit 1.0289s  --> 1.9 s
 Act. scan 0.0562s      map 0.0137s     dither 0.6473s  emit 0.5490s
 Act. scan 0.0698s      map 0.0171s     dither 0.6141s  emit 0.4027s
 Act. scan 0.0867s      map 0.0147s     dither 0.6279s  emit 0.4538s
 Act. scan 0.0868s      map 0.0148s     dither 0.6222s  emit 0.4592s
 Act. scan 0.0566s      map 0.0085s     dither 0.6174s  emit 0.4333s
 Act. scan 0.0532s      map 0.0090s     dither 0.6511s  emit 0.4191s
 Act. scan 0.0531s      map 0.0077s     dither 0.6112s  emit 0.4105s
 Act. scan 0.0650s      map 0.0077s     dither 0.6112s  emit 0.4499s
 Act. scan 0.0859s      map 0.0090s     dither 0.6146s  emit 0.4072s
 Act. scan 0.0643s      map 0.0077s     dither 0.6115s  emit 0.4268s  --> 1.1 s

So I think that's like 1% of your speed? 🤣🤣