cdluminate closed this issue 4 years ago
Yes, that's useful. Thanks. Another interesting project is libsixel: https://github.com/saitoha/libsixel
Sixels aren't that hard to do -- the hard part is the design/API decisions needed to make them fit as seamlessly as possible alongside character-cell-based graphics. There is some research to be done too, like how feasible it would be to mix sixels and character cells, e.g. what do the various terminals do with existing sixel graphics when you change the font size or overdraw them with text?
I'll need to look at it more carefully, but I think the best way to implement sixel support in Chafa may be to do it directly in the frontend and leave libchafa to deal with character-cell graphics only, since it's already built around a character-cell API. Maybe.
Hi, I am currently implementing sixel support for a TUI https://github.com/gizak/termui/pull/233.
I first check the terminal's response to \033[0c for a 4, which indicates that the terminal supports sixel.
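The sixel check described above can be sketched as a small reply parser. This assumes the standard primary Device Attributes reply shape, \033[?<attr>;<attr>;...c, where attribute 4 advertises sixel; the function name is illustrative, and actually reading the reply from the tty is left out:

```python
import re

def supports_sixel(da1_reply: str) -> bool:
    """Check a DA1 (\\033[0c) reply string for the sixel attribute (4)."""
    m = re.match(r"\x1b\[\?([\d;]*)c", da1_reply)
    if not m:
        return False
    return "4" in m.group(1).split(";")
```

For example, a reply of `\033[?62;4;22c` advertises sixel, while `\033[?62;22c` does not.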
The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.
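Assuming the standard reply formats (\033[4;<height>;<width>t for the pixel-size query \033[14t, and \033[8;<rows>;<cols>t for the cell-size query \033[18t), the cell-size calculation might look like this sketch. Names are illustrative, and the raw-mode tty I/O needed to actually receive the replies is omitted:

```python
import re

def parse_size_reply(reply: str):
    """Parse a \\033[4;h;wt or \\033[8;rows;colst reply into (h, w)."""
    m = re.match(r"\x1b\[(4|8);(\d+);(\d+)t", reply)
    if not m:
        return None
    return int(m.group(2)), int(m.group(3))

def cell_size_px(pixel_reply: str, cell_reply: str):
    """Derive one character cell's size in pixels as (width, height)."""
    px = parse_size_reply(pixel_reply)      # (height_px, width_px)
    cells = parse_size_reply(cell_reply)    # (rows, cols)
    if not px or not cells:
        return None
    return px[1] // cells[1], px[0] // cells[0]
```

So a 1600x1000 pixel terminal that is 160x50 cells would yield a 10x20 pixel cell.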
Escape codes can be passed through the terminal multiplexers tmux and screen - please check my PR on termui for that. Handling scrolling output for those is probably difficult.
Thanks, @srlehn. Chafa currently uses TIOCGWINSZ to get the size in character cells, and it works over ssh, in screen and in tmux, but not over a simple serial device. I don't think we can reliably get pixel dimensions that way, though (winsize.ws_xpixel and winsize.ws_ypixel come back as zero), so we'd have to use escape codes.
Hi @hpjansson, thanks for letting me know that it works over ssh. Yes, you probably need to send the wrapped escape codes \033[14t and \033[18t through to the outer terminal to figure out the character box size, and the unwrapped \033[18t or TIOCGWINSZ to tmux/screen to get the correct cell size for the current pane/window of the multiplexer. I don't remember exactly what works for tmux/screen.
I spoke too quickly. .ws_xpixel and .ws_ypixel properly reflect the terminal's pixel dimensions for mlterm, which supports sixels, but not for VTE (gnome-terminal), which does not. Experimentally this works over ssh and in screen too, modulo potential issues with resizing. So maybe we can just rely on that for simplicity.
Another reference implementation for sixel support: https://github.com/jart/hiptext (this tool is quite similar to chafa)
Sixel support has been committed to master. Use chafa -f sixel. It will be used automatically in mlterm.
I wrote it from scratch, and it's the fastest implementation I'm aware of by far. There is still some refactoring, optimization and general cleanup to do, but it's working now, so I'll close this issue.
Good job!
FYI more terminals are supporting sixel recently:
vte (gnome-terminal and more) and xterm.js are in progress.
mintty just fixed some bugs and is looking good, so Windows users have another option besides mlterm and RLogin.
DomTerm's sixel support got much better.
More terminals and widgets with sixel support are reported at: https://gitlab.com/klamonte/jexer/wikis/terminals .
I wrote it from scratch, and it's the fastest implementation I'm aware of by far. There is still some refactoring, optimization and general cleanup to do, but it's working now, so I'll close this issue.
I haven't tried it yet, but peeking at the C it sure looks like it would be fast. I had to do quite a bit to get my Java encoder fast enough for the real world. Some of the tricks I resorted to are outlined at: https://jexer.sourceforge.io/evolution.html#sixel .
@klamonte I see you did some work on speeding up palette mapping. That's where I focused my efforts too. After some experimentation I settled on an approach using principal component analysis and a hash table.
When palette colors are sorted by their first principal component, you can do a binary search followed by a neighborhood probe for the closest match. This yields colors that are identical to a linear 3-component euclidean search, but instead of 256*3=768 multiplications (for a 256-color palette), it gets by with <16 multiplications typically. The mapping is stored in a flat 16384-entry hash table, so if the exact same color is seen in the near future, it will be satisfied by an inlined O(1) lookup.
The palette itself is just generated with sparse median cut. Nothing interesting there :) It could probably be sped up with parallel merge sort or maybe a fancier quantization algorithm.
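For reference, a compact median-cut sketch in stdlib Python; "sparse" is taken here to mean the boxes are built from a subsample of the pixels (an assumption), and the box/channel heuristics are the simplest possible rather than what chafa-palette.c actually does:

```python
def channel_range(box, ch):
    vals = [c[ch] for c in box]
    return max(vals) - min(vals)

def median_cut(pixels, n_colors):
    """Split pixel boxes at the median of the widest channel."""
    boxes = [list(pixels)]
    while len(boxes) < n_colors:
        box = max(boxes,
                  key=lambda b: max(channel_range(b, ch) for ch in range(3)))
        ch = max(range(3), key=lambda ch: channel_range(box, ch))
        if channel_range(box, ch) == 0:
            break  # every box is uniform; nothing left to split
        box.sort(key=lambda c: c[ch])
        mid = len(box) // 2
        boxes.remove(box)
        boxes += [box[:mid], box[mid:]]
    # Average each box into one palette entry.
    return [tuple(sum(c[ch] for c in box) // len(box) for ch in range(3))
            for box in boxes]
```

On an image with two well-separated clusters and n_colors=2, this recovers one entry per cluster.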
The quantization/palette mapping code is spread across chafa-palette.c, chafa-color-table.c, chafa-color-hash.c, chafa-pca.c and the respective .h files.
Since the encoder has to do many passes per six-height row, it first converts the row into an efficient internal format, so the data needed for each column can be fetched from memory in one load instead of six. I also took pains to make the column-to-character converter branch-free and divided the row into multiple "banks" with a bloom filter so if a palette color does not appear in a run of say, 64 columns, those can be skipped over quickly.
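The band-layout idea can be illustrated with a toy transpose (Python stand-in for the packed C format; the bloom filter and bank skipping are omitted). Each column of a six-row band becomes a single record, and a column's sixel data character for a given palette color falls out of that one record:

```python
def transpose_band(band):
    """band: 6 rows of palette indices -> one tuple per column."""
    assert len(band) == 6
    return list(zip(*band))

def sixel_char(column, color_index):
    """Build the 6-bit mask for one color; sixel data chars are '?'..'~'."""
    mask = 0
    for bit, idx in enumerate(column):
        if idx == color_index:
            mask |= 1 << bit
    return chr(63 + mask)
```

A column that is entirely one color emits '~' (all six bits set) for that color and '?' (no bits) for every other color.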
If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.
Feel free to e-mail me if you have any questions.
@hpjansson
I disappeared for a while, but obviously I am swinging back around. I got my encoder a bit better, inspired by notcurses' new octree encoder.
I wanted to point out @dankamongmen 's work here, as you both have very interesting optimizations for speed. When I find time to do better than my mix of directly mapping colors (when the palette is smaller than the number of registers) and boring median-cut, I want to play with both of your approaches. :-)
Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.
I'd like to do a side-by-side comparison of the existing sixel encoders at some point when I'm less swamped with work :)
Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.
to be precise, i use an octree-inspired algorithm. my first attempt at using octrees directly gave less-than-awesome results. when wondering why, my key insight was that octree's advantages come from a power-of-two-based spectrum (0..255 per channel). sixel, of course, is not a 2^n space, but rather 101 values. i then moved to a "decatree" configuration of my own, and got far better results. this is only viable if you're converting colors from 256->101 at the beginning rather than at the end (which is what i do, since i'm not doing any of the averaging commonly employed by mincut algorithms, and thus needn't preserve original precision through the calculation).
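The up-front 256-to-101 conversion mentioned above might be sketched like this; the exact rounding is a choice, not necessarily what notcurses does (sixel color registers take channel values 0-100):

```python
def to_sixel_scale(v):
    """Map an 8-bit channel value (0-255) onto sixel's 0-100 scale."""
    return (v * 100 + 127) // 255   # integer scaling with rounding

def from_sixel_scale(v):
    """Map a sixel channel value (0-100) back to 0-255."""
    return (v * 255 + 50) // 100
```

The round trip loses at most a couple of values of precision per channel, which is why converting early only makes sense if no later averaging needs the original 8-bit precision.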
I'd like to do a side-by-side comparison of the existing sixel encoders at some point when I'm less swamped with work :)
https://github.com/dankamongmen/notcurses/discussions/1857 shows notcurses vs timg from last June. i've become much faster since then. you might want to take a look at quantanal.c from src/poc in a Notcurses checkout. it takes an arbitrary image file, decodes it to RGBA, encodes it to sixel, decodes that sixel (verifying correctness of the encoding) to RGBA, and compares the resulting RGBA to the original RGBA:
analyzing ../data/notcurses.png...
Image too large, scaling to display
source pixels: 320x1280 rendered: 528x880 1858560B
control sequence: 418852 bytes (22.54%)
00464640 pixels analyzed Δr 8680605 Δg 8884951 Δb 8094059
130838px Δr 8680605 (66.3) Δg 8884951 (67.9) Δb 8094059 (61.9)
avg diff per pixel: 196
done with ../data/notcurses.png in 149.454ms.
If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.
whoa whoa whoa, isn't this kinda picking your benchmarks? if you go with a slow decoder, that ought be represented in your timings. i use FFmpeg as my sole decoding backend, in part because imagemagick was slower with no discernible improvement in image quality (indeed, in my testing, with no difference in image quality whatsoever). unless you're strictly comparing sixel implementations, of course, but an end user only cares about time-to-display from start time.
If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.
whoa whoa whoa, isn't this kinda picking your benchmarks? if you go with a slow decoder, that ought be represented in your timings. i use FFmpeg as my sole decoding backend, in part because imagemagick was slower with no discernible improvement in image quality (indeed, in my testing, with no difference in image quality whatsoever). unless you're strictly comparing sixel implementations, of course, but an end user only cares about time-to-display from start time.
btw, this is why quantanal above shows both time-to-run and average error. i was unwilling to include any time optimization that lowered output image quality.
The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.
the ioctl works over remote ssh connections just fine. there's a specific SSH protocol data unit for resolution change notification.
The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels.
the ioctl works over remote ssh connections just fine. there's a specific SSH protocol data unit for resolution change notification.
It's in the protocol but not all ssh implementations honor it or expose it to client code. I had to manually patch cryptlib for it once upon a time, and it sucked because there were no good SSH alternatives for Windows that worked with Borland C++ or Visual Studio 6. Today I would use mbedtls.
I wrote it from scratch, and it's the fastest implementation I'm aware of by far.
shots fired! =]
so far as i can tell, chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal). i do note that ncplayer helpfully places the cursor at the end of the output, whereas chafa places it immediately below the invoking line (i.e. in the middle of the output).
....drum roll please...
[schwarzgerat](0) $ for i in `seq 1 5` ; do time chafa -f sixel ../data/notcurses.png > /dev/null ; done
real 0m0.044s
user 0m0.025s
sys 0m0.098s
real 0m0.047s
user 0m0.062s
sys 0m0.063s
real 0m0.049s
user 0m0.046s
sys 0m0.084s
real 0m0.067s
user 0m0.070s
sys 0m0.178s
real 0m0.051s
user 0m0.129s
sys 0m0.053s
[schwarzgerat](0) $ for i in `seq 1 5` ; do time ./ncplayer ../data/notcurses.png -k -sscale > /dev/null ; done
real 0m0.047s
user 0m0.040s
sys 0m0.008s
real 0m0.049s
user 0m0.041s
sys 0m0.008s
real 0m0.051s
user 0m0.036s
sys 0m0.016s
real 0m0.053s
user 0m0.041s
sys 0m0.012s
real 0m0.055s
user 0m0.042s
sys 0m0.012s
[schwarzgerat](0) $
given that you have several user/sys times that exceed your real time, i'm guessing you've got a multithreaded solution? i do not yet employ any threading in my decodes, but am considering adding some. in any case, we've got chafa:
.044, .047, .049, .067, .051
and ncplayer:
.047, .049, .051, .053, .055
your min beats mine by .003, my max beats yours by .012, and my average beats yours by .0006, whew photo finish! i'd ascribe all these deltas to noise, but "significantly faster than all other implementations" no longer seems a valid claim IMHO =].
given that you have several user/sys times that exceed your real time, i'm guessing you've got a multithreaded solution? i do not yet employ any threading in my decodes, but am considering adding some. in any case, we've got chafa:
if you're doing a blind thread-per-core implementation, know that i tested this on an AMD 3970X (32 physical/64 logical cores), and you'd probably have less overhead on a smaller machine. questions like this are why i haven't yet moved to a threaded solution; were i to do so, i'd likely have a thread-per-some-unit-area-that-is-a-multiple-of-page-size, capped by the number of logical cores, using a fractal spawnout to parallelize initial spawning overheads.
We have two incredible winners here!
I feel like I'm watching one of the great US Open matches with Andre Agassi vs Pete Sampras.
This is awesome. :)
Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass.
two comments here:
i originally went with a single pass, drawn by the allure of cutting out the O(P) second pass entirely. i don't think it's really possible to do a single pass more quickly than a non-pathological two-pass encoder so long as memory access time >> arithmetic time, as it does on all modern processors. the reason is that any refinement or merging after initial color distribution (histogram creation, for a mincut implementation) has to touch all affected pixels, and you're likely refining areas covering the most pixels, and if the same pixel is involved in more than one refinement/merge, you have to hit it twice and have lost the game to a two-pass encoder. without at least that much refinement, you end up with a crappy encoding (leaving aside e.g. radix sort on the input space, requiring as it would at least 10^6 or 2^24 elements in the worst case). so two-pass seems the way to go, as further evidenced by its use almost everywhere.
thinking of how mincut and N-tree work, i wonder whether the latter looked artificially worse due to your threaded decode (and necessary locking thereby implied)? i haven't thought this all the way through yet; it might be nonsense.
@dankamongmen The reason I specified GIF is that the decoder is simple and bypasses ImageMagick, so you get closer to benchmarking the sixel encoder and not some random image loader (e.g. libpng). ImageMagick also adds tons of overhead. I bet your comparison above is mostly benchmarking the loader. Of course, that doesn't mean Chafa is faster, just that the benchmark is inconclusive :)
Ideally you'd pass in raw memory buffers, but you'd have to write a small wrapper that loads the image once and measures around a loop of encoder calls.
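Such a wrapper is a few lines in any language; the Python sketch below shows the shape (encode stands in for whichever sixel encoder is under test, and the frame is decoded once, outside the timed loop):

```python
import time

def bench(encode, frame, warmup=2, iters=20):
    """Time only the encoder: the frame is loaded once, encoded many times."""
    for _ in range(warmup):          # prime caches / JIT / allocators
        encode(frame)
    t0 = time.perf_counter()
    for _ in range(iters):
        encode(frame)
    return (time.perf_counter() - t0) / iters
```

The warmup iterations matter most for JIT-compiled encoders, but they also stabilize allocator and page-cache behavior for native code.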
Re. threading: I've pipelined the threads so a single row of image data goes through the various passes from start to finish where possible, but there's still some low-hanging fruit there. It typically gets working sets smaller than L1 cache, though I've only checked the scaling code thoroughly for memory bottlenecks (it's the smolscale entrant in this writeup). The threads work on independent buffers, so there isn't any locking except to synchronize between full passes/to wait for the last batch to finish. (Edit: As for quantization, color mapping is threaded, but the palette is generated in a single-threaded first pass.)
i do note that ncplayer helpfully places the cursor at the end of the output, whereas chafa places it immediately below the invoking line (i.e. in the middle of the output).
That shouldn't be happening. Which Chafa version and terminal are you using?
chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal
When invoked like that, Chafa should be stretching the image also, as far as it can while preserving aspect. Though if you're running an older distro version it might not when stdout is being redirected.
Hey guys, I just had a terrible idea. ;-)
I just found out that sequences exist to query the sixel palette, in both HSL and RGB space. This isn't useful for single images, but for videos: what might it look like, and how fast could it be (average FPS, after the initial setup is ready) if one grabbed the palette that was there already and blindly mapped to it? (For reference, the default VT340 16-color palette is this: https://github.com/hackerb9/vt340test/blob/main/colormap/showcolortable.png)
No initial color setup, just using the best-fit of 16-256 colors. I wonder if that would lead to a cool fuzzy-analog-TV type effect?
🤔
NB: As I pointed out in wez/wezterm#217, technically DECCTR reports the text color table, not the sixel color table (unless they're the same thing). So unless you're writing software specifically for the VT340, I don't think it's going to be of much use to you.
Thank you for the clarification!
chafa -f sixel FILE and ncplayer -k -sscale FILE on a sixel-supporting terminal produce very similar output (without the -sscale, ncplayer will by default stretch to fill the entire terminal When invoked like that, Chafa should be stretching the image also, as far as it can while preserving aspect. Though if you're running an older distro version it might not when stdout is being redirected.
i'm using chafa 1.8.0-1 as packaged in Debian Unstable, and i ran these tests in an XTerm (patch 370).
i should correct my claim: chafa does not leave the cursor in the middle of the output, necessarily, but rather always prints the sixel at the top of the terminal, and places the cursor one line below where chafa was launched. it definitely is not stretching in the sense that ncplayer does by default.
@dankamongmen The reason I specified GIF is that the decoder is simple and bypasses ImageMagick, so you get closer to benchmarking the sixel encoder and not some random image loader (e.g. libpng). ImageMagick also adds tons of overhead. I bet your comparison above is mostly benchmarking the loader. Of course, that doesn't mean Chafa is faster, just that the benchmark is inconclusive :)
i welcome your counterexample, good sir.
I just found out that sequences exist to query the sixel palette, in both HSL and RGB space. This isn't useful for single images, but for videos: what might it look like, and how fast could it be (average FPS, after the initial setup is ready) if one grabbed the palette that was there already and blindly mapped to it? (For reference, the default VT340 16-color palette is this: https://github.com/hackerb9/vt340test/blob/main/colormap/showcolortable.png)
i definitely refresh the palette for each frame of a multiframe visual (this is something that's trivially amenable to multithreading, and i do exactly that in e.g. the xray demo of notcurses-demo). i can't imagine anything else being safe. how is that going to work on, say, The Wizard of Oz?
in any case, if you've already handled one frame, you already have the palette available to you without needing to request it from the terminal, no? assuming this is all happening in one context.
The idea (which won't work anyway except on a VT340 it seems) is to never generate a per-image palette. So it will definitely look different/bad, but might be a bit faster if the mapping/dithering can still work quickly.
As to why: maybe faster multihead. :-) But my performance awfulness there could be more from thread contention.
But then reading through you two brought to mind an embarrassingly parallel part of Jexer that is not actually parallel.
@hpjansson i have found a class of images on which you outperform me for sure, so be pleased with that. on the other hand, i have found a class of images on which you outperform me for sure, so don't expect that to remain the case too long =D
@hpjansson i have found a class of images on which you outperform me for sure, so be pleased with that. on the other hand, i have found a class of images on which you outperform me for sure, so don't expect that to remain the case too long =D
Hey, I paid good money for access to those papers, it'd better be better at something :)
On big images, a lot of it will come down to multithreading. With many colors it'll likely be the PCA color mapping pulling ahead. With sparse or repeating colors, the bloom filters and branch-free/SWAR code helps.
Guessing you already have the image scaling figured out, but if you're interested, smolscale is written to be easily copyable and usable verbatim. It's as fast as I could make it with separable convolution. If you need license concessions we can look at that.
I'm still interested in getting a performance delta without the libpng overhead. Since it adds a fixed overhead to both implementations, the difference in the remaining slice is likely greater than 2x. Do you have a small code snippet that shows how to use notcurses API to convert an RGB buffer to sixels in direct mode?
@dankamongmen You're right that Chafa's sixel output behaves oddly in XTerm, by the way. It used to work in older versions of XTerm, but when I updated it it started acting weird. XTerm's decoder also seems awfully slow. However, works fine in mlterm and VTE (from master with sixels enabled). Those are also quite fast.
Thinking there's either a bug in XTerm and/or a disagreement on standards interpretation. If I do e.g. chafa *.jpg, forcing it to scroll, I get the images nicely laid out one below the other. But if I try to print a single image I get nothing visible at all, not even in the top left corner.
@hpjansson I'm looking at an xterm sixel bug too: https://gitlab.com/klamonte/jexer/-/issues/89
My root cause might be related to the slowish sixel decoder. 🤷♀️
@dankamongmen Chafa seems to work correctly with XTerm 370 here. Images appear inline and scroll as you'd expect. Since I'm not emitting DECSDM (I think it's good practice for a command-line tool to mess as little as possible with terminal state; going by the recent discussions, this seems to be analogous to how lsix and convert do it), is it possible that scrolling had been disabled in your terminal prior to running Chafa? Maybe Notcurses left it disabled?
@hpjansson I'm looking at an xterm sixel bug too: https://gitlab.com/klamonte/jexer/-/issues/89
My root cause might be related to the slowish sixel decoder. 🤷♀️
Just looking at it from afar, so maybe I'm wrong, but it feels like a typical event loop problem where the highest-priority event source is preventing other sources from getting any time slices. Confirmed it here with big sixel animations -- stuck parsing, never gets around to drawing anything.
Edit: This seems unique to XTerm. For instance, mlterm also bottlenecks on 4k video, but it seems to split its time ok between parsing and updating.
@dankamongmen Chafa seems to work correctly with XTerm 370 here. Images appear inline and scroll as you'd expect. Since I'm not emitting DECSDM (I think it's good practice for a command-line tool to mess as little as possible with terminal state; going by the recent discussions, this seems to be analogous to how lsix and convert do it), is it possible that scrolling had been disabled in your terminal prior to running Chafa? Maybe Notcurses left it disabled?
completely possible. this can also be controlled by your XTerm configuration using the sixelScrolling resource.
By the way, you can make the loader take up less of the benchmarking time by just generating really big output images. After maximizing the terminal on a 4k screen, I get this:
$ for i in $(seq 1 5); do time ~/work/vcs/chafa.git/tools/chafa/chafa -f sixel data/notcurses.png >/dev/null; done
real 0m0.073s
user 0m0.201s
sys 0m0.014s
real 0m0.074s
user 0m0.210s
sys 0m0.008s
real 0m0.077s
user 0m0.213s
sys 0m0.005s
real 0m0.074s
user 0m0.167s
sys 0m0.041s
real 0m0.075s
user 0m0.177s
sys 0m0.033s
$ for i in $(seq 1 5); do time ./ncplayer -b pixel -k -sscale data/notcurses.png >/dev/null; done
real 0m0.463s
user 0m0.423s
sys 0m0.028s
real 0m0.467s
user 0m0.405s
sys 0m0.052s
real 0m0.465s
user 0m0.399s
sys 0m0.056s
real 0m0.466s
user 0m0.415s
sys 0m0.039s
real 0m0.458s
user 0m0.404s
sys 0m0.044s
So Chafa's sixel encoder is slightly north of 6x (six times) faster than Notcurses' when the actual sixel code dominates the benchmark. Single-thread performance is about 2x. Intel Haswell 4/8 core.
The PNG loader is still adding overhead, so if I set up a test harness with raw image buffers, it should increase the delta a bit more.
yep, the AllRGB images i'm testing with are 4096x4096 with each RGB color used exactly once. i'm adding some threading now, and expect to do significantly better soon on such large images. btw, how many cores on the machine you ran those tests on?
4 cores physical, 8 logical. Note that the AllRGB images are huge, so you're still mostly benchmarking the loader. To get a good idea of the encoder's performance, the output image must be (much) bigger than the input image. Since AllRGB images must be 16MP to contain all the colors, I think the only way to get a true benchmark would be to load the image first and put a timer around the encoder calls.
Note the output image must be big. When using AllRGB 4096x4096 images as input you're doing the inverse and spending even more time in libpng/ImageMagick.
@hpjansson
I have gotten a lot of gains, primarily in quality but also just in general structure and performance, from your outline of chafa's approach. Thank you so much for detailing it here!
When palette colors are sorted by their first principal component, you can do a binary search followed by a neighborhood probe for the closest match. This yields colors that are identical to a linear 3-component euclidean search, but instead of 256*3=768 multiplications (for a 256-color palette), it gets by with <16 multiplications typically.
This was such a fun trip to do! :-)
Putting in the eigenvalue solver (I used this one, which is just a C repackaging of JAMA's Java repackaging of EISPACK, all of which are public domain) really warmed the cockles of my black computational chemist heart. Pop in the covariance matrix calc, get the eigenvalues/vectors, and go to town with the binary search.
Thank you also for the pointer on the neighborhood search. I was close-but-not-quite there, and when I put just the two other minor checks in it was like "Woah! That is SO pretty!" :)
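The covariance-plus-eigenvector step mentioned above can be sketched with power iteration as a lightweight stand-in for a full eigenvalue solver like the EISPACK-derived one; this finds only the dominant eigenvector, which is all the sort key needs:

```python
def first_pc(colors, iters=100):
    """First principal component of an RGB palette via power iteration."""
    n = len(colors)
    mean = [sum(c[ch] for c in colors) / n for ch in range(3)]
    # 3x3 covariance matrix of the centered colors.
    cov = [[sum((c[i] - mean[i]) * (c[j] - mean[j]) for c in colors) / n
            for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # degenerate palette (all one color)
        v = [x / norm for x in w]
    return v
```

For a palette whose variance lies mostly along one channel, the returned unit vector points (up to sign) along that channel.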
The mapping is stored in a flat 16384-entry hash table, so if the exact same color is seen in the near future, it will be satisfied by an inlined O(1) lookup.
I cut about 40%-ish color-matching time by hanging onto the previous binary search answer, and checking the difference in pca1 on the next try: if it was within roughly 8 indices just start from there. Not quite O(1), but still a boost.
Then I added a front-end 8192-entry hash table, and got another boost.
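That kind of front-end table can be sketched like this (illustrative Python; the key packing, table size, and modulo hash are arbitrary choices here, and slow_match stands in for the full PCA search):

```python
TABLE_SIZE = 8192

def make_cache():
    """Flat fixed-size table of (packed_rgb_key, palette_index) entries."""
    return [None] * TABLE_SIZE

def cached_match(cache, color, slow_match):
    key = (color[0] << 16) | (color[1] << 8) | color[2]
    slot = key % TABLE_SIZE      # a multiplicative hash would mix better
    entry = cache[slot]
    if entry is not None and entry[0] == key:
        return entry[1]          # hit: O(1), no search at all
    result = slow_match(color)   # miss: fall back to the full search
    cache[slot] = (key, result)
    return result
```

Collisions simply evict the previous entry, which is fine because the table is only a cache; correctness always comes from the fallback search.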
The palette itself is just generated with sparse median cut. Nothing interesting there :) It could probably be sped up with parallel merge sort or maybe a fancier quantization algorithm.
Palette generation is still about 5% of the total time, but I'm not unhappy. I just sample num_colors in uniform 16-pixel runs, and was really surprised how effective that is.
Since the encoder has to do many passes per six-height row, it first converts the row into an efficient internal format, so the data needed for each column can be fetched from memory in one load instead of six. I also took pains to make the column-to-character converter branch-free and divided the row into multiple "banks" with a bloom filter so if a palette color does not appear in a run of say, 64 columns, those can be skipped over quickly.
This is the next frontier for the single-thread case. Having cleaned up a lot of crap around it, those 6 memory loads stick out so much and uuugggh lol.
If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading.
Funny enough, that's exactly how I got here. :-) I want gaming and videos to work, so gotta get a lot faster for it.
Performance gains in the large so far have come from the following:
Thanks to your pointers, after a week or so I think I'm about 4x better on image quality, and 2-3x faster now. Still really far behind both of you here: on my puny i3-7100U @ 2.40GHz it's about 20 seconds for the AllRGB balloon.png:
Act. scan 0.9101s map 0.0843s dither 13.5866s emit 5.8524s
Act. scan 1.0277s map 0.0711s dither 13.4412s emit 5.5109s
Act. scan 0.8645s map 0.0661s dither 13.4089s emit 6.8736s
Act. scan 0.9389s map 0.0603s dither 13.4553s emit 6.8573s
Act. scan 0.9412s map 0.0561s dither 13.4151s emit 6.7967s
Act. scan 0.8742s map 0.0650s dither 13.4132s emit 6.8243s
That appears to be after the JIT has done what it can. On smaller images, you can see it stepping in and the resulting speedup:
Act. scan 0.1206s map 0.0368s dither 0.7361s emit 1.0289s --> 1.9 s
Act. scan 0.0562s map 0.0137s dither 0.6473s emit 0.5490s
Act. scan 0.0698s map 0.0171s dither 0.6141s emit 0.4027s
Act. scan 0.0867s map 0.0147s dither 0.6279s emit 0.4538s
Act. scan 0.0868s map 0.0148s dither 0.6222s emit 0.4592s
Act. scan 0.0566s map 0.0085s dither 0.6174s emit 0.4333s
Act. scan 0.0532s map 0.0090s dither 0.6511s emit 0.4191s
Act. scan 0.0531s map 0.0077s dither 0.6112s emit 0.4105s
Act. scan 0.0650s map 0.0077s dither 0.6112s emit 0.4499s
Act. scan 0.0859s map 0.0090s dither 0.6146s emit 0.4072s
Act. scan 0.0643s map 0.0077s dither 0.6115s emit 0.4268s --> 1.1 s
So I think that's like 1% of your speed? 🤣🤣
You mentioned sixel in the TODO. However, I didn't realize how well sixel performs in the terminal until I saw this: https://github.com/hackerb9/lsix . It's a simple bash script, and it might provide some inspiration?