nigeltao / qoi2-bikeshed

"Quite OK Image" version 2 discussions

LEA compressor #36

Open Piqlet opened 2 years ago

Piqlet commented 2 years ago

https://encode.su/threads/3818-LEA-Lossless-image-compressor

chocolate42 commented 2 years ago

Nice, I figured there's a good community for experimental compressors but never looked for it. Any chance for the source of LEA, or at least a Linux binary?

Piqlet commented 2 years ago

I don't know; I mostly just read that forum. You could try running the exe files under Wine. It worked for me with ffmpeg.exe, in a terminal of course: wine ffmpeg.exe -i video.file ...

MarcioPais commented 2 years ago

Hi, I'm the author of LEA. I've just updated it to v0.3, which brings some nice speed improvements. It should work fine under Wine.

Its purpose is diametrically opposite to that of QOI, though: it seeks to be Pareto-optimal at reasonably high compression ratios.

I highly recommend joining the community at the encode.su forum if anyone is interested in data compression.

dumblob commented 2 years ago

@MarcioPais that's awesome news, thanks! Do you provide the source code anywhere (I couldn't find it anywhere)? What's the license?

chocolate42 commented 2 years ago

Wine is fine as a last resort, but I'm getting some questionable (though repeatable) results; fair comparisons seem to be hindered by the lack of a native binary. Here are some results with LEA v0.3 and qic 1.demo under Wine:

LPCB

skylake laptop part single thread
[p24rle0up0] compress 14 decompress 12 size 1870508243
[qic] compress 23 decompress 23 size 1285421855
[LEA v0.3] compress 526 decompress 554 size 1032030959

zen2 laptop part single thread
[p24rle0up0] compress 11 decompress 9 size 1870508243
[qic] compress 41 decompress 47 size 1285421855
[LEA v0.3] compress 601 decompress 589 size 1032030959

The zen2 part is stronger, so it makes sense that my code runs quicker on it. qic and LEA should run quicker on it too, but don't, with qic being much slower. Wine overhead should be minimal and should hit both sets of hardware about the same; I've never seen a massive difference like this before. There are Wine version differences between the machines, which I'm looking into; otherwise the zen2 hardware is stronger in pretty much every respect. The skylake part is closer to the hardware that qic was (probably heavily) optimised for, which may be part of it; the qic skylake result seems about right.

Its purpose is diametrically opposite to that of QOI, though: it seeks to be Pareto-optimal at reasonably high compression ratios.

QOI is weak by compression standards, but a variant may have some utility as a preprocessor to a generic compressor. Nothing mind-blowing, but these early results suggest some refinement could snag a red highlight on the LPCB chart below QIC:

LPCB
[zstd -1] compress 9 decompress 6 size 2612952424
[p24rle0up1, zstd -1] compress 17 decompress 14 size 1443187543
[p24rle1up1, zstd -1] compress 24 decompress 16 size 1438522963
[qic] compress 22 decompress 24 size 1285421855
[LEA v0.3] compress 526 decompress 554 size 1032030959

edit: An older Wine version on the zen2 part was the issue; what a difference a year makes.

MarcioPais commented 2 years ago

@dumblob LEA is closed-source, at least for now. It's just a prototype showing a really simple technique for achieving very high compression ratios on photographic images, easily beating JPEG XL even at its best. Encoder + decoder are slightly less than 500 LOC of C++.

@chocolate42 Have you tested the new encoders from Luca Versari, FJXL and FPNGE? They claim to be Pareto-optimal for encoding speed. I doubt any variant of QOI combined with a postcoder can beat them.

chocolate42 commented 2 years ago

Have you tested the new encoders from Luca Versari, FJXL and FPNGE?

[qic]           compress 18 decompress 17 size 1285421855
[fjxl]          compress  8               size 1402397674
[t24, zstd -1]  compress 17 decompress  9 size 1440641690
[p24, zstd -1]  compress 16 decompress 11 size 1443187003
[fpnge]         compress 33               size 1495756024
[p24]           compress 11 decompress 10 size 1734016615
[zstd -1]       compress  8 decompress  6 size 2612952424
[t24]           compress  5 decompress  5 size 3462571880
[cp]                      2               size 3462571880

t24 just does a lossless transform; p24 packs the transformed pixels into QOI-like delta ops of 1/2/3/4 bytes per pixel. p24 compresses slightly worse than t24 but feeds the entropy coder much less data. The p24 packing appears to be a good fit for roughly two thirds of the LPCB images, so a second stage that finds a better fit for the rest may improve efficiency (if fjxl's efficiency can be beaten, we're then competing with qic's speed for a red highlight). Alternatively, somehow more than double the speed (probably with vector instructions) for a red highlight below fjxl.
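For illustration, the 1/2/3/4-byte op selection described above might look like the following sketch. The bit budgets (a 2/2/2 diff, a QOI_OP_LUMA-style 4/6/4 split, a wider luma split, raw) are assumptions for illustration, not p24's actual layout:

```c
/* Hypothetical sketch of p24-style op selection: classify a pixel's
 * per-channel delta from the previous pixel into the smallest of four
 * op sizes, in the spirit of QOI's OP_DIFF/OP_LUMA. The exact bit
 * budgets below are illustrative assumptions, not p24's real layout. */
static int p24_op_bytes(int dr, int dg, int db) {
    if (dr >= -2 && dr <= 1 && dg >= -2 && dg <= 1 && db >= -2 && db <= 1)
        return 1;                     /* 2-bit delta per channel */
    if (dg >= -32 && dg <= 31 &&
        dr - dg >= -8 && dr - dg <= 7 && db - dg >= -8 && db - dg <= 7)
        return 2;                     /* QOI_OP_LUMA-style 4/6/4 split */
    if (dr - dg >= -64 && dr - dg <= 63 && db - dg >= -64 && db - dg <= 63)
        return 3;                     /* wider luma split, full 8-bit dg */
    return 4;                         /* raw 8-bit r, g, b */
}
```

The point of the luma-style splits is that dr and db correlate strongly with dg on photographic content, so coding them relative to dg keeps them small.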

Nothing innovative, but it's an interesting exercise in optimising basic memory management if nothing else. Any fast generic compressors worth trying to replace zstd?

MarcioPais commented 2 years ago

As I've said, no generic postcoder is going to help; at the particular zone of the Pareto frontier that you're targeting, you just need a good SIMD decorrelator followed by a SIMD entropy coder.

I see you've joined the forum. You should probably make a new thread there for some crowdsourced ideas, as this is probably getting a bit off-topic with regard to the purpose of this repo. QOI is interesting not for being particularly good, but for its simplicity: it's a nice way to start learning about data compression.

chocolate42 commented 2 years ago

As I've said, no generic postcoder is going to help; at the particular zone of the Pareto frontier that you're targeting, you just need a good SIMD decorrelator followed by a SIMD entropy coder.

Makes sense. I thought a postcoder was picking up the slack from a weak decorrelator, but after checking (what I believe is) the optimal entropy coding, handling each component as a separate stream, the existing preproc + entropy coding can beat fjxl on size in theory, assuming the maths holds up. ANS should do it, but I'm not there yet.

I see you've joined the forum. You should probably make a new thread there for some crowdsourced ideas, as this is probably getting a bit off-topic with regard to the purpose of this repo. QOI is interesting not for being particularly good, but for its simplicity: it's a nice way to start learning about data compression.

I'll do that when I've learned a bit more and have a chance of not coming off as a complete crank. There is a little juice left in a QOI-like preproc, but probably not enough to beat qic on size.

MarcioPais commented 2 years ago

I'll do that when I've learned a bit more and have a chance of not coming off as a complete crank. There is a little juice left in a QOI-like preproc, but probably not enough to beat qic on size.

You seem to already have a pretty good grasp of this, I'm sure people there will appreciate your efforts.

Beating QIC on size isn't the problem; doing it at about the same order of speed with a QOI-like approach is.

From a quick glance at the issues posted here and on the original repo, it seems no one was even trying simple context modelling. Something like:

qoi_rgba_t index[8 * 64];
...
index_pos = ((px_prev.rgba.r + px_prev.rgba.g) & 0x1C0) | (QOI_COLOR_HASH(px) & 0x3F);

i.e., using a fast pseudo-luminance approximation to select which hash table to use. You'd need to run some tests on a large, representative test set to find the best quantization, so as not to dilute the contexts, but it will almost surely provide much better hit rates. And if you can get acceptable hit rates with only 32 entries per table, you get more freedom to design the opcodes.
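Fleshed out, that contexted index might look like the sketch below. QOI_COLOR_HASH is QOI's actual hash (r*3 + g*5 + b*7 + a*11); the struct is a simplified stand-in for qoi_rgba_t:

```c
#include <stdint.h>

/* Sketch of the context-modelled index from the snippet above.
 * QOI_COLOR_HASH is QOI's real hash; rgba_t stands in for qoi_rgba_t. */
typedef struct { uint8_t r, g, b, a; } rgba_t;

#define QOI_COLOR_HASH(C) ((C).r * 3 + (C).g * 5 + (C).b * 7 + (C).a * 11)

static rgba_t index_table[8 * 64]; /* 8 contexts x 64 entries */

/* (r + g) of the previous pixel is in [0, 510]; masking with 0x1C0
 * keeps bits 6..8, i.e. one of 8 pseudo-luminance buckets
 * (0, 64, 128, ..., 448), each selecting its own 64-entry table. */
static int index_pos(rgba_t px_prev, rgba_t px) {
    return ((px_prev.r + px_prev.g) & 0x1C0) | (QOI_COLOR_HASH(px) & 0x3F);
}
```

The bucket comes from the previous pixel, so the decoder can derive the same context without any side information.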

chocolate42 commented 2 years ago

Got a basic ANS working (using ryg_rans), which confirms that a simple QOI-like codec can beat fjxl on size (p24ans); it's ~40% slower than QIC but has had no optimisation. 3 QOI-like ops (232b, 464, 888), with a stream per component plus a control stream for the opcodes, make 10 streams (only 4 are used per pixel; the streams are for efficiency, not speed).

Now that byte alignment is out of the window and direct entropy coding is used, all the QOI-like encoding is really doing is partitioning the pixels so that more predictable pixels get more efficient entropy coding. We'd be better off discarding QOI-like ops in favour of a better partitioning scheme. Having one g stream, 9 r streams and 9 b streams (partitioned on the bit depth of g) gets us to ~1307 MB in theory (g24ans). Adding RLE (which requires an additional run stream and a control stream) gets us to ~1284 MB in theory (g24ans_rle), possibly beating QIC on size depending on how close to optimal the ANS is. An adaptive (per-image) partitioning scheme is worth exploring but may be expensive.
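My reading of "partitioned based on the bit depth of g" is: the significant-bit count of the g residual picks which of the 9 r/b streams the matching residuals go to. A sketch of that (an assumption about the scheme, not its actual code):

```c
#include <stdint.h>

/* Hypothetical sketch of the g-based partition: the number of
 * significant bits in the g residual (0..8 for a byte) selects which
 * of the 9 r/b streams the matching r and b residuals go to, on the
 * assumption that pixels with small g deltas tend to have small r/b
 * deltas too, so each stream is more skewed and cheaper to code. */
static int g_bucket(uint8_t g_residual) {
    int bits = 0;
    while (g_residual) {
        bits++;
        g_residual >>= 1;
    }
    return bits; /* 0..8, i.e. one of 9 streams */
}
```
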

[g24ans_rle] est. from entropy limit   size ~1284000000
[qic]        compress 18 decompress 17 size  1285421855
[g24ans]     est. from entropy limit   size ~1307000000
[p24ans]     compress 25 decompress 19 size  1353986171
[fjxl]       compress 8                size  1402397674

From a quick glance at the issues posted here and on the original repo, it seems no one was even trying simple context modelling.

In the interest of having something that should be easier to SIMD, I've avoided index ops entirely for now. If and when a more efficient encoder is attempted, thank you for that rather large hint.

Once g24* has been implemented and confirmed I'll probably make a thread on the forum.

edit: Unfortunately, ANS (or at least the way I'm using it) doesn't come close enough to the entropy limit for g24rle to beat qic:

[qic]     compress 18 decompress 17 size 1285421855
[g24rle]  compress 27 decompress 18 size 1295987792
[fjxl]    compress 9                size 1402397674

g24rle does two passes: one to gather frequency stats, and one to statically encode with ANS. Adaptive ANS might have better efficiency and might also be quicker, as it would eliminate a pass; TODO.