nigeltao / qoi2-bikeshed

"Quite OK Image" version 2 discussions

Applying QOI for video #37

Open cppietime opened 2 years ago

cppietime commented 2 years ago

Given the simplicity of the QOI format, I think it might be useful as a basis for some kind of video codec, for example, to output video files from a C program without excessive complexity. Say I wanted to write a C program that generates a series of images and produces a video where each image is a frame. I could either output each frame as an image file and use a tool like FFMPEG to combine them, or try to use existing codecs/formats in my program relying either on external libraries or much more complicated algorithms. If a video format based on QOI existed, it may not be so difficult to simply output the video itself directly from my program without all the complexities of more sophisticated formats, so the work could be done quickly at the expense of larger (but hopefully smaller than raw) file sizes.

That being said, storing one roughly-PNG-sized image per frame will quickly take a lot of space, so any tricks that can reduce size, including possibly lossy compression, can be useful. I am trying to think of what kind of methods could be employed to save space for sequences of frames. Using YUV color space downsampling to YUV 4:2:0 cuts the raw data in half, and storing this in the existing QOI format seems trivial: just store each 4 pixels' worth of YYU-YYV instead of the canonical 2 pixels' worth of RGB-RGB. Or perhaps, if alpha is not being stored in the video, 8 pixels can be stored as YYYY-YYYY-UUVV (or any other permutation), taking up the space of only 3 four-channel pixels. If something like this is used, perhaps there should be separate caches for chrominance and luminance. It might prove kind of tricky, since each entry in the cache will represent data across multiple pixels instead of just one.
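As a rough illustration (not any kind of spec), reducing one 2x2 RGB block to the six 4:2:0 samples could look something like this, using the usual BT.601-style weights; the function name and sample layout are just placeholders:

```c
#include <stdint.h>

static uint8_t clamp_u8(double x) {
    return (uint8_t)(x < 0.0 ? 0.0 : (x > 255.0 ? 255.0 : x));
}

/* Reduce one 2x2 block of RGB pixels to six 4:2:0 samples: four Y values
 * plus one U and one V averaged over the block.  BT.601-style weights;
 * the exact matrix and rounding are illustrative only. */
static void rgb_block_to_yuv420(const uint8_t rgb[4][3], uint8_t yuv[6]) {
    double u_sum = 0.0, v_sum = 0.0;
    for (int i = 0; i < 4; i++) {
        double r = rgb[i][0], g = rgb[i][1], b = rgb[i][2];
        yuv[i] = clamp_u8(0.299 * r + 0.587 * g + 0.114 * b + 0.5);
        u_sum += -0.169 * r - 0.331 * g + 0.500 * b + 128.0;
        v_sum +=  0.500 * r - 0.419 * g - 0.081 * b + 128.0;
    }
    yuv[4] = clamp_u8(u_sum / 4.0 + 0.5); /* shared U for the block */
    yuv[5] = clamp_u8(v_sum / 4.0 + 0.5); /* shared V for the block */
}
```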

Then, I'm wondering if there's any good way of adding some kind of quantization to further reduce image size. For example, when hashing pixels to store in the cache, discard N low bits of each component; then, instead of checking that the existing entry in the cache matches the current pixel exactly, check whether it is within some tolerance threshold and, if so, treat it as a cache hit. A tolerance threshold could also be added for the RLE.
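Roughly, the lossy cache lookup I'm imagining might look like the following sketch; QOI_N, TOL and the reuse of QOI's index-hash multipliers are illustrative choices, not anything final:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint8_t r, g, b, a; } qoi_px;

#define QOI_N 2  /* low bits discarded per channel before hashing */
#define TOL   2  /* per-channel tolerance for treating a lookup as a hit */

/* Quantized variant of QOI's index hash: drop the low QOI_N bits first. */
static int lossy_index(qoi_px p) {
    return ((p.r >> QOI_N) * 3 + (p.g >> QOI_N) * 5 +
            (p.b >> QOI_N) * 7 + (p.a >> QOI_N) * 11) % 64;
}

/* Treat the cached entry as a hit if every channel is within +-TOL. */
static int within_tolerance(qoi_px a, qoi_px b) {
    return abs(a.r - b.r) <= TOL && abs(a.g - b.g) <= TOL &&
           abs(a.b - b.b) <= TOL && abs(a.a - b.a) <= TOL;
}
```

The encoder would emit an index op when `within_tolerance(cache[lossy_index(px)], px)` holds, and the same test could gate whether a run is allowed to continue.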

I also saw a suggestion in another issue of allowing the DIFF opcodes to be scaled, to allow them to cover a greater range at the expense of precision.

Then, if this is used for a video of sequential frames, perhaps there is a way to exploit that to produce inter-frames that compress even smaller. The simplest way would just be to subtract the previous frame from the current, pixel by pixel. Maybe there's a way to use motion compensation, but I don't think I have nearly a good enough idea of how to do that on my own to make any claims. Maybe something using multi-pixel macro-blocks, but I am hoping for ideas and a discussion as to what might work. When I get the chance, I'll try to do some testing to get actual results for the ideas I've had above.
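For the pixel-by-pixel subtraction, a sketch might be as simple as the following; whether the delta image then goes through the normal QOI encoder unchanged is an open question:

```c
#include <stddef.h>
#include <stdint.h>

/* Produce a delta frame: per-byte difference from the previous frame,
 * wrapping mod 256 so the operation is exactly invertible on decode.
 * Unchanged regions become runs of zeros, which QOI's RLE op handles well. */
static void frame_delta(const uint8_t *prev, const uint8_t *cur,
                        uint8_t *out, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; i++) {
        out[i] = (uint8_t)(cur[i] - prev[i]);
    }
}

/* Decoding reverses it: cur[i] = (uint8_t)(delta[i] + prev[i]). */
```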

oscardssmith commented 2 years ago

I doubt that a qoi-like technique will work for video compression. Video formats like H264/AV1 throw away so much data that it's easy to forget how big video would be with minimal compression. I think the optimal "simple but good enough" video codec would be something like noise filter + basis transform (maybe Fourier, maybe something else) + entropy encoding.

nigeltao commented 2 years ago

@oscardssmith

I doubt that a qoi-like technique will work for video compression

I am similarly skeptical but...

@cppietime

I'll try to do some testing to get actual results for the ideas I've had above

...I'm happy to be proven wrong by actual results.

cppietime commented 2 years ago

So far, it seems like some of these methods can reduce the size of the QOI, but some of them also greatly impact image quality. Of particular note: for my minimal tests so far, a "run tolerance" of +-1 to 2 shrinks a QOI that would otherwise be larger than a PNG to smaller than that same PNG, without any obvious artifacting, but I need more images to test on.

cppietime commented 2 years ago

Testing discarding low bits for hash comparison or difference/luma encoding, while saving space, results in pretty severe artifacts even at just 1 bit. On the other hand, increasing the run tolerance is not so detrimental. For the attached photograph, there is the base PNG, and the PNG resulting from a lossy compression with a run tolerance of +-4, converted to QOI then back to PNG (I can't attach QOI files directly here). (attachments: photo_original, photo_min) The original PNG is 1,131,619 bytes. The lossless QOI is 212,441 bytes, and the lossy one is 122,724 bytes. I did a test of a computer-generated image as well, with more regions of contiguous space. An original 206 kB PNG became a 212,441 byte lossless QOI, and a 146,822 byte lossy QOI with a run tolerance of +-2.

EDIT: By the by, a JPEG produced by ffmpeg with -q:v 2 from the original PNG photograph is 195,884 bytes. With default quality settings, the JPEG is 85,976 bytes.

nigeltao commented 2 years ago

An original 206 kB PNG became a 212,441 byte lossless QOI, and a 146,822 byte lossy QOI with a run tolerance of +-2.

When you say +-2, is that per channel or is that a total tolerance across all four RGBA channels?

The previous comment attached a photographic (non-artificial) test image. On screenshots with gradients, similar to https://github.com/nigeltao/qoi2-bikeshed/issues/21#issuecomment-983257987 , how obvious (if at all) is any banding?

cppietime commented 2 years ago

When you say +-2, is that per channel or is that a total tolerance across all four RGBA channels?

It's per channel. The original PNG is 196 kB, and the lossless QOI is 208 kB. (attachment: screenshot)

A +-1 tolerance per channel knocks it down to 171 kB without anything immediately visibly different. (attachment: screenshot1)

At +-2 I still can't see any obvious difference, and it's 144 kB. (attachment: screenshot2)

At +-4 there's banding, but not horribly, at 120 kB. (attachment: screenshot3)

chocolate42 commented 2 years ago

Say I wanted to write a C program that generates a series of images and produces a video where each image is a frame. I could either output each frame as an image file and use a tool like FFMPEG to combine them, or try to use existing codecs/formats in my program relying either on external libraries or much more complicated algorithms.

y4m is the standard raw format for lossless YCbCr video; if you can generate YCbCr, y4m is the way to go IMO, as most things can ingest it ( https://wiki.multimedia.cx/index.php/YUV4MPEG2 ). You could then use a library directly in your binary or call ffmpeg/whatever externally; done this way, a qoi-like video format could be implemented once as a library that ingests y4m and/or raw RGB. RGB and HDR don't appear to have standard formats, but I could be wrong; Xiph have examples stored as PNG/TIFF: https://media.xiph.org/
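A minimal 4:2:0 y4m writer is only a few lines; in this sketch the F25:1, A1:1 and C420jpeg tags are example choices rather than requirements:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal YUV4MPEG2 (y4m) writer for 4:2:0 planar frames. */
static void y4m_write_header(FILE *f, int w, int h) {
    fprintf(f, "YUV4MPEG2 W%d H%d F25:1 Ip A1:1 C420jpeg\n", w, h);
}

static void y4m_write_frame(FILE *f, int w, int h, const uint8_t *y,
                            const uint8_t *u, const uint8_t *v) {
    fprintf(f, "FRAME\n");
    fwrite(y, 1, (size_t)w * h, f);              /* full-resolution luma  */
    fwrite(u, 1, (size_t)(w / 2) * (h / 2), f);  /* quarter-size chroma U */
    fwrite(v, 1, (size_t)(w / 2) * (h / 2), f);  /* quarter-size chroma V */
}
```

The output can then be fed straight into ffmpeg or similar for further compression.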

That being said, storing one roughly-PNG sized image per frame will quickly take a lot of space

Huffyuv/lagarith are old, FFV1 appears to be the leading open format (which looks partially qoi-like), lossless x264/x265 exist; there may be new kids on the block, but these are the main codecs to benchmark against (lossless JPEG XL images are the definitive practical intra-frame size benchmark, and AV1 may have a lossless mode: http://forum.doom9.org/forumdisplay.php?f=54 ). They all take up a lot of space, as you'd expect. Sticking to a purely lossless format is the only way a qoi-like should exist IMO; existing lossy is smart and fast despite being complicated, thanks to the incredible amounts of dev time poured into it, so there's no way a qoi-like can hope to compete. That said, pngquant looks to be a good way to make a lossless image codec lossy, and there may be a way to apply it: https://pngquant.org/

Then, if this is used for a video of sequential frames, perhaps there is a way to exploit this to produce inter-frames that compress to even smaller. The simplest way would just be to subtract the previous frame from the current, pixel by pixel.

cppietime commented 2 years ago

y4m is the standard raw format for lossless YCbCr video; if you can generate YCbCr, y4m is the way to go

The two things I've been able to make work at all so far in my projects are y4m and MJPEG, the latter of which requires a lot of dependency code for the actual compression. If I were able to find or develop a middle ground that is as simple to implement and fast to encode as y4m, but not quite as massive, that would be great.

You could then use a library directly in your binary or call ffmpeg/whatever externally; done this way, a qoi-like video format could be implemented once as a library that ingests y4m and/or raw RGB.

What do you mean by a qoi-like-format that ingests y4m or RGB? Just encoding each frame in QOI, or something like encoding the entire continuous video stream as a single QOI "image"?

Huffyuv/lagarith are old, FFV1 appears to be the leading open format (which looks partially qoi-like), lossless x264/x265 exist

I can't for the life of me find specification on the format of x264/5, at least not the actual data representation I would need for an implementation. I will look into FFV1 and Huffyuv, although I'll admit QOI appealed to me when I saw it on account of (1) not needing to construct a Huffman code for each image, and (2) staying byte-aligned.

  • A QOI bitstream is compressible so you could simply apply entropy-coding to a GOP, the bigger the GOP the better the compression (if entropy coding is considered appropriate for a qoi-like at all)

What's GOP stand for?

chocolate42 commented 2 years ago

What do you mean by a qoi-like-format that ingests y4m or RGB?

Just suggesting that creating a library of the codec that ingests a standard format is just as easy as mixing the implementation with the generation code.

What's GOP stand for?

Group of pictures: roughly speaking, a set of contiguous frames (typically up to a few dozen) that can refer to each other and so are decoded as a group. An intra-frame codec that encodes frames separately has a GOP of one, i.e. all I-frames that can only refer to themselves.

When I said "apply entropy-coding to a GOP" I meant encode each frame separately as qoi images then group frames together and entropy code them, it's the simplest way I can think of to add some inter compression to a mainly intra codec.

When I said "Cache context could apply to a GOP", I meant that a GOP of a few dozen frames could essentially be treated as a single QOI image. Some overhead saved and the cache not resetting every frame should provide a little benefit.

Just encoding each frame in QOI, or something like encoding the entire continuous video stream as a single QOI "image"?

Whatever you choose to do. The main problem with a "single giant image" is that the format needs to be read from start to finish, as state is important; i.e. it's a single GOP containing every frame, which is not great (no seeking, no resilience to bitstream errors).

I can't for the life of me find specification on the format of x264/5

Even if you found it you'd never implement it; it's far too complicated. If you use any of the existing codecs, just use an existing library or an external binary like ffmpeg.

I'm by far not an authority on the topic, doom9 is the place to learn.

cppietime commented 2 years ago

Huffyuv/lagarith are old ... these are the main codecs to benchmark against

Plain old lossless QOI is actually only a few percent larger than full-res RGB huffyuv for photographic images, and is smaller, sometimes by a lot, for cartoon or screenshot-type images. And this is using custom Huffman codes created to optimize compression for the image in question, whereas just running ffmpeg -i src.ppm -c:v huffyuv -pix_fmt rgb24 is several dozen to several hundred kB larger than my huffyuv output, at least in some tests. In others, the ffmpeg output is smaller; I'm still not sure why that is.

For my above screenshot test, a single huffyuv frame was 1,134,958 B, while the lossless QOI is 391,880 B. For the photograph I posted above, the respective comparison is 1,160,299 B to 1,197,556 B. (ffmpeg produced single-frame AVI files of 1,837,688 and 981,068 B respectively)

On further consideration, the reason FFMPEG sometimes produces a smaller file is probably because it uses a different predictor; I only programmed one.

cppietime commented 2 years ago

In a brief test, just encoding each frame with QOI got a video down to about 38% of its uncompressed raw size. However, when I tried it on the same video in YUV420 colorspace using the method I describe below, it only reduced to 82% of the YUV420 raw data, which corresponds to 41% of the original RGB video data.

For YUV420, each block of 2x2 pixels is treated as one 6-channel pixel by QOI. Caching and runs are handled exactly the same way. If the differences between the current sample's U and V channels and those of the previous sample can each be expressed in 3 bits, OP_DIFF or OP_LUMA is used to encode the pixel, depending on whether all 4 Y samples are the same or not. If they are the same, an OP_DIFF byte carrying the U and V differences is written, then the common Y value is written in the next byte. If the Y values differ, the same kind of first byte is written (the OP_LUMA case), but then 4 bytes, one for each Y, are written. Otherwise, the pixel is written without difference encoding: OP_RGB followed by Y, U and V if all 4 Y are the same, otherwise OP_RGBA followed by Y Y Y Y U V.
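For illustration, the per-block decision (ignoring the cache and run handling) looks roughly like this sketch; the tag byte values just echo QOI's DIFF/LUMA/RGB/RGBA tags and aren't a spec:

```c
#include <stddef.h>
#include <stdint.h>

/* One 2x2 block treated as a six-channel "pixel": four Y samples plus the
 * shared U and V. */
typedef struct { uint8_t y[4], u, v; } yuv_block;

/* Encode one block relative to the previous one into `out` (room for up
 * to 7 bytes); returns the number of bytes written.  Tag bytes 0x40 /
 * 0x80 / 0xFE / 0xFF are placeholders echoing QOI's own op tags. */
static size_t encode_block(const yuv_block *prev, const yuv_block *cur,
                           uint8_t *out) {
    int du = cur->u - prev->u;
    int dv = cur->v - prev->v;
    int flat = cur->y[0] == cur->y[1] && cur->y[1] == cur->y[2] &&
               cur->y[2] == cur->y[3];
    size_t n = 0;

    if (du >= -4 && du <= 3 && dv >= -4 && dv <= 3) {
        /* Chroma deltas fit in 3 bits each; pack them into the tag byte. */
        uint8_t deltas = (uint8_t)(((du + 4) << 3) | (dv + 4));
        if (flat) {
            out[n++] = (uint8_t)(0x40 | deltas); /* DIFF-style: one shared Y */
            out[n++] = cur->y[0];
        } else {
            out[n++] = (uint8_t)(0x80 | deltas); /* LUMA-style: four Y values */
            for (int i = 0; i < 4; i++) out[n++] = cur->y[i];
        }
    } else if (flat) {
        out[n++] = 0xFE;                         /* RGB-style escape: Y U V */
        out[n++] = cur->y[0];
        out[n++] = cur->u;
        out[n++] = cur->v;
    } else {
        out[n++] = 0xFF;                         /* RGBA-style escape: Y Y Y Y U V */
        for (int i = 0; i < 4; i++) out[n++] = cur->y[i];
        out[n++] = cur->u;
        out[n++] = cur->v;
    }
    return n;
}
```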

chocolate42 commented 2 years ago

Much of QOI's benefit comes from implicitly doing a YUV-like transform with the LUMA ops: storing R/B relative to G and giving G more bits has a similar effect if you squint. Starting with YUV loses that benefit, as does throwing away half of the data to begin with.

If you release a YUV qoi-like I might be tempted to compete, time permitting.

AZMCode commented 2 years ago

@chocolate42

(no seeking, no resilience to bitstream errors)

The seeking and resilience issues might be partially helped by some of the things people on the Alternative Traversal thread seem to be doing, namely chunking the "image". A (width × height × max(width,height))-sized chunk might work, each containing max(width,height) frames, and each chunk independent of the others.

AZMCode commented 2 years ago

The other thing to do might be to just slap some bit correction on top of the stream to help with resilience within these chunks.