openclimatefix / Satip

Satip contains the code necessary for retrieving, transforming and storing EUMETSAT data
https://satip.readthedocs.io/
MIT License

Encode sequences of satellite images using modern video compression like AV1 #45

Closed JackKelly closed 2 years ago

JackKelly commented 2 years ago

Video compression has developed a lot over recent years (driven by Netflix etc.)

Our sequences of satellite images and NWPs can be considered video sequences. There's lots of redundant information across frames. So, if we wanted to squish the data down as much as possible (e.g. for sharing with students; or for regularly sending to Lancium; or just for archiving many years of data without breaking the bank) then we might want to consider using video compression like AV1 to compress our satellite data and/or NWP data.

ffmpeg supports AV1 encoding, including lossless, 10-bit, and 12-bit

And ffmpeg-python supports moving data between numpy arrays and ffmpeg.
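As a sketch of what that pipeline could look like, the function below builds an ffmpeg command line for piping raw grayscale frames (e.g. dumped from a numpy array with `.tobytes()`) into a lossless AV1 encode. This is an assumption on my part, not tested against real Satip data: the `gray10le`/`gray12le` pixel formats and the `-crf 0` / `-b:v 0` lossless combination come from ffmpeg's libaom-av1 wrapper docs and are worth verifying.

```python
def av1_encode_cmd(width, height, out_path, bit_depth=10, fps=12):
    """Build an ffmpeg argv for piping raw grayscale frames on stdin
    into a lossless AV1 file. Assumes ffmpeg was built with libaom-av1;
    gray10le/gray12le carry 10/12-bit single-channel little-endian data."""
    pix_fmt = {8: "gray", 10: "gray10le", 12: "gray12le"}[bit_depth]
    return [
        "ffmpeg",
        "-f", "rawvideo",           # headerless frames on stdin
        "-pix_fmt", pix_fmt,
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "pipe:",              # read frames from stdin
        "-c:v", "libaom-av1",
        "-b:v", "0",                # constant-quality mode
        "-crf", "0",                # crf 0 == lossless for libaom-av1
        str(out_path),
    ]
```

The argv can then be handed to `subprocess.Popen` with `stdin=PIPE`, writing each frame as e.g. `frame.astype("<u2").tobytes()` for 10/12-bit data (ffmpeg-python wraps essentially the same command construction).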

If we really wanted to, we could probably write a numcodecs-like compression library to allow us to use ffmpeg to compress stuff, and still save into NetCDF / Zarr.

In terms of pre-prepared batches, it may be far easier to save each example as a standard video file (rather than trying to use AV1 within NetCDF... e.g. save as a sequence of TIFFs, and then ask ffmpeg to convert those TIFFs to a video file compressed using AV1). Which we can do now that we're saving each modality separately :)

This is not a priority, of course!

Twitter discussion.

JackKelly commented 2 years ago

If we want to use video compression in our Zarrs then we might have to use Zarr chunks which span multiple timesteps.

If, instead, we want to compress each timestep independently, then AVIF might be worth looking at (which uses AV1 compression for still images).

JackKelly commented 2 years ago

Copying a Slack conversation @jacobbieker and I just had...

Jacob:

[For full-geospatial-extent Zarrs] It seems to be using around 44 MB on average per timestep for HRV and non-HRV (30 MB for non-HRV, and 14 MB for HRV), which ends up being around 380 GB per month, or 4.1 TB per year, because for 1 month of each year the RSS is shut down. So quite a bit more data. So might be worth checking out higher compression

Me:

Cool, thanks, sounds good! TBH, my guess is that (slightly) lossy compression might be fine. Although, first, it'd be great to see how well "modern" lossless compression works. AV1 video compression might be interesting, although I think we'd then have to use Zarr chunks which span multiple timesteps.

Jacob:

Yeah, we could try maybe with 3 timesteps? It would limit the downside of loading lots of frames, but could still compress quite a bit? The downside with saving it that way is that I think I'd then also need to do the data processing in chunks, rather than each timestep being separate as it is now

Me:

ah, good point, I'd forgotten about that! hmmm... I'm really split... on the one hand, video compression should result in much smaller files (because consecutive frames are pretty similar). But, it also sounds like it might be a fair amount of work! As a quick and hacky test, it might be worth manually outputting a handful of frames as TIFFs, and then using ffmpeg to encode those TIFFs as a lossless AV1 video file, and seeing what the compression ratio is like?

Jacob:

Yeah, I think benchmarking some more image compression options first might be the easiest to try. Otherwise, I can try doing it in chunks. If we can reduce the filesize, even by 10%, we'd then save around 4 TB of storage or something like that
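The per-timestep and per-year figures Jacob quotes above are roughly self-consistent with the RSS's 5-minute imaging cadence; a quick sanity check (the 44 MB/timestep and the ~1 month/year RSS downtime are taken from the conversation, the 5-minute cadence is my assumption):

```python
# Sanity-check the storage figures quoted above, assuming the
# 5-minute Rapid Scan Service (RSS) cadence: 12 frames per hour.
MB_PER_TIMESTEP = 44                 # HRV + non-HRV combined
frames_per_day = 12 * 24             # one frame every 5 minutes
gb_per_month = frames_per_day * 30 * MB_PER_TIMESTEP / 1000
tb_per_year = gb_per_month * 11 / 1000   # RSS is down ~1 month/year

print(round(gb_per_month))       # ≈ 380 GB/month, matching the quote
print(round(tb_per_year, 1))     # ≈ 4.2 TB/year, close to the quoted 4.1
```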

JackKelly commented 2 years ago

Yeah, so, I'd recommend trying AVIF first (the still-image version of AV1).

If AVIF is hard to implement, then slightly lossy 8-bit JPEGs might be worth a go. I don't know for sure, but I'm pretty sceptical that our models are currently benefiting from pristine 10-bit lossless imagery! :slightly_smiling_face:

JackKelly commented 2 years ago

The python library imagecodecs supports AVIF.

JackKelly commented 2 years ago

And here's a super-simple little python library (just 51 lines of code!) which enables jpeg-2000 compression in Zarr using imagecodecs. Maybe it'd be possible to use the same pattern, but for AVIF?

jacobbieker commented 2 years ago

Thanks! I'll try those out

JackKelly commented 2 years ago

Awesome, thanks! Right, I'll stop procrastinating tax-related tasks now... :slightly_smiling_face:

JackKelly commented 2 years ago

And here's a useful issue, including a short guide to creating codecs for Zarr: https://github.com/zarr-developers/numcodecs/issues/73
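The pattern from that guide is small: a codec exposes a `codec_id` plus `encode`/`decode` methods that map bytes-like buffers to bytes. Below is a stdlib-only sketch of that shape, using bz2 in place of AVIF so it runs anywhere; a real implementation would subclass `numcodecs.abc.Codec`, call `imagecodecs` for the actual AVIF encode/decode, and register itself with `numcodecs.register_codec` so Zarr can find it by `codec_id`.

```python
import bz2

class Bz2Codec:
    """Minimal numcodecs-style codec sketch: the same three-part shape
    (codec_id, encode, decode) an AVIF codec would need. For use with
    Zarr, subclass numcodecs.abc.Codec and register_codec(Bz2Codec)."""

    codec_id = "bz2-sketch"

    def __init__(self, level=5):
        self.level = level

    def encode(self, buf):
        # buf is a bytes-like chunk, e.g. ndarray.tobytes()
        return bz2.compress(bytes(buf), compresslevel=self.level)

    def decode(self, buf, out=None):
        data = bz2.decompress(bytes(buf))
        if out is not None:
            out[: len(data)] = data
            return out
        return data

codec = Bz2Codec(level=5)
raw = bytes(range(256)) * 64          # 16 KiB of repetitive dummy data
packed = codec.encode(raw)
assert codec.decode(packed) == raw    # lossless round-trip
```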

cgohlke commented 2 years ago

FWIW, the imagecodecs library includes numcodecs compatible codecs. Register with imagecodecs.numcodecs.register_codecs(). It's all work in progress but good enough to experiment with.

jacobbieker commented 2 years ago

Using jpeg2k, saving 3 channels as individual timesteps saves about 57% at level=100 compared to zstd. However, jpeg2k throws an error when encoding multiple timesteps.

jacobbieker commented 2 years ago

Using bz2 (which pbzip2 is based on) reduces the Zarrs by 23%, while being lossless and not needing any special handling.

jacobbieker commented 2 years ago

Higher levels for zstd didn't result in any real savings

jacobbieker commented 2 years ago

#47 might also be an easy win on compression. Tried AVIF, but it doesn't support monochrome images; will try with 3-channel images soon.
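One workaround, if the AVIF encoder only accepts 3-channel input, would be to pack three consecutive monochrome timesteps into the R, G and B channels of one image, and unpack after decoding. This is my assumption about the approach (not what Jacob actually implemented); a pure-Python sketch of the packing, which with numpy would just be `np.stack` on a channel axis:

```python
def pack_rgb(t0, t1, t2):
    """Pack three H x W monochrome frames (lists of rows) into one
    H x W 'RGB' frame of (r, g, b) tuples."""
    return [
        [(a, b, c) for a, b, c in zip(r0, r1, r2)]
        for r0, r1, r2 in zip(t0, t1, t2)
    ]

def unpack_rgb(rgb):
    """Invert pack_rgb: recover the three monochrome frames."""
    t0 = [[px[0] for px in row] for row in rgb]
    t1 = [[px[1] for px in row] for row in rgb]
    t2 = [[px[2] for px in row] for row in rgb]
    return t0, t1, t2

frames = ([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]])
assert unpack_rgb(pack_rgb(*frames)) == frames  # lossless round-trip
```

One caveat: the encoder would need chroma subsampling disabled (4:4:4) and lossless mode, otherwise the three timesteps would bleed into each other.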

jacobbieker commented 2 years ago

Fixing the dtype saves another 13% on the size when using bz2

jacobbieker commented 2 years ago

Fixing the dtype and using bz2 level 3 or 5 results in a 28% reduction in size compared to zstd
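Those levels are easy to probe directly with Python's stdlib `bz2` module. The snippet below uses a synthetic periodic payload (real satellite imagery will compress differently, and reproducing the zstd baseline would need the separately installed `zstandard` package, so that part is left out):

```python
import bz2

# Synthetic, highly compressible stand-in for an image chunk.
payload = bytes(i % 256 for i in range(1 << 14))  # 16 KiB ramp pattern

for level in (1, 3, 5, 9):
    size = len(bz2.compress(payload, compresslevel=level))
    print(f"level {level}: {size} bytes "
          f"({100 * (1 - size / len(payload)):.1f}% smaller)")
```

On data like this the levels land close together, which is consistent with level 3 and level 5 giving the same 28% figure above.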