[Closed] JackKelly closed this issue 2 years ago.
create a stand-alone little adapter library to adapt imagecodecs to be used with Zarr
Awesome, thanks so much for your help!
Some initial results compressing a single, full-resolution, full-geo-extent satellite image (5576x3560, 8 bits per pixel, grayscale: `HRV_MSG3-SEVI-MSG15-0100-NA-20190620105917_cropped.png`) using `cjxl --quality=100` on our data-processing VM on GCP (32-core EPYC with 128 GB of RAM):
effort | runtime | filesize (MB) |
---|---|---|
1 | 0.7 sec | 6.6 |
7 | 20 sec | 4.9 |
9 | 2min 52sec | 4.8 |
visually lossless, effort=9 | 2min 30sec | 1.5 |
visually lossless, effort=5 | 0.7 sec | 1.5 |
It's notable that `cjxl` spends the vast majority of its time using a single CPU core, which maybe suggests it'll play nicely with Dask (whereby multiple Zarr chunks can be compressed in parallel).
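That single-core behaviour suggests a chunk-parallel scheme. Here's a hypothetical sketch using a thread pool, with zlib as a stand-in compressor (the real pipeline would use JPEG-XL via imagecodecs; the point is only that independently compressed chunks parallelise):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    """Compress one Zarr-style chunk independently of the others."""
    return zlib.compress(chunk, level=9)

# Eight hypothetical chunks of raw image bytes.
chunks = [bytes(range(256)) * 1_000 for _ in range(8)]

# Each chunk is independent, so they can be compressed concurrently.
# (zlib releases the GIL while compressing; Dask applies the same idea
# across many workers.)
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(compress_chunk, chunks))
```

Dask's scheduler would do the fan-out automatically when each Zarr chunk is a separate task.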
"visually lossless" means --distance=1
. I should check to see just how lossless "visually lossless" is.
Some other conversions done using imagemagick:
algo | filesize (MB) |
---|---|
WebP lossless | 5.9 |
JPEG lossless | 6.8 |
JPEG 2k lossless | 5.7 |
WebP near_lossless=100 | 0.6 |
PNG | 6.7 |
AVIF lossless | 6.1 |
AVIF min=2 max=2 | 3.2 |
tiff.bz2 | 7.5 |
So, in conclusion, our hunch was correct: JPEG-XL can create the smallest lossless files, but it takes a while! Maybe that's OK if Dask can compress chunks in parallel. And maybe compute cost grows with, say, the square of the image size, so it might be way faster to compress multiple small chunks? (I'm totally guessing here!)
And it looks like JPEG-XL can create a file that's 64% of the size of a bz2-compressed file (7.5 MB for bz2 vs 4.8 MB for JXL).
TODO: Check if 'visually lossless' is good enough.
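One way to check is to compute MAE and PSNR between the original and round-tripped images (the metrics that appear in the tables later in this thread). A minimal numpy sketch:

```python
import numpy as np

def mae(original: np.ndarray, decompressed: np.ndarray) -> float:
    """Mean absolute error between two images."""
    return float(np.abs(original.astype(np.float64) - decompressed.astype(np.float64)).mean())

def psnr(original: np.ndarray, decompressed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = ((original.astype(np.float64) - decompressed.astype(np.float64)) ** 2).mean()
    if mse == 0:
        return float("inf")
    return float(10 * np.log10(max_value**2 / mse))
```

For identical images PSNR is infinite; roughly, each extra 6 dB halves the RMS error.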
Reading through the JPEG-XL code in imagecodecs, it looks like all we have to do is cast our `np.int16` images to `np.uint16` and pass the images in, one satellite channel at a time, and the code will figure out that it should use grayscale :slightly_smiling_face:
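That cast might look like the following (the tiny array is hypothetical; the key point is clipping negative sentinel values before reinterpreting as unsigned):

```python
import numpy as np

# Hypothetical single-channel satellite image: real values are >= 0,
# but the dtype is signed and -1 may appear as a sentinel.
img_int16 = np.array([[-1, 0], [512, 1023]], dtype=np.int16)

# Clip away negative values, then reinterpret as unsigned.
img_uint16 = img_int16.clip(min=0).astype(np.uint16)
```

`img_uint16` can then be handed to the encoder one channel at a time.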
I'm probably doing something stupid but I'm failing to get `imagecodecs.jpegxl_encode` to work... I've submitted an issue: https://github.com/cgohlke/imagecodecs/issues/29
OK, some initial findings about using JPEG-XL with Zarr:
For `jpegxl_encode` to recognise the first dimension as a time dim, it's necessary to append a length-1 channel dim as the last dim, and to convert to uint16 or uint8: e.g. `satellite_dataset.expand_dims(dim="channel", axis=-1).clip(min=0).astype(np.uint16)`
To engage lossless mode, use `JpegXl(distance=0, level=None, effort=<int between 1 and 9>)`. At the moment, imagecodecs' JPEG-XL codec enables lossless mode if and only if `level` is None or < 0, which I think might be a bug: `level` actually sets the decoding speed. I've submitted a bug report here.
Lossy mode is a little fiddly, too. To engage lossy mode, `level` must not be None. When using `uint16` values in the range [0, 1023], I think JPEG-XL assumes the image is really dark, and compresses the hell out of it, unless `distance` is tiny (like 0.01). `distance` can be 0 or 0.01, but it can't be non-zero and less than 0.01. If using values in the range [0, 1023] and `distance=0.01`, then the output values are in the wrong range, although the image looks OK.
Expanding the values to the range [0, 2**16-1] means that `distance` can be more like 1. But it looks like this changes the gamma of the image, which is probably bad for our use-case. Here's the compressed image:
And the raw uncompressed image:
I have a hunch this is because this bit of code uses SRGB if the image is uint8, else uses LinearSRGB. Using uint8, the compressed image looks a lot better:
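A sketch of that range expansion, assuming the source data is 10-bit (values in [0, 1023]):

```python
import numpy as np

raw = np.array([0, 1, 511, 1023], dtype=np.uint16)  # 10-bit values

# Stretch [0, 1023] onto the full uint16 range [0, 65535].
expanded = (raw.astype(np.float64) * (2**16 - 1) / 1023).round().astype(np.uint16)
```

Going through float64 avoids integer overflow; the endpoints 0 and 1023 map exactly to 0 and 65535.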
Here are some results:
For four timesteps:
algo | runtime | size |
---|---|---|
bz2, level=5 | 6.8 sec | 2.9 MB |
jpeg-xl, distance=0, effort=7 | 8.15 sec | 2.2 MB |
jpeg-xl, distance=0, effort=9 | 29 sec | 2.2 MB |
jpeg-xl, distance=1, effort=7 | 8.2 sec | 2.2 MB |
jpeg-xl, distance=0.5, effort=9, uint8 | 16.2 s | 0.6 MB |
jpeg-xl, distance=0.5, effort=7, uint8 | 6.8 s | 0.6 MB |
jpeg-xl, distance=0.0, effort=7, uint8 | 8.58 s | 1.3 MB |
bz2, level=5, uint8 | 5.64 sec | 1.8 MB |
bz2 with uint8 is very impressive, tbh. jpeg-xl uint8 is even better, of course :slightly_smiling_face:
If the full dataset is 25TB using bz2, then it'd be more like 5 TB if using jpeg-xl lossy uint8 :) or 15 TB using uint8 bz2. A very nice thing about 5 TB is it'll fit onto a consumer 8TB SSD :slightly_smiling_face:
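Those dataset-size estimates follow from scaling the 25 TB bz2 figure by the per-codec ratios from the four-timestep table above:

```python
# Sizes from the four-timestep benchmark above.
bz2_int16_mb = 2.9
jpegxl_uint8_mb = 0.6
bz2_uint8_mb = 1.8

full_dataset_bz2_tb = 25

# Scale the full-dataset size by each codec's ratio to the bz2 int16 baseline.
jpegxl_estimate_tb = full_dataset_bz2_tb * jpegxl_uint8_mb / bz2_int16_mb   # ~5.2 TB
bz2_uint8_estimate_tb = full_dataset_bz2_tb * bz2_uint8_mb / bz2_int16_mb  # ~15.5 TB
```

This assumes the four test timesteps compress like the rest of the dataset, which is a rough extrapolation.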
For a public dataset, bz2 uint8 might be the way to go because it's easier to install.
imagecodecs' JPEG-XL support can simply be installed through pip (nice!). No need for libjxl to be installed manually as well. That's great news because it means that enabling JPEG-XL compression in Python is as simple as `pip install imagecodecs` :slightly_smiling_face:
Some conclusions (these use the `JpegXlFuture` wrapper, defined below).

Detailed results:
For four timesteps:
algo | runtime | size | MAE | PSNR |
---|---|---|---|---|
bz2, level=5, int16 | 6.8 sec | 2.9 MB | | |
jpeg-xl, distance=0, effort=7, uint16 | 8.15 sec | 2.2 MB | | |
jpeg-xl, distance=0, effort=9, uint16 | 29 sec | 2.2 MB | 0 | |
jpeg-xl, distance=0, effort=9, float16 | 27.5 sec | 2.2 MB | 0 | 100 |
jpeg-xl, distance=1, effort=7, uint16 | 8.2 sec | 2.2 MB | | |
jpeg-xl, distance=0.5, effort=9, uint8 | 16.2 s | 0.6 MB | 0.378 | 51.27 |
jpeg-xl, distance=0.5, effort=8, uint8 | 7.8 s | 0.6 MB | 0.379 | 51.26 |
jpeg-xl, distance=0.5, effort=7, uint8 | 6.8 s | 0.6 MB | 0.365 | 51.51 |
jpeg-xl, distance=0.0, effort=7, uint8 | 8.58 s | 1.3 MB | | |
jpeg-xl, distance=0.0, effort=9, uint8 | 18.2 s | 1.3 MB | | |
jpeg-xl, distance=1.0, effort=9, uint8 | 12.8 s | 0.4 MB | 0.593 | |
jpeg-xl, distance=0.1, effort=9, uint8 | 35.6 s | 1.8 MB | | |
jpeg-xl, distance=0.2, effort=8, uint8 | 10.6 s | 1.2 MB | 0.136 | 56.73 |
jpeg-xl, distance=0.3, effort=8, uint8 | 7.8 s | 0.86 MB | 0.242 | 53.99 |
jpeg-xl, distance=0.4, effort=8, uint8 | 8.08 s | 0.69 MB | 0.317 | 52.45 |
jpeg-xl, distance=0.6, effort=8, float16 | 7.8 s | 0.74 MB | 0.376 (0.353 after rounding) | 52.34 |
jpeg-xl, distance=0.5, effort=7, float16 | 5.43 s | 0.87 MB | 0.324 | 53.85 |
jpeg-xl, distance=0.4, effort=8, float16 | 7.8 s | 1.1 MB | 0.2749 | 55.17 |
jpeg-2k, level=100, uint8 | 4.9 s | 1.1 MB | 0.1625 | 56.01 |
jpeg-2k, level=80, uint8 | 5.2 s | 1.1 MB | 0.163 | 55.99 |
jpeg-2k, level=65, uint8 | 5.0 s | 1.0 MB | 0.178 | 55.55 |
jpeg-2k, level=60, uint8 | 5.0 s | 0.9 MB | 0.210 | 54.78 |
jpeg-2k, level=55, uint8 | 5.0 s | 0.75 MB | 0.2935 | 53.01 |
jpeg-2k, level=50, uint8 | 4.9 s | 0.5 MB | 0.468 | 49.79 |
bz2, level=5, uint8 | 5.64 sec | 1.8 MB | | |
Some plots of jpeg-xl, distance=0.6, effort=8, float16:
The compressed image:
The notebook for creating these plots is here
Some notes from today's experiments...
Our "pre-prepared batches" have five dimensions:
AFAICT, NetCDF cannot use the numcodecs API. And so we cannot use JPEG-XL (or any other imagecodecs) to compress data into NetCDF files. (Which is a shame because it might have been nice to use JPEG-XL to compress our pre-prepared batches.)
We could use Zarr for our pre-prepared batches. But the `JpegXl` class in `imagecodecs` only really works with images of shape `(y, x)`. We can make this work like this:
```python
import numpy as np
from numcodecs import register_codec
from imagecodecs.numcodecs import JpegXl


class JpegXlFuture(JpegXl):
    """Simple hack to make the JpegXl compressor in the currently released
    version of imagecodecs (version 2021.11.20) look like the version in development.

    (We can't simply use the version in development because the imagecodecs author
    does not develop on GitHub. The imagecodecs author just uses GitHub as one of
    several mechanisms to release imagecodecs.)

    See https://github.com/cgohlke/imagecodecs/issues/31#issuecomment-1026179413
    """

    codec_id = 'imagecodecs_jpegxl'

    def __init__(self, lossless=None, decodingspeed=None, level=None, distance=None, *args, **kwargs):
        """
        Args:
            distance: Lowest settings are 0.00 or 0.01. If 0.0 then also set lossless to True.
            level: DON'T SET THIS WITH THIS JpegXlFuture WRAPPER!
                In imagecodecs version 2021.11.20, level is mapped (incorrectly) to the
                decoding speed tier. Minimum is 0 (highest quality), and maximum is 4
                (lowest quality). Default is 0.
            decodingspeed: DON'T SET THIS WITH THIS JpegXlFuture WRAPPER!
        """
        assert decodingspeed is None
        if lossless is not None:
            if lossless:
                assert level is None  # level must be None to enable lossless in imagecodecs 2021.11.20.
                assert distance is None or distance == 0
            else:
                # Enable lossy compression.
                level = 0  # level must be 0, 1, 2, 3, or 4 to enable lossy compression in imagecodecs 2021.11.20.
        super().__init__(level=level, distance=distance, *args, **kwargs)

    def encode(self, buf):
        n_examples = buf.shape[0]
        n_timesteps = buf.shape[1]
        n_channels = buf.shape[-1]
        assert n_examples == 1
        assert n_timesteps == 1
        assert n_channels == 1
        buf = buf[0]
        return super().encode(buf)

    def decode(self, buf, out=None):
        assert out is None
        out = super().decode(buf, out)
        out = out[np.newaxis, ...]
        return out


register_codec(JpegXlFuture)
```
```python
# Convert to uint8.
satellite_dataset["data"] = (
    satellite_dataset["data"].clip(min=0, max=1023) / 4
).round().astype(np.uint8)

lossless = False
distance = 0.6
effort = 8
encoding = {
    "data": {
        "compressor": JpegXlFuture(lossless=lossless, distance=distance, effort=effort),
    },
}
chunks = dict(
    example=1, time_index=1, y_geostationary_index=None,
    x_geostationary_index=None, channels_index=1,
)
satellite_dataset.chunk(chunks).to_zarr(
    DST_PATH / (filename.stem + ".zarr"),
    mode="w",
    encoding=encoding,
    compute=True,
)
```
But this takes a whole minute per batch, and you end up with lots of tiny Zarr chunks on disk (a few tens of bytes each!).
The JPEG-XL spec allows for multiple images within a single JPEG-XL file, but I don't think `imagecodecs.JpegXl` knows how to decode multiple images (although I think it does know how to encode them!). When encoding all timesteps into a single image, it "only" takes 20 seconds per batch. (I've asked a question on the imagecodecs issue queue here.)
I'm much more optimistic about using JPEG-XL for our intermediate Zarrs.
(In our current Zarrs, the "real" pixel values are in the range [0, 1023], and NaNs are encoded as `-1`.)
The slightly fiddly thing is how to encode "NaNs".
JPEG-XL supports `uint8`, `uint16`, `float16`, and `float32`.
Trying to compress a `float16` or `float32` image which includes any NaNs results in JPEG-XL silently failing to save any image data to disk (not surprising: NaNs aren't really a thing in "proper" photography, which is what JPEG-XL is mostly designed for).
When using `float16` or `float32`, the image values are supposed to be in the range [0, 1]. It appears to be possible to compress images where some values are outside that range (e.g. `-1`), but the "illegal" values change quite dramatically. And I'm keen not to do anything out-of-spec, for fear that it breaks something.
Where I've landed is to put "real" image values in the range [0.075, 1], and to use 0.025 to encode NaNs. For testing, I injected a 100x100-pixel "NaN box" of 0.025 values into the image, like this:
```python
original_float32["data"][:, 100:200, 100:200] = 0.025
```
This does result in a bit of "ringing" (for want of a better term) around the edges of the "NaN box":
All values in the "NaN box" are between 0.0227 and 0.0280.
(The ringing has a much larger amplitude if we use 0.95 as the "NaN marker": then the values in the "NaN box" range from about 0.90 to 0.99.)
Here's the code I'm using:
```python
# Encode "NaN" as 0.025. Encode all other values in the range [0.075, 1].
# But, after JPEG-XL compression, there is slight "ringing" around the edges.
# So, after decompression, use "image < 0.05" to find NaNs.
LOWER_BOUND_FOR_REAL_PIXELS = 0.075
NAN_VALUE = 0.025

# `original` is the unmodified DataArray, in which NaN is encoded as -1.
original_float32 = original.astype(np.float64)  # Use float64 for the maths; convert to float32 at the end.
original_float32 = original_float32.clip(min=0, max=1023)
original_float32 /= 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS)
original_float32 += LOWER_BOUND_FOR_REAL_PIXELS
original_float32 = original_float32.where(
    cond=original >= 0,  # The pre-clipped array still knows where the NaNs (-1) were.
    other=NAN_VALUE,
)
original_float32 = original_float32.astype(np.float32)
```
```python
lossless = False
distance = 0.4
effort = 8
encoding = {
    "data": {
        "compressor": JpegXlFuture(lossless=lossless, distance=distance, effort=effort),
    },
}
# Need to append a length-1 channel dim so that JpegXl interprets the first dimension as "time".
# original_float32 has chunks: `time=1, y=None, x=None`.
zarr_store = original_float32.expand_dims(dim="channel", axis=-1).to_zarr(
    OUTPUT_ZARR_PATH,
    mode="w",
    encoding=encoding,
    compute=True,
)
```
Doing this, the PSNR is a very healthy 54.37. But beware that PSNR increases if you simply brighten the image (which is what we've done!)
The size on disk for my four test timesteps is a total of 690 kB, which is 24% of the size of the same image compressed using bz2 :slightly_smiling_face:
Let's use JPEG-XL with distance=0.4, effort=8, lossless=False. Encode NaNs as 0.025. Encode "real" pixel values in the range [0.075, 1]. Then recover the original image with something like:
```python
# Set all values below 0.05 to NaN:
recovered_image = decompressed_image.where(decompressed_image > 0.05)
recovered_image -= LOWER_BOUND_FOR_REAL_PIXELS
recovered_image *= 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS)
```
Doing this gives a PSNR of 53.74.
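Ignoring compression error, that encode/decode pair is an exact round trip. A numpy-only sanity check (using `np.where` in place of xarray's `.where`):

```python
import numpy as np

LOWER_BOUND_FOR_REAL_PIXELS = 0.075
NAN_VALUE = 0.025

original = np.array([-1.0, 0.0, 511.0, 1023.0])  # -1 encodes NaN

# Encode: squeeze real values into [0.075, ~1], map NaNs to 0.025.
encoded = original.clip(0, 1023) / (1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS))
encoded += LOWER_BOUND_FOR_REAL_PIXELS
encoded = np.where(original >= 0, encoded, NAN_VALUE)

# Decode: values below the 0.05 threshold become NaN, the rest rescale exactly.
decoded = np.where(
    encoded > 0.05,
    (encoded - LOWER_BOUND_FOR_REAL_PIXELS) * 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS),
    np.nan,
)
```

Note that with these constants the encoded maximum comes out at roughly 1.005 (slightly above 1); mapping to exactly [0.075, 1] would need a `(1 - LOWER_BOUND)` scale factor instead.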
Detailed Description
JPEG-XL is the "new kid on the block" for image compression. And it can losslessly compress greyscale images (using colorspace YUV 400). It might be great for compressing satellite images into our Zarr datasets.
Context
See issue #45 for previous experiments and notes.
This great comparison of lossless compression using JPEG-XL, WebP, AVIF, and PNG suggests JPEG-XL wins.
Possible Implementation

- `imagecodecs` includes an interface to JPEG-XL.
- Next step is to try manually compressing images using the `cjxl` app (not ImageMagick).
- If that looks good, then create a stand-alone little adapter library to adapt `imagecodecs` to be used with Zarr. UPDATE: @cgohike has already implemented this in imagecodecs! (See comment below.)
- Here's a super-simple little python library (just 51 lines of code!) which enables jpeg-2000 compression in Zarr using imagecodecs. Maybe it'd be possible to use the same pattern, but for JPEG-XL?

To use ImageMagick for quick experiments at the command line:

- You need ImageMagick version > 7.0.10 to use JPEG-XL.
- See here for how to install ImageMagick from source.
- Then use the `magick` command, not `convert`.
I'll do some experiments later today or tomorrow :slightly_smiling_face:
TODO: `.jxl` (can't see how to do this and, anyway, Chrome cannot currently open jxl files)