New DDS decoder - Githubissues

RunDevelopment commented 3 months ago

I made a new DDS decoder, because the old DXT-based decoder was very limited (only DXT1-5 + dimensions divisible by 4) and incorrect (DXT1 colors were not rounded correctly, resulting in discolorations). While this PR is not finished yet, I already implemented the following features:

Support for DXT1-5 and BC1-5 (including arbitrary image sizes).
Support for all (s)normalized and floating point uncompressed DX10 RGB formats. So most of DXGI_FORMAT.
Support for many DX9 pixel formats. This is done by mapping DDS pixel formats to DXGI_FORMAT (almost; I actually map to a supported formats enum, but this enum closely follows DXGI_FORMAT). This means that there are DX9 pixels formats the new DDS decoder does not, but I couldn't produce/find any DDS files that use them.
Support for non-standard DDS files. Some old and current encoders use custom flags not mentioned in the official docs (e.g. here are the pixel format flags TexConv uses vs the officially documented ones). I used the same flags for TexConv and ignored unknown flags. This allows the new DDS decoder to support more old DDS files with uncommon and non-standard formats.
Support for cubemaps. Cubemaps are displayed in a 4x3 grid that shows the unfolded cube. Example:

The only main formats that are still missing are BC6 and BC7. These 2 are quite complex, so it will likely take me some more time to implement them. Once those 2 are implemented, this should be a pretty competent DDS decoder.

Some notes and technical decisions:

I use &mut dyn Read in all format decoders to reduce binary size. Since there are a lot of DDS formats, there are a lot of functions to decode these formats. I think the binary size could be reduced even further, so I'm very open to feedback in that regard.
I put particular care into ensuring that the decoded colors are correctly rounded. DirectX uses float32 everywhere internally to read DDS files. So to output a 5-bit value as 8-bit, DX does (x as f32 / 31.0 * 255.0).round() as u8 (= first convert to a f32 value in the range 0-1, and then convert to u8 with rounding). Doing this with f32 is quite slow, so I did with only integer operations that are around 3x faster than f32 and around 2x harder to understand. The tricks I use are explained in x5_to_x8 in convert.rs.
I already added the code to read arbitrary surfaces from a DDS file. So reading mip chains, volumes, and image arrays can be easily supported. There's just not much use for it right now aside from cubemaps.
I used Paint.net to determine what DDS files should look like. I did this because I have not found a single DDS file it decoded incorrectly so far, so I used it as my source of truth. Paint.net uses DirectTex (same library that texconv uses) under the hood AFAIK, so everything should be correct.

With what I did so far out of the way, I have some questions to the maintainers on how to integrate this into the image crate:

How should I test this? Used around 20MB of images (>100 files) and tests/reference_image.rs > render_images to ensure that the decoder would correctly, but 20MB is a lot of image data to commit to a repo. I can go down to around 8MB, but not much lower. The issue is that I only have one file for some rare formats and no way to generate smaller images of those formats.
Benchmarking. I have not benchmarked decoding DDS files yet (although it shouldn't be too slow). The current bench setup also seems to be ill-suited for DDS, as I need to benchmark each format individually. How should I go about this?
Should this in a separate crate? I initially wanted to make this a separate crate, but found it vastly easier to add it to the image crate
What should we do with the old DXT implementation? It's not used by the new DDS decoder, and it's deprecated for some time now. Maybe it could be removed after the new DDS decoder is in?
How careful do I have to be with resource limits? Most formats currently read line-by-line into a temporary buffer from which pixels are then decoded. The size of this buffer is proportional to the width of the image. This means that we only need O(sqrt(N)) (N=number of pixels) additional memory for roughly square images. However, an attacker would supply an image with height=1, causing the temporary buffer to as large as (but no larger than) the output buffer. I already limited the width and height of DDS images to be at most 2²⁴, so the temporary buffer can be at most 256MB (with any R32G32B32A32_* format).

Also, (not a question) this my first large-scale Rust PR, so please feel free to pick apart and suggest improvements to everything you see.

RunDevelopment commented 3 months ago

Thanks for the quick response!

Also, for bc decoding there's already a crate which looks interesting and doesn't even have dependencies: bcdec_rs. As far as is possible, I'd see if the bc.rs and convert.rs contents and parts of decoder.rs might be upstreamed to that crate and then used via a dependency. It's important to reduce the amount of code to be maintained centrally. (Aside: they, as in the author, also seem to have a file/encoding/decoding crate about it, image_dds, so might be interested anyways).

bcdec_rs is certainly interesting. Especially since they already implemented BC6/7 (although their code scares me). However, they too seem to get rounding slightly wrong from looking at the code. I'll have to do some testing to be sure though. Assuming it's all correct (or that we fixed any issues), bcdec_rs should be able to replace bc.rs completely.

However, I don't think that the things in decoder.rs fit the theme of bcdec_rs, so I don't think moving them into that crate would fit. Of course, it's up to the author of bcdec_rs to decide what does and doesn't fit.

On the topic of image_dds: I might also add an encoder for uncompressed DDS files (= everything except DXTn/BCn) in a future PR. Block compression encoding might be a bit too difficult for me to get something good working, but all other formats are uncompressed and almost trivial to encode. I'm saying this now, because it might change your decision on whether this all should be a separate crate.

I also want to mention that I don't intend to maintain this code, so if you want to make it a separate crate, I would ask to put it under the image-rs org. I don't have any experience with maintaining a crate and I don't do a lot of Rust in general.

How should I test this?

8 MB isn't that much if the files don't change. That is, are they free of copyright risks etc. This would be my primary concern. Still I'm wondering why is it that the test images are this large?

Copyright won't be a concern for most images. I tested most formats with a 257x131 test image I created myself, and then used texconv to create DDS files in different formats. This is the image btw:

However, there are some images for which copyright will be an issue. For some less-common formats, I only have one image per format that I found in some games (e.g. Elden Ring and the Dark Souls games). Given that all of these formats are uncompressed (no fancy block compression), I could write some scripts to generate images in those formats. Non-standard flags could also be tested by just hex-editing the headers of similar DDS files.

As for the size: most DDS formats are uncompressed, so even my little test image produces a 174KB DDS file when saved with 8bpc RGBA color. Take that times 100 files and that's why. My main way of reducing the size would be use an even smaller test image btw.

Doing this with f32 is quite slow, so I did with only integer operations that are around 3x faster than f32 and around 2x harder to understand.

I'm not sure all the tricks are worth the infamiliarty burden. Did you benchmark them? End-To-End numbers would be most convincing here, some floating point operations are quite fast, unless integer versions are smaller and can be vectorized better.

I did a quick benchmark for a DX10 B5G6R5_UNORM image. (This format basically just does a range conversion (the same one BC1 uses) and that's it, so it's the ideal format to test this with.) I also found that the bcdec_rs crate uses a different trick for the range conversion. I don't understand how they did it, but they found a version that uses a bit shift instead of a division at the end. And it's crazy fast:

my int trick
  time:   [84.337 µs 84.543 µs 84.747 µs]

f32
  time:   [112.58 µs 112.93 µs 113.36 µs]
  change: [+32.418% +33.158% +33.792%] (p = 0.00 < 0.05)

bcdec_rs int trick
  time:   [57.821 µs 58.017 µs 58.255 µs]
  change: [-31.977% -31.377% -30.646%] (p = 0.00 < 0.05)

For the f32 conversion I used this snippet (similar for `x6_to_x8`): ```rs #[inline(always)] pub(crate) fn x5_to_x8(x: u16) -> u8 { debug_assert!(x < 32); const FACTOR: f32 = 255.0 / 31.0; (x as f32 * FACTOR + 0.5) as u8 } ``` So I already optimized to multiply with a single constant and use the truncation `as u8` performs for rounding by adding 0.5. Especially the truncation trick is a *lot* faster than the call to round, which compiles to a literal `call` instruction.

So I think I'm going to switch to the trick bcdec_rs uses :)

And I think I also find constants for the same trick to use in other range conversion function. I don't know how they derived their constants, but they are small enough that I can use brute force to find more.

RunDevelopment commented 2 months ago

Alright, so it's pretty much ready now.

Updates:

I used bcdec_rs to add support for BC6 and BC7, so we support all commonly-used DDS formats now.
I also optimized how pixels are written into the output buffer. For some formats, this made decoding around 2x faster.
I removed the old DXT module becasue it was entirely unused and the compiler complained.

There are just a few things we have to talk about:

bcdec_rs: As I suspected, bcdec_rs does not round correctly. This affects all block compression formats except for BC6 and BC7. This isn't even a bug with bcdec_rs directly, because it inherited it from the bcdec C library.

Unfortunately, both the author of bcdec_rs and bcdec do not seem to be convinced that this is a bug or that it is worth fixing. I'll talk to them more, but that this means that we can't use bcdec_rs for BC1-5 for the time being.

bcdec_rs also doesn't support signed BC4 and BC5.
Clippy: DxgiFormat and PixelFormatFlags has a few unused associated constants. These constants are there for completeness, so removing them isn't really an option, because I'm not too keen on adding in DXGI_FORMAT constants on demand.

I could slap #[allow(unused)] on all these constants, but that doesn't seem like a nice solution. Is there a better way to fix this?

RunDevelopment commented 2 months ago

I've also run into a problem with testing: How do I test RGB32F images? render_images doesn't output them.

RunDevelopment commented 1 month ago

Sorry for the delay!

I mostly finished up the PR now.

I fixed all clippy problems.
I tried using the bcdec_rs crate for BC decompression as much as possible, but its BC4 implementation is incorrect, so I don't use it. I've made a PR to fix this upstream in bcdec, so hopefully this won't be an issue for much longer.
I added tests for all supported formats that I could test.

However, I need help with testing the floating-point formats. I don't know how to generate reference images for them, since render_images in reference_images.rs doesn't support f32 images. @HeroicKatora, could you please help me here?

Otherwise, this PR is ready as I see it. The new DDS decoder is correct, decently performant, and covers most DDS formats.

@HeroicKatora Please let me know what else needs to be done/changed.

image-rs / image

New DDS decoder #2258