richgel999 / bc7enc_rdo

State of the art RDO BC1-7 GPU texture encoders

BC4/5 fixes and performance improvements #18

Open cwoffenden opened 2 years ago

cwoffenden commented 2 years ago

This fixes #17 but goes further:

Lots of text snipped, jump down to the next paragraph. Originally this expanded the internal endpoints to 14 bits, but in testing the RMSE and PSNR were always slightly worse even though the max error was reduced. These errors were higher because they were calculated from the 8-bit PNG file, not from the hardware's representation. Ryg's blog entry has a good explanation of what the hardware does.
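For context, here is a sketch of the widely documented 8-bit BC4 reference decode (not this repository's code; exact rounding conventions vary between decoders, and hardware interpolates at higher internal precision, which is what Ryg's post digs into):

```cpp
#include <cstdint>

// Reference 8-bit BC4 palette: two stored endpoints plus interpolated values.
// The '+ 3' and '+ 2' terms are one common rounding choice; decoders differ,
// and GPUs work at higher internal precision than these 8-bit results.
static void bc4_build_palette(uint8_t e0, uint8_t e1, uint8_t pal[8])
{
    pal[0] = e0;
    pal[1] = e1;
    if (e0 > e1)
    {
        for (int i = 1; i <= 6; i++) // six interpolated values
            pal[1 + i] = (uint8_t)(((7 - i) * e0 + i * e1 + 3) / 7);
    }
    else
    {
        for (int i = 1; i <= 4; i++) // four interpolated values...
            pal[1 + i] = (uint8_t)(((5 - i) * e0 + i * e1 + 2) / 5);
        pal[6] = 0;   // ...plus explicit 0
        pal[7] = 255; // ...and 255
    }
}
```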

I simplified this commit to address the main issue, which was blocks with two (or a few) values having errors in hardware due to one endpoint always being interpolated (which doesn't occur with an 8-bit software decoder). This is achieved by starting the search radius at zero and working outwards (0, -1, 1, -2, 2, etc.). Further, once we hit zero error we take that block as the best available and exit early.
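To illustrate the search order, here is a minimal sketch (not the actual rgbcx.cpp code; `evaluate_candidate` is a hypothetical stand-in for trying one endpoint tweak and returning its total block error):

```cpp
#include <cstdint>

// Sketch of the outward search: visit offsets 0, -1, 1, -2, 2, ... up to
// the search radius, and stop as soon as a candidate decodes exactly.
template <typename EvalFn>
static uint64_t search_outwards(int radius, EvalFn evaluate_candidate)
{
    uint64_t best_err = UINT64_MAX;
    for (int step = 0; step <= radius; step++)
    {
        // Negative offset first, then positive: 0, -1, 1, -2, 2, ...
        const int offsets[2] = { -step, step };
        for (int k = (step == 0) ? 1 : 0; k < 2; k++)
        {
            const uint64_t err = evaluate_candidate(offsets[k]);
            if (err < best_err)
                best_err = err;
            if (best_err == 0)
                return 0; // Exact match: nothing better exists, exit early.
        }
    }
    return best_err;
}
```

Because two- or three-value blocks usually decode exactly at or near offset zero, the early out fires almost immediately for them.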

This fixes the original issue, keeps the max error, RMSE and PSNR exactly the same, and improves performance. Some timings, using the default -hr5 radius:

| Image | Format | Encoding time, original (s) | Processing time, original (s) | Encoding time, this commit (s) | Processing time, this commit (s) |
| --- | --- | --- | --- | --- | --- |
| flowers-2048x2048 | BC4 | 0.599 | 0.656 | 0.476 | 0.534 |
| quenza-2048x2048 | BC4 | 0.825 | 0.883 | 0.725 | 0.784 |
| bunny-nmap-2048x2048 | BC5 | 0.446 | 0.510 | 0.214 | 0.271 |
| can-nmap-2048x2048 | BC5 | 0.342 | 0.398 | 0.212 | 0.268 |

All timings were the best of four runs. The biggest improvement is in the normal maps, since they have large areas with 2-3 values hovering around 127, and because the search radius now grows outwards these are found early on.

richgel999 commented 2 years ago

Thank you, this looks very valuable. Have you tested these changes on a large amount of content to verify that the output encoding hasn't changed? That's my main concern initially.

cwoffenden commented 2 years ago

I'll try to find the time to throw a few thousand grayscale and normal maps at it and verify the error metrics and times. The encoded output may differ (e.g. two-value BC4 will always use selectors 0 and 1 instead of a single endpoint plus an interpolation) but the decoded output at 8-bit should be the same, so I could hash the decoded PNG. I don't think I'll be able to do that in the next few weeks though.
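To show why the two-value case decodes identically everywhere, here is a sketch of the general BC4 selector layout (not this PR's code; `BC4BlockSketch` and `encode_two_value_block` are illustrative names, and the real format packs the sixteen 3-bit selectors into 48 bits):

```cpp
#include <cstdint>

// Sketch: a BC4 block holding only two distinct values 'lo' and 'hi' can be
// stored exactly by making those values the endpoints and using selectors
// 0 and 1, which map straight to the endpoints in both the 8-bit reference
// decode and in hardware, so no interpolation (and no rounding) is involved.
struct BC4BlockSketch
{
    uint8_t endpoint0;
    uint8_t endpoint1;
    uint8_t selectors[16]; // 3-bit selectors, left unpacked for clarity
};

static BC4BlockSketch encode_two_value_block(const uint8_t pixels[16],
                                             uint8_t lo, uint8_t hi)
{
    BC4BlockSketch blk;
    blk.endpoint0 = hi;  // hi > lo keeps the six-interpolant mode
    blk.endpoint1 = lo;
    for (int i = 0; i < 16; i++)
        blk.selectors[i] = (pixels[i] == hi) ? 0 : 1; // endpoints only
    return blk;
}
```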

cwoffenden commented 2 years ago

I have some initial results. I wrote this (rather sprawling) test runner to verify everything:

https://gist.github.com/cwoffenden/98780e9009a2d4f62433ea9f77ef4113

You can give it a directory of PNGs and it'll compress them and then collect the metrics in a CSV file. For example:

./runbc7enc.py -b 4 -o /Volumes/Temp -x ./bc7enc -l nfproj-grey-orig.csv -t /Volumes/Work/Assets/Test/Numfum/grey

This ran the BC4 encoder on 450-ish greyscale files and recorded the max error, RMSE and PSNR (but ignored the time, just so I could do a quick diff). Here are the results for the original and the changed code.

The RMSE and PSNR don't change (probably not enough digits), but the max error does, in an interesting way. There are five differences in this set of files, with four of the five having a lower-by-one max error in the new code. It's interesting because it highlights a potential accidental improvement, which I'll look at during the week (better selection of the best block).

I'll cover the processing time later when I've thrown more files at it (short version: it's faster, about 20% on average when fed hundreds of normal maps). On Mac it doesn't build with OpenMP (it's not supported out of the box), so I want to wait until I'm back at work to test on other OSes.

I can share the test files with you so you can verify them, if you like. I have a classifier that goes through internal projects and pulls out the different texture types.

cwoffenden commented 2 years ago

I ran the same test on approximately 1400 other greyscale files and found two more cases where the max error is lower in the changed code. CSV files here.

It's totally accidental that it swings this way, since I've seen a few normal maps where the lower max error is in the original code. It comes down to taking the summed error and calling the lowest value the best, rather than checking which of the candidates with equal summed errors has the lower average or maximum. Specifically here:

https://github.com/richgel999/bc7enc_rdo/blob/e6990bc11829c072d9f9e37296f3335072aab4e4/rgbcx.cpp#L2859

trial_err needs further refinement.
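Something along these lines would break the tie (a hypothetical sketch, not a patch against rgbcx.cpp; `CandidateError` and `is_better` are invented names): keep the per-block maximum error alongside the summed error and prefer the candidate with the smaller maximum when the sums are equal.

```cpp
#include <cstdint>

// Hypothetical tie-break between candidate encodings: the summed error stays
// the primary metric, but when two candidates have equal sums the one with
// the smaller worst-case (max) per-texel error wins.
struct CandidateError
{
    uint64_t total; // sum of per-texel errors for the block
    uint32_t max;   // largest single per-texel error in the block
};

static bool is_better(const CandidateError& trial, const CandidateError& best)
{
    if (trial.total != best.total)
        return trial.total < best.total;
    return trial.max < best.max; // tie-break on the max error
}
```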