richgel999 / bc7enc_rdo

State of the art RDO BC1-7 GPU texture encoders
Other
180 stars 32 forks source link

VC 2017 compiler bug and workaround with mode 4/5 blocks in bc7decomp.cpp #10

Open CoriolisStorm opened 3 years ago

CoriolisStorm commented 3 years ago

Thanks for open sourcing this great tool! We at Respawn plan to start using it soon for an internal tool making some of the textures in Apex Legends.

I was compressing using bc7e, decompressing using bc7decomp.cpp, and dumping the results as a PNG. It looked great, except some pixels that should have been orange were green instead. I disabled optimizations on bc7decomp.cpp so I could step through the code and see how it was encoded, and to my surprise, the colors it showed in the debugger were right! I let it finish, and the green pixels were in fact the proper orange.

I usually suspect uninitialized data when optimized code behaves differently, but when I looked into it in the debugger and read the generated disassembly, the compiler actually had a bug. This shocked me; compiler bugs are very rare! I've only encountered 2 other compiler bugs in 20 years of professional software development.

In unpack_bc7_mode4_5, there's a nested loop to initialize "endpoints[e][c]" on lines 393-397. With optimizations enabled, it looks like VC++ 2017 swapped the order of the inner and outer loops, then unrolled the loop over "c" (which was the outer loop but is now the inner loop). This changed where the bits read from "color_read_bits" ended up. Once the end points were decoded wrongly, everything else was wrong after that.

Here's the relevant disassembly, with comments added to help see what's going on:

00007FFE8DC34360 40 0F B6 C7          movzx       eax,dil         // rdi has "color_read_bits"
00007FFE8DC34364 48 8D 52 04          lea         rdx,[rdx+4]     // rdx is "end_points", with an offset
    {
        for (uint32_t e = 0; e < ENDPOINTS; e++)
        {
            endpoints[e][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
00007FFE8DC34368 41 22 C6             and         al,r14b              // r14b is 0x7F for mode 5, 0x1F for mode 4
            color_read_bits >>= ENDPOINT_BITS;
00007FFE8DC3436B 48 D3 EF             shr         rdi,cl               // cl is 7 for mode 5, and 5 for mode 4
00007FFE8DC3436E 88 42 FA             mov         byte ptr [rdx-6],al  // this stores endpoints[e][0]
00007FFE8DC34371 40 0F B6 C7          movzx       eax,dil
00007FFE8DC34375 41 22 C6             and         al,r14b  
00007FFE8DC34378 48 D3 EF             shr         rdi,cl               // this stores endpoints[e][1]
00007FFE8DC3437B 88 42 FB             mov         byte ptr [rdx-5],al  
00007FFE8DC3437E 40 0F B6 C7          movzx       eax,dil  
00007FFE8DC34382 41 22 C6             and         al,r14b  
00007FFE8DC34385 48 D3 EF             shr         rdi,cl  
00007FFE8DC34388 88 42 FC             mov         byte ptr [rdx-4],al  // this stores endpoints[e][2]
00007FFE8DC3438B 49 83 E8 01          sub         r8,1                 // r8 counts down from 2 to 0, so it is "2 - e"
00007FFE8DC3438F 75 CF                jne         bc7decomp::unpack_bc7_mode4_5+100h (07FFE8DC34360h)  

Each of 2 iterations picks off N bits at a time (N = 5 or 7) and stores them in end_points[e][0..2]. In shorthand with everything unrolled, it writes the picked-off bits in order 00 01 02 10 11 12. However, the C++ code says the order should be 00 10 01 11 02 12. Reordering which bits went to which array entry caused the improper decompression.

My workaround was to manually unroll the inner loop, so that the optimizer wouldn't switch the inner/outer loops. Once I did that, the green pixels were orange in optimized builds as well. There may be other, better workarounds.

        // VC++ 2017 with full optimizations reversed the orders of the loops, not realizing it changed which
        // bits got extracted from 'color_read_bits' for all but the first and last iteration.
        static_assert(ENDPOINTS == 2);      // This is a manual unrolling of a loop over ENDPOINTS
        endpoints[0][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
        color_read_bits >>= ENDPOINT_BITS;
        endpoints[1][c] = static_cast<uint8_t>(color_read_bits & ENDPOINT_MASK);
        color_read_bits >>= ENDPOINT_BITS;

I was curious why I didn't notice this bug with the other decompressor modes. It turns out that the functions unpack_bc7_mode0_2 and unpack_bc7_mode1_3_7 have the extra line "uint64_t channel_read_chunk = channel_read_chunks[c];" at the top of the outer loop before the start of the inner loop, so the compiler can't swap the order of those loops. The last function is unpack_bc7_mode6, which doesn't loop at all, since there are only 2 endpoints to decode.

Sorry for not making this a pull request, but it seemed like GitHub wanted me to clone the whole repository and let it diff the changes to find out what I changed. I just wanted to suggest a new version for a single file, but I didn't see any way to do that in their interface.

richgel999 commented 3 years ago

Thank you so much for the report! I've been encountering C/C++ compiler bugs with VC 2019, but I've been working around them. I don't test with VC2017.

richgel999 commented 3 years ago

I'll see if I can repo and add a workaround for VC 2017. Also my next big step on this project is adding OpenCL support for faster encoding (as time permits - we've been pretty slammed with Basis Universal work).