Fix broken C version of MulMatricesC

inactive123 commented 9 years ago

Me and cxd4 (Hatcat/mephostophilis) stumbled upon this while backporting some stuff to the libretro version.

The C version of MulMatricesC appears to be totally broken. I guess this went unnoticed before because it would default to the SSE2 codepath version on regular desktops.

Read the explanation by cxd4 here:

https://github.com/libretro/mupen64plus-libretro/pull/159

' repairs were made to the ANSI C code which had broken texture elements in the backgrounds of games like Super Mario 64. In addition, the new ANSI C function is written in vectorized loop style, such that GCC can very easily create the SSE instructions relevant to said loops in a manner perhaps more optimized than the original SSE-fixed code.'

littleguy77 commented 9 years ago

Maybe I'm missing something, but I don't see any defects in the original code:

void MulMatricesC(float m1[4][4],float m2[4][4],float r[4][4])
{
    for (int i=0; i<4; i++)
    {
        for (int j=0; j<4; j++)
        {
            r[i][j] = m1[i][0] * m2[0][j] +
                      m1[i][1] * m2[1][j] +
                      m1[i][2] * m2[2][j] +
                      m1[i][3] * m2[3][j];
        }
    }
}

littleguy77 commented 9 years ago

yikes... chill @fayvel

Can we agree that this pull request is an optimization, not a fix.

Narann commented 9 years ago

@fayvel I'm always open to discussion with people from other N64 projects, especially when they try to contribute to the upstream. N64 scene is too small to let people rant over each others. :)

@twinaphex thanks for contributing back but as @littleguy77 say, the code seems to work. I suggest to look at the potential result of your PR.

I think your Glide64 version is not based on mk2 but the old glide64. Your PR would have more sense on this repo.

inactive123 commented 9 years ago

OK, forget about getting anymore upstream patches after this. You can thank @fayvel for that.

inactive123 commented 9 years ago

I really don't know what you're talking about, and with your passive-aggressive attitude on display here, I don't think I really would want to know anyway.

inactive123 commented 9 years ago

You create a hostile environment that I really want no part in and I have better things to do than to negotiate with people with anger management issues. It has nothing to do with 'punishing anybody'. The fork is opensource anyway so if they want anything they can take it.

ghost commented 9 years ago

Can someone please merge the patch from @twinaphex. Thanks

inactive123 commented 9 years ago

This is not about the patch not having been merged or not, I don't care if it is merged or not, I was just trying to be helpful and in any case, the author of that patch (@cxd4) provided plenty of reasons for it being an improvement. But this is not about that. This is about your choice of words in that comment you have now removed. I'd appreciate it if you would let it just stay up instead of trying to act like you said nothing bad.

littleguy77 commented 9 years ago

lame @fayvel

ghost commented 9 years ago

@twinaphex I am terrible sorry. What can I do to make all this go away? Can I give you a present or anything else?

I will try to not say anything bad again about the work of other people

inactive123 commented 9 years ago

You can say bad things for sure, but you don't have to be insulting and confrontational about it.

Anyway, I'll pretend this didn't happen for this time and I'll assume good intent from now on.

Narann commented 9 years ago

I've hesitated to remove the original @fayvel message at the moment I saw it but I didn't because I don't like this kind censorship and I prefer discussion.

But, I'm tired of all the angriness around the N64 scene. This angriness that make every N64 devs want to work away from other devs while I deeply think we should work together to make N64 the emulator it deserve. I spend some time trying to discuss and involve devs from other N64 project in the upstream so any upstream contribution should be appreciate and encouraged.

I will be less diplomatic next time I see such attack and would remove it instantly.

Sorry for the bad noise @twinaphex. Any upstream patch from you or your team is appreciate. If it would require some work, I would recommend to create an issue pointing your internal discussion. This would avoid your team to spend time on something maybe already fixed.

littleguy77 commented 9 years ago

This PR still represents a potential optimization. Anyone know any good test cases where MulMatricesC is a bottleneck?

littleguy77 commented 9 years ago

Agree with @Narann. There are too few devs to waste on pointless drama. Particularly good devs like @twinaphex -- who has already contributed numerous critical fixes to long-standing bugs back to upstream.

Narann commented 9 years ago

I was blocked on the fact this PR was put on the bad Glide64 version and miss the auto vectorization feature it provide.

Narann commented 9 years ago

Thanks! @twinaphex I will merge now and comment why we do things like this (simply add "this make the code easy to vectorize for the compiler").

I encourage devs to comment when they write code is a unnatural way for specific optimization purpose.

cxd4 commented 9 years ago

Inevitably, they will. Auto-vectorization is still a rather sensitive science. You can usually count on new releases of Microsoft Visual C compilers to be stubborn and predictable in that regard (also less effective overall), whereas GCC it is always ever-improving...and ever-worsening! I had to compile my older RSP code in GCC 4.7.2 instead of GCC 4.8 due to the regressions outweighing the improvements when it came to auto-vectorization, so it's always best if we keep it simple.

Your first asm blob result--the first option that I was already doing with those 4 separate loops--seems to be the result of GCC failing to pack the floating-points into parallel streaming. You keep seeing loads of "movss" and "mulss" throughout that overly large asm result, when really, it should be "mulps". Rather than multiplying 4 floats in a single loop, it's doing 4 individual mulss operations, per each loop.

The second asm blob result is still the result of your method you suggested earlier, and the result of my method from before on GCC 4.8.2. It seems you were right after all to suggest this improvement to my loops--I will do another pull request to twinaphex shortly crediting you.

Now, the third asm result you gave, IS succeeding at packing the results in parallel SIMD, but failing at shuffling by your outer loop variable, k, causing a lack of static compile-time prediction when organizing the shufps shuffle immediates. It's supposed to move one 32-bit FP into the low 32 bits of the xmm, and then shuffle it across bits 63..32, 95..64, and 127..96 of the XMM using SHUFPS xmm0, xmm0, 00h, but is instead trying to bend the 8-bit shuffle immediate over to each case offset. So it is half successful, ironically. Why option 3 is half-successful whereas my original option 1 didn't pack at all does go beyond me, though.

cxd4 commented 9 years ago

And thanks for sharing all that, by the way. Now I'm giving it a go in MSVC 2013's standard 64-bit Release mode optimizations.

MSVC is a bit slower with improvements to vectorization than GCC is, but it is also more steady and stubborn about it. MSVC 2010 and earlier had no auto-vectorization support whatsoever from what I could tell, so I have to use 2013 for this. (Not sure who has MSVC 2012?)

option 1 (my original loops) gave:

main PROC
    movss   xmm2, DWORD PTR leftrow
    movss   xmm3, DWORD PTR leftrow+4
    xor     eax, eax
    movaps  xmm0, xmm2
    movaps  xmm1, xmm2
    mulss   xmm0, DWORD PTR row
    mulss   xmm1, DWORD PTR row+4
    movss   DWORD PTR summand, xmm0
    movaps  xmm0, xmm2
    movss   DWORD PTR summand+4, xmm1
    movaps  xmm1, xmm3
    mulss   xmm0, DWORD PTR row+8
    mulss   xmm1, DWORD PTR row+20
    mulss   xmm2, DWORD PTR row+12
    movss   DWORD PTR summand+8, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+20, xmm1
    movss   DWORD PTR summand+12, xmm2
    movss   xmm2, DWORD PTR leftrow+8
    mulss   xmm0, DWORD PTR row+16
    movaps  xmm1, xmm2
    mulss   xmm1, DWORD PTR row+36
    movss   DWORD PTR summand+16, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+36, xmm1
    mulss   xmm0, DWORD PTR row+24
    mulss   xmm3, DWORD PTR row+28
    movss   DWORD PTR summand+24, xmm0
    movaps  xmm0, xmm2
    movss   DWORD PTR summand+28, xmm3
    movss   xmm3, DWORD PTR leftrow+12
    mulss   xmm0, DWORD PTR row+32
    movaps  xmm1, xmm3
    mulss   xmm1, DWORD PTR row+52
    movss   DWORD PTR summand+32, xmm0
    movaps  xmm0, xmm2
    mulss   xmm0, DWORD PTR row+40
    mulss   xmm2, DWORD PTR row+44
    movss   DWORD PTR summand+52, xmm1
    movss   DWORD PTR summand+40, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+44, xmm2
    mulss   xmm0, DWORD PTR row+48
    movss   DWORD PTR summand+48, xmm0
    movaps  xmm0, xmm3
    mulss   xmm3, DWORD PTR row+60
    mulss   xmm0, DWORD PTR row+56
    movss   DWORD PTR summand+60, xmm3
    movss   DWORD PTR summand+56, xmm0
    ret 0

If I'm not mis-counting the op-codes, it's quite the same situation as your GCC 4.4 output for that C.

option 2 (your new loop) gave:

main PROC
    movss   xmm2, DWORD PTR leftrow
    movups  xmm0, XMMWORD PTR row
    movss   xmm3, DWORD PTR leftrow+4
    xor     eax, eax
    shufps  xmm2, xmm2, 0
    shufps  xmm3, xmm3, 0
    mulps   xmm0, xmm2
    movss   xmm2, DWORD PTR leftrow+8
    movups  XMMWORD PTR summand, xmm0
    movups  xmm0, XMMWORD PTR row+16
    shufps  xmm2, xmm2, 0
    mulps   xmm0, xmm3
    movss   xmm3, DWORD PTR leftrow+12
    movups  XMMWORD PTR summand+16, xmm0
    movups  xmm0, XMMWORD PTR row+32
    shufps  xmm3, xmm3, 0
    mulps   xmm0, xmm2
    movups  XMMWORD PTR summand+32, xmm0
    movups  xmm0, XMMWORD PTR row+48
    mulps   xmm0, xmm3
    movups  XMMWORD PTR summand+48, xmm0
    ret

Quite comparable to your output with GCC 4.4, except with 4 additional instructions due to some extra moves. However it is clearly as successful.

Now the third option with the loop inside of a loop we both liked:

main PROC
    xor     eax, eax
    lea     rcx, OFFSET FLAT:leftrow
    lea     r8, OFFSET FLAT:__ImageBase
    lea     edx, QWORD PTR [rax+4]
$LL6@main:
    movss   xmm0, DWORD PTR [rcx]
    movups  xmm1, XMMWORD PTR row[rax+r8]
    lea     rax, QWORD PTR [rax+16]
    lea     rcx, QWORD PTR [rcx+4]
    shufps  xmm0, xmm0, 0
    mulps   xmm1, xmm0
    movups  XMMWORD PTR summand[rax+r8-16], xmm1
    dec     rdx
    jne     SHORT $LL6@main
    xor     eax, eax
    ret     0

...fails actually in unrolling the loop (although GCC 4.4 may have succeeded at this, it didn't completely unroll the loop in a well-enough way that it also reorganized the shuffling optimally), but succeeds in every other sense. In terms of readable assembly code it is interesting to look at nonetheless.

MSVC also continues to refuse to unroll the loop regardless of what optimization settings I try.

What interests me however, is that MSVC succeeded where GCC failed: It did shufps by 0, whereas GCC kept trying to change the immediate and inversely impacting the efficiency. Still, while MSVC is sometimes more intelligent, it tends to generate slightly extra instructions than it needs to.

Narann commented 9 years ago

Thanks for those valuable information. The conclusion to this seems to still be "write your code in order to make it humanly readable as compilers will never succeed to optimize it every time anyway".

I guess it's even more true as the number of instruction set archs tend to increase....

cxd4 commented 9 years ago

That's what C is best at. As the number of instruction set archs increases, individual compilers become more and more obligated to treat cross-platform, "humanly readable" code as being the same exact thing as "optimized" code. :) Of course, it will take many years before those two really do coincide as I'm sure you've seen. But it would be lovely for them to mean the same exact thing, wouldn't it? We're obviously not there now, but we're much closer than we were in the 90s.

There's a lot of fundamental truth to that philosophy you pose, but besides readability, accuracy is important also. Sometimes, if we refuse to write code for any convenience to the compiler's ability to optimize, and only for human readability, the end code is less accurate. For example, if we don't tempt the compiler into vectorizing readable code, even if sometimes by making it slightly less readable, we might not get any SIMD emulation of N64 RCP's SIMD at all. Never minding that this is slower: It is less accurate as well in concept. Vector CPU should be emulated by another vector CPU--besides being faster, I see it as accuracy. Even the use of even less readable, non-portable SSE intrinsic code can be justifiable under that logic ("accuracy" vs. "readability" of alg.), if we have no alternatives. But thankfully, we do!

But anyway, I'm trying not to spam this conversation's length with so much chatting. I just figured I'd drop a note--thanks to your information I sent twinaphex another pull request here. Figured I'd let you know just in case you thought it had any other improvement to readability. https://github.com/libretro/mupen64plus-libretro/pull/170

Narann commented 9 years ago

I agree, thanks for the discussion. If you are interested in GCC vectorization, few links and look for -ftree-vectorize and -ftree-vectorizer-verbose=2.

Keep the good work!

mupen64plus / mupen64plus-video-glide64mk2

Fix broken C version of MulMatricesC #37