Fix out-of-the-box vs2019 support for x86

yupferris commented 3 years ago

While x64 appears to work without fuss, x86 builds seem to rely on some MSVCRT functions that we have incorrect definitions for - particularly, _alldiv, _allmul, and _allshl. I've verified that when using the old msvcrt_old.lib file that we used to ship as a binary blob, we could get things to work. I also tried to shim them in by hand with code from here (with some project config changes outlined here) and that also works, though I don't just want to copy that code for copyright reasons (though I'm not sure how else we can make similar shims in a nice way, maybe in C?).

To be fair, I do have working x64 squishy builds, and there was even one WS exemusic entry released using it already. It would be really great if we could just start focusing on x64 and drop x86 altogether. However, the packer alone just isn't enough; I haven't gotten x64 compression on par with x86 and I'm suspecting we won't be able to close the gap as much as I'd hoped, as the code is just denser. So, I think we're stuck supporting both for the time being.

yupferris commented 3 years ago

Note that I didn't actually explore where/why these are imported now and weren't before, and perhaps there's a simple workaround if we can figure that out and avoid it.

yupferris commented 3 years ago

_alldiv and _allmul both come from this line. _allshl comes from this line.

Again, I'm not entirely sure why these didn't cause issues in previous vs versions, and perhaps it's worth looking into what the compiler used to generate and see if we can get it to do the same for vs2019.

kusma commented 3 years ago

These are about signed 64 integer mismatch between the C/C++ ABI and x86. I believe we can fix them by casting to unsigned long long in the right places, at least in the first case.

kusma commented 3 years ago

The second case seems like it's a performance disaster to end up in the _allshl codepath in the first place.

I would say some optimization rules has changed in the compiler, meaning it's no longer able to avoid these calls. I wouldn't be too surprised, x86 is no longer an important optimization target for most types of applications, so it might be a risky venture to use a new compiler for it...

yupferris commented 3 years ago

I also wouldn’t be entirely against not officially supporting x86 for vs2019 onwards and pushing x64 as the default everywhere once squishy x64 is available, fwiw. I really want x64 to be default for 64k moving forward even if it’s slightly larger today.

That said, it’s still probably worth looking for workaround(s) and I also have a patch to turn these errors at runtime into link errors. I’ll explore casting to unsigned and see where that gets us.

yupferris commented 3 years ago

re _allmul/_alldiv, I think it's enough just to make DirectSoundRenderThread::GetPlayPositionMs return unsigned int and perform the mul/div unsigned as follows (Edit: reading your comments above, @kusma, I see that this is exactly what you suggested!):

diff --git a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
index f54d1d7..3d68c30 100644
--- a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
+++ b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
@@ -17,7 +17,7 @@ namespace WaveSabrePlayerLib
                DirectSoundRenderThread(RenderCallback callback, void *callbackData, int sampleRate, int bufferSizeMs = 1000);
                ~DirectSoundRenderThread();

-               int GetPlayPositionMs();
+               unsigned int GetPlayPositionMs();

        private:
                static DWORD WINAPI threadProc(LPVOID lpParameter);
diff --git a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
index fd50a11..dae28eb 100644
--- a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
+++ b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
@@ -26,7 +26,7 @@ namespace WaveSabrePlayerLib
                WaitForSingleObject(thread, INFINITE);
        }

-       int DirectSoundRenderThread::GetPlayPositionMs()
+       unsigned int DirectSoundRenderThread::GetPlayPositionMs()
        {
                if (!buffer)
                        return 0;
@@ -49,7 +49,7 @@ namespace WaveSabrePlayerLib
                        totalBytesRead += bufferSizeBytes;
                totalBytesRead += currentBytesRendered;

-               return (int)(totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
+               return (unsigned int)((unsigned long long)totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
        }

        DWORD WINAPI DirectSoundRenderThread::threadProc(LPVOID lpParameter)

This function should never return negative values, as it represents how many ms have been rendered in total. This is in contrast to its caller, RealtimePlayer::GetSongPos, which is meant to compensate for output buffer latency, and thus may be negative (before the initial buffer capacity has played out). For that, we just use double and operate in seconds. It might also be an idea, if it's simpler, to use the same type/precision down the whole stack - though I suspect this may be more prone to (cumulative) rounding errors over time.

Alternatively, instead of keeping total bytes rendered in 64 bits, we could keep track of two 32-bit unsigned quantities, ms rendered, and fractional ms rendered, with enough precision to match our sample rate. This is more complicated/error-prone, though (edit: it may not be; we have 53 bits in double, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need).

yupferris commented 3 years ago

Getting rid of _allshl is, surprisingly, a bit more tricky. It doesn't seem to matter whether or not phaseAsInt is signed or unsigned (which makes sense, since it's a left shift); so far I haven't been able to work around it without decidedly losing precision.

(side note: there's a bit of irony in that reverting some of the sinusoid operations that I dug into last night may have allowed us to work around this easier, but I don't think that's a good enough reason to back out of the change still :) .)

yupferris commented 3 years ago

re my comment earlier:

edit: it may not be; we have 53 bits in double, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need

I looked into this, and it turns out it's not actually simpler overall, since PreRenderPlayer also does some calculations in ms, so it's simpler to have the conversion from bytes rendered -> ms rendered in one place.

yupferris commented 3 years ago

Hey wait a minute! Since we already normalized the double phase to be in the [1, 2) range, wouldn’t that mean its exponent is fixed? So we actually shouldn’t need the left shift at all (or at the very least that it’s constant and we can factor it another way)?

yupferris commented 3 years ago

Ah, that would be true but we actually only guarantee the lower bound - and ofc something like fmod to enforce the upper bound is abysmal performance-wise.

yupferris commented 3 years ago

OK, I was wrong about the GetPlayPositionMs patch above - once I hacked a workaround for _allshl it actually failed at runtime again, this time with __aulldiv, again in GetPlayPositionMs. This is feeling a bit futile.

yupferris commented 3 years ago

ha, ok, I actually found a configuration that works. Turns out _allshl isn't required with /O2 (optimize for speed), and if I use double for all of the time tracking in DirectSoundRenderThread, neither are the other long-related ops. It feels a bit nasty, but it's enough that I wouldn't mind putting up a PR for further feedback.

kusma commented 3 years ago

Nice catch! Move the code to the end of the source-file and do #pragma optimize( "s", on) right before? :P

kusma commented 3 years ago

More seriously, we should probably look at what code is generated in the case that don't need it; maybe the compiler realize we don't really need this shift to be 64-bits for some reason?

yupferris commented 3 years ago

This is with /O2:

        auto significand = (unsigned int)((phaseAsInt << exponent) >> (52 - 32));
00406A78  movq        xmm1,mmword ptr [phaseAsInt]  
00406A7D  shr         eax,14h  
00406A80  sub         eax,3FFh  
00406A85  movd        xmm0,eax  
00406A89  psllq       xmm1,xmm0  
00406A8D  psrlq       xmm1,14h  
00406A92  movd        ecx,xmm1

The first couple lines determine the exponent, then there's the shifts. It seems both shifts (left and right) are still there, it's just that they use ~~SSE2~~ MMX.

yupferris commented 3 years ago

UGH, after making the PR I decided to test x64 again, and ofc, it's broken, likely due to /O2: unresolved external symbol __vdecl_cos2. And if I change back to /01, __imp_floorf and __imp_sqrtf are missing instead, which is actually strange, as I thought I had this working a few months ago...

yupferris commented 3 years ago

Yeah so master works, yet my branch doesn't, even if I manually change WaveSabreCore to use /O1 again and rebuild. This is really fucked.

yupferris commented 3 years ago

I guess we could make the /O1//O2 option dependent on the target platform...

yupferris commented 3 years ago

OK - I've been mixing up /O1 and /Oi options in CMakeLists.txt, which explains why those __imp_* fn's were missing in the x64 build after my latest changes. With both /O2 and /Oi, it works for x86; we still need /O1//Oi for x64.

Or, perhaps we can shim in fpuCos for x64 as well?

yupferris commented 3 years ago

FWIW I did get this to work by enabling MASM support and defining fpuCos for x64 (as inline assembly isn't supported by MSVC for x64), and in theory that could be made to work with cmake nicely, and also be conditionally included for x64 only. However, the resulting squished executable appears to be about 1kb larger than its /01 counterpart. So simply choosing the optimization parameter based on the target architecture might actually be the "best" solution.

yupferris commented 3 years ago

Or, perhaps we can implement our own shift function in this case, just for x86. I think we can make a fair number of assumptions about the input parameters, and we can use inline asm, so we shouldn't incur any performance hits. In fact, just using MMX like the optimized version did is probably a great strategy.

This is, of course, not too far off of just implementing the original intrinsic that's missing, however it can be slightly smaller/simpler to just support our single case (just like the math functions we currently implement), and avoid potential copyright issues that we were blissfully violating before by shipping the old msvcrt.lib binary blob.

That said, were we actually violating any copyright there? Could it be a valid solution to bring msvcrt.lib back for x86 and generate it for x64?

yupferris commented 3 years ago

You know what, I don’t know why I didn’t think of it sooner - we can just use MMX intrinsics to implement the shifts in a way that matches what the optimized code was doing. This way we shouldn’t have to change any optimization opts or anything, and we’ll always get faster code regardless of x86/x64.

logicomacorp / WaveSabre

Fix out-of-the-box vs2019 support for x86 #59