Closed yupferris closed 3 years ago
Note that I didn't actually explore where/why these are imported now and weren't before, and perhaps there's a simple workaround if we can figure that out and avoid it.
These are about signed 64 integer mismatch between the C/C++ ABI and x86. I believe we can fix them by casting to unsigned long long in the right places, at least in the first case.
The second case seems like it's a performance disaster to end up in the _allshl
codepath in the first place.
I would say some optimization rules has changed in the compiler, meaning it's no longer able to avoid these calls. I wouldn't be too surprised, x86 is no longer an important optimization target for most types of applications, so it might be a risky venture to use a new compiler for it...
I also wouldn’t be entirely against not officially supporting x86 for vs2019 onwards and pushing x64 as the default everywhere once squishy x64 is available, fwiw. I really want x64 to be default for 64k moving forward even if it’s slightly larger today.
That said, it’s still probably worth looking for workaround(s) and I also have a patch to turn these errors at runtime into link errors. I’ll explore casting to unsigned and see where that gets us.
re _allmul
/_alldiv
, I think it's enough just to make DirectSoundRenderThread::GetPlayPositionMs
return unsigned int
and perform the mul/div unsigned as follows (Edit: reading your comments above, @kusma, I see that this is exactly what you suggested!):
diff --git a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
index f54d1d7..3d68c30 100644
--- a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
+++ b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
@@ -17,7 +17,7 @@ namespace WaveSabrePlayerLib
DirectSoundRenderThread(RenderCallback callback, void *callbackData, int sampleRate, int bufferSizeMs = 1000);
~DirectSoundRenderThread();
- int GetPlayPositionMs();
+ unsigned int GetPlayPositionMs();
private:
static DWORD WINAPI threadProc(LPVOID lpParameter);
diff --git a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
index fd50a11..dae28eb 100644
--- a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
+++ b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
@@ -26,7 +26,7 @@ namespace WaveSabrePlayerLib
WaitForSingleObject(thread, INFINITE);
}
- int DirectSoundRenderThread::GetPlayPositionMs()
+ unsigned int DirectSoundRenderThread::GetPlayPositionMs()
{
if (!buffer)
return 0;
@@ -49,7 +49,7 @@ namespace WaveSabrePlayerLib
totalBytesRead += bufferSizeBytes;
totalBytesRead += currentBytesRendered;
- return (int)(totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
+ return (unsigned int)((unsigned long long)totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
}
DWORD WINAPI DirectSoundRenderThread::threadProc(LPVOID lpParameter)
This function should never return negative values, as it represents how many ms have been rendered in total. This is in contrast to its caller, RealtimePlayer::GetSongPos
, which is meant to compensate for output buffer latency, and thus may be negative (before the initial buffer capacity has played out). For that, we just use double
and operate in seconds. It might also be an idea, if it's simpler, to use the same type/precision down the whole stack - though I suspect this may be more prone to (cumulative) rounding errors over time.
Alternatively, instead of keeping total bytes rendered in 64 bits, we could keep track of two 32-bit unsigned quantities, ms rendered, and fractional ms rendered, with enough precision to match our sample rate. This is more complicated/error-prone, though (edit: it may not be; we have 53 bits in double
, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need).
Getting rid of _allshl
is, surprisingly, a bit more tricky. It doesn't seem to matter whether or not phaseAsInt
is signed or unsigned (which makes sense, since it's a left shift); so far I haven't been able to work around it without decidedly losing precision.
(side note: there's a bit of irony in that reverting some of the sinusoid operations that I dug into last night may have allowed us to work around this easier, but I don't think that's a good enough reason to back out of the change still :) .)
re my comment earlier:
edit: it may not be; we have 53 bits in double, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need
I looked into this, and it turns out it's not actually simpler overall, since PreRenderPlayer
also does some calculations in ms, so it's simpler to have the conversion from bytes rendered -> ms rendered in one place.
Hey wait a minute! Since we already normalized the double phase to be in the [1, 2)
range, wouldn’t that mean its exponent is fixed? So we actually shouldn’t need the left shift at all (or at the very least that it’s constant and we can factor it another way)?
Ah, that would be true but we actually only guarantee the lower bound - and ofc something like fmod
to enforce the upper bound is abysmal performance-wise.
OK, I was wrong about the GetPlayPositionMs
patch above - once I hacked a workaround for _allshl
it actually failed at runtime again, this time with __aulldiv
, again in GetPlayPositionMs
. This is feeling a bit futile.
ha, ok, I actually found a configuration that works. Turns out _allshl
isn't required with /O2
(optimize for speed), and if I use double
for all of the time tracking in DirectSoundRenderThread
, neither are the other long-related ops. It feels a bit nasty, but it's enough that I wouldn't mind putting up a PR for further feedback.
Nice catch!
Move the code to the end of the source-file and do
#pragma optimize( "s", on)
right before? :P
More seriously, we should probably look at what code is generated in the case that don't need it; maybe the compiler realize we don't really need this shift to be 64-bits for some reason?
This is with /O2
:
auto significand = (unsigned int)((phaseAsInt << exponent) >> (52 - 32));
00406A78 movq xmm1,mmword ptr [phaseAsInt]
00406A7D shr eax,14h
00406A80 sub eax,3FFh
00406A85 movd xmm0,eax
00406A89 psllq xmm1,xmm0
00406A8D psrlq xmm1,14h
00406A92 movd ecx,xmm1
The first couple lines determine the exponent
, then there's the shifts. It seems both shifts (left and right) are still there, it's just that they use SSE2 MMX.
UGH, after making the PR I decided to test x64 again, and ofc, it's broken, likely due to /O2
: unresolved external symbol __vdecl_cos2
. And if I change back to /01
, __imp_floorf
and __imp_sqrtf
are missing instead, which is actually strange, as I thought I had this working a few months ago...
Yeah so master works, yet my branch doesn't, even if I manually change WaveSabreCore
to use /O1
again and rebuild. This is really fucked.
I guess we could make the /O1
//O2
option dependent on the target platform...
OK - I've been mixing up /O1
and /Oi
options in CMakeLists.txt
, which explains why those __imp_*
fn's were missing in the x64 build after my latest changes. With both /O2
and /Oi
, it works for x86; we still need /O1
//Oi
for x64.
Or, perhaps we can shim in fpuCos
for x64 as well?
FWIW I did get this to work by enabling MASM support and defining fpuCos
for x64 (as inline assembly isn't supported by MSVC for x64), and in theory that could be made to work with cmake nicely, and also be conditionally included for x64 only. However, the resulting squished executable appears to be about 1kb larger than its /01
counterpart. So simply choosing the optimization parameter based on the target architecture might actually be the "best" solution.
Or, perhaps we can implement our own shift function in this case, just for x86. I think we can make a fair number of assumptions about the input parameters, and we can use inline asm, so we shouldn't incur any performance hits. In fact, just using MMX like the optimized version did is probably a great strategy.
This is, of course, not too far off of just implementing the original intrinsic that's missing, however it can be slightly smaller/simpler to just support our single case (just like the math functions we currently implement), and avoid potential copyright issues that we were blissfully violating before by shipping the old msvcrt.lib binary blob.
That said, were we actually violating any copyright there? Could it be a valid solution to bring msvcrt.lib back for x86 and generate it for x64?
You know what, I don’t know why I didn’t think of it sooner - we can just use MMX intrinsics to implement the shifts in a way that matches what the optimized code was doing. This way we shouldn’t have to change any optimization opts or anything, and we’ll always get faster code regardless of x86/x64.
While x64 appears to work without fuss, x86 builds seem to rely on some MSVCRT functions that we have incorrect definitions for - particularly,
_alldiv
,_allmul
, and_allshl
. I've verified that when using the oldmsvcrt_old.lib
file that we used to ship as a binary blob, we could get things to work. I also tried to shim them in by hand with code from here (with some project config changes outlined here) and that also works, though I don't just want to copy that code for copyright reasons (though I'm not sure how else we can make similar shims in a nice way, maybe in C?).To be fair, I do have working x64 squishy builds, and there was even one WS exemusic entry released using it already. It would be really great if we could just start focusing on x64 and drop x86 altogether. However, the packer alone just isn't enough; I haven't gotten x64 compression on par with x86 and I'm suspecting we won't be able to close the gap as much as I'd hoped, as the code is just denser. So, I think we're stuck supporting both for the time being.