narzoul / DDrawCompat

DirectDraw and Direct3D 1-7 compatibility, performance and visual enhancements for Windows Vista, 7, 8, 10 and 11
BSD Zero Clause License
878 stars 67 forks source link

Release DDThreadLock during the FPS limit spinlock #301

Closed CookiePLMonster closed 1 month ago

CookiePLMonster commented 2 months ago

Hey! We're working on Buccaneer that benefits greatly from DDrawCompat, and we use it to eliminate any compatibility issues the game had, and to limit the FPS via FpsLimiter=flipend(30). With this setup, we've noticed that the loading times in the game are prohibitively long and random, sometimes taking up to several minutes.

I dug into the game code and found out something rather horrifying - during the loading sequences, the game performs DDraw calls from two threads:

Now, Buccaneer is a horribly coded game, the worst one I have seen so far, but there is a way to improve this scenario - DDrawCompat takes a DDLock in every single wrapped call, effectively serializing use of DirectDraw. However, that lock is not relinquished during the busy loop in RealPrimarySurface::waitForFlipFpsLimit. In this PR, I introduced a scoped DDLock unlock to release the lock during this busy loop.

With this PR, loading times in Buccaneer are greatly improved, I can hardly spot the loading times at all. That said, I am not 100% sure if releasing the lock here is safe - perhaps the lock needs to be reinstated for the RealPrimarySurface::flush call inside both busy loops? The PR worked fine as-is for us in Buccaneer, but not every game may be this lucky.

narzoul commented 2 months ago

Did you check that FpsLimiter=msgloop is not appropriate?

The changes do look concerning, doesn't this exactly demonstrate how this breaks the FPS limiter, by allowing another thread to render at an unrestricted frame rate? I'm also not sure if it's safe to release the locks there, sounds like it could lead to deadlocks.

If you have the source code of the game, why not fix the issue there?

I'm also fine with shipping the game with a modified version of DDrawCompat, the license terms are intentionally as permissive as possible.

CookiePLMonster commented 2 months ago

Did you check that FpsLimiter=msgloop is not appropriate?

The game has ~140 message loops (really), so this might be risky, but I'll try.

The changes do look concerning, doesn't this exactly demonstrate how this breaks the FPS limiter, by allowing another thread to render at an unrestricted frame rate? I'm also not sure if it's safe to release the locks there, sounds like it could lead to deadlocks.

With the exception of the call to flush(), the locks are only released for a busy loop, allowing other threads to process resources. These wait functions are only called in Flip and Blt functions, so rendering should still be restricted. Rendering and resource creation should not be restricted by the FPS limiter, only flipping/blitting should.

If you have the source code of the game, why not fix the issue there?

I don't. Even if I did, compiling a 1997 game using Direct3D Retained Mode is probably quite a challenging task.

narzoul commented 2 months ago

With the exception of the call to flush(), the locks are only released for a busy loop, allowing other threads to process resources. These wait functions are only called in Flip and Blt functions, so rendering should still be restricted. Rendering and resource creation should not be restricted by the FPS limiter, only flipping/blitting should.

Blt is not limited by FpsLimiter. The purpose of waitForFlip() is not to limit FPS. It's to wait for a pending asynchronous flip with v-sync to finish, if the destination surface is involved in such a flip. This is also the native DDraw behavior, at least as long as the wait flag is specified, otherwise the function is supposed to return DDERR_WASSTILLDRAWWING. But I think I found the latter to be problematic and removed it from DDrawCompat, so now these function (plus Lock and GetDC) always wait.

Is the unlock really necessary in waitForFlip()? I think it could be worked around by setting VSync=off, in case the game uses v-sync by default.

CookiePLMonster commented 2 months ago

Blt is not limited by FpsLimiter. The purpose of waitForFlip() is not to limit FPS. It's to wait for a pending asynchronous flip with v-sync to finish, if the destination surface is involved in such a flip.

Ah, in this case it's my mistake, I just failed to realize that. You're right, it's better to keep the lock active for those calls - it's also the behaviour of newer APIs, for example Direct3D9's Present keeps the D3D lock active for the entire duration of the wait too.

Is the unlock really necessary in waitForFlip()? I think it could be worked around by setting VSync=off, in case the game uses v-sync by default.

With this info, I tried it again with the unlock removed from there - loading times became slightly longer, but that is expected. They are still uncomparably better compared to before the PR, so I would be entirely fine with keeping the unlock only inside waitForFlipFpsLimit() if that makes things safer and more correct.

This behaviour makes sense - the FPS limit lifts the thread priority to time critical, and DDrawCompat forces the CPU affinity to one core by default (we've not reconfigured that - the game is written so badly I wouldn't dare let it run freely across cores), which effectively meant the FPS loop was blocking everything else happening on other threads. With this change, our loading times are still fast and consistent, which arguably is even more important (with the lock, we'd observe extreme variance on the loading times, ranging from a few seconds to 2+ minutes).

narzoul commented 1 month ago

With this info, I tried it again with the unlock removed from there - loading times became slightly longer, but that is expected.

Why is it expected? That function doesn't wait at all, at least not with VSync disabled. If it still waits, that would be a bug, but I can't see anything wrong in the code there from a quick review.

This behaviour makes sense - the FPS limit lifts the thread priority to time critical, and DDrawCompat forces the CPU affinity to one core by default (we've not reconfigured that - the game is written so badly I wouldn't dare let it run freely across cores), which effectively meant the FPS loop was blocking everything else happening on other threads.

The FPS loops are not busy waiting, except within the last millisecond of the wait for extra accuracy. So for the most part, the waiting loops shouldn't block other threads, except those that are waiting on the DDThreadLock, unless the waiting time is always less than 1 ms. In retrospect, maybe the raised thread priority is not needed anyway, the idea was just to keep the accuracy of the wait as high as possible. But if a higher prio thread wants to keep the CPU busy, it can do so anyway as soon as the thread prio is lowered, so in the end, it probably makes no difference.

Did you check that FpsLimiter=msgloop is not appropriate?

Any luck with this? I've decided not to accept PRs for this project, but if msgloop doesn't solve the issue, I'll try adding some enhancement similar to your suggestion.

CookiePLMonster commented 1 month ago

Why is it expected? That function doesn't wait at all, at least not with VSync disabled. If it still waits, that would be a bug, but I can't see anything wrong in the code there from a quick review.

I conducted my re-test with VSync enabled, so it makes sense that loading times got slower when they were still capped to 30FPS.

The FPS loops are not busy waiting, except within the last millisecond of the wait for extra accuracy. So for the most part, the waiting loops shouldn't block other threads, except those that are waiting on the DDThreadLock

That's precisely what is happening in Buccaneer during the loading screens - they spin up a separate thread to present in a busy loop, while the main thread creates DDraw resources. Without my change, this leads to the "render thread" blocking the main thread on its own FPS limiter, which leads to our loading times being extraordinarily slow. With my change, threads block only for VSync (as per your suggestion) which would be equivalent to the original DDraw behaviour, without the "extra" FPS limiter imposing a block.

Any luck with this? I've decided not to accept PRs for this project, but if msgloop doesn't solve the issue, I'll try adding some enhancement similar to your suggestion.

I tried it now, and as I anticipated, msgloop(30) isn't appropriate for this game. Their message handling is confusing and not "by the book", as they appear to have a separate small message loop for every single context in the game they need input for (140+ occurrences of PeekMessageA in the game's binary), and they process it immediately, without dispatching to the message proc. I observed two issues:

None of those issues manifested with flipend(30), so I think for Buccaneer we have no other choice but to keep it as-is, with my unlock change on top.


For the time being, we're going to stick to this solution, once/if you devise your own enhancements I'll then re-evaluate it. That said, considering this PR's simplicity I don't necessarily see the point in not getting it in as-is, unless you disagree with the principle of it. For all I care you could even close that PR and copypaste these changes as your own commit.

EDIT: Looking at my PR again, I think g_qpcDelayedFlipEnd = Time::queryPerformanceCounter(); should be done under a lock. I might need to introduce a scope to exclude this one line from running with the lock released.

narzoul commented 1 month ago

The reason this doesn't really add up is that the issue should be then present with native DirectDraw too, since it would also wait in Flip (and maybe in other operations, like Blt), while holding the DDThreadLock. The only difference is that the wait time would be synced to 60Hz instead of your desired setting of 30.

Anyway, instead of continuing to guess here, I checked out the game for myself. It turns out the only change needed to fix the long loading screens (at least on my system) is to set ThreadPriorityBoost=on in DDrawCompat.ini. This is probably what the game has always depended on. It's normally enabled by default in Windows, but I found that turning it off resolves heavy stuttering in some games, hence why the default value is off in DDrawCompat. It does cause scheduling issues in a few cases though, like this one.

CookiePLMonster commented 1 month ago

This is probably what the game has always depended on

Was it a thing in 1997?

EDIT: That said yes, I reverted to a mainline v0.5.2 and added ThreadPriorityBoost=on on top of our existing FpsLimiter=flipend(30) and the loading times are now equally fast to my modified DDrawCompat. That is a very interesting result to say the least.

With this in mind, do you wish this PR to be closed? The build with a modified DDrawCompat v0.5.2 has not been QA'd yet, so I can swap out my modded DLL for the mainline one + INI option, and get it tested this way.

narzoul commented 1 month ago

Was it a thing in 1997?

Yes, according to this: https://beej.us/guide/win95sch.html

With this in mind, do you wish this PR to be closed?

It can be closed, yes.

CookiePLMonster commented 1 month ago

Fair enough, thank you for looking into this! On paper, priority boosts have a chance to improve loading times even better than my PR, considering that my changes had no effect on vsync, while priority boosts might.

I will keep the code around though, just in case - if the QA session reveals loading times are still random, I'll have them try my modified DLL, but if tests pass with only ThreadPriorityBoost=on, I'll discard it.