godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
91.38k stars 21.26k forks source link

Huge CPU consumption comparing in 4.4.dev5 compared to 4.3.stable #99593

Closed caioraphael1 closed 3 hours ago

caioraphael1 commented 2 days ago

Tested versions

v4.3-stable to 4.4-dev5

System information

Ryzen 3300x, 16gb RAM, GTX 1660, Windows 10, installed on a SSD

Issue description

I have some extremely simple projects to open and join a High-level multiplayer server with ENet. Both projects only have a _ready function to connect to each other and are set in Compatibility Render.

In v4.3, the editor and the app running both consume ~0.1% CPU, as shown at the top of the screenshot.

Meanwhile, in v4.4-dev5, each project consumes 15-18% CPU while in the editor or with the project opened, for a total of 30-36% CPU consumption if considering the editor and project together. This is shown at the bottom of the screenshot. If I tried exporting the project in v4.4-dev5 for win10, the consumption of the exported app is the same.

Image

I downgraded the projects back to v4.3-stable and the consumption dropped back to 0.1%.

This is happening in 4 different projects, with a mix of Forward+ and Compatibility.

I have to keep some of my projects in 4.4-dev5, as I new some of the new features for my network setup.

Steps to reproduce

Upgrade to 4.4-dev5

Minimal reproduction project (MRP)

The projects contain sensitive data.

xiaoliang6669 commented 1 day ago

Yes, that's the case for me too

Lippanon commented 1 day ago

Please consider stripping down one of the projects down to a reasonable minimum that still displays the regressive behaviour, or if you can build from source, bisecting it, otherwise it is very difficult to locate the regression source.

caioraphael1 commented 1 day ago

demo_gateway.zip

I've removed the DTLS certificates and reduced the project to a minimum. The bug still happens.

YYF233333 commented 1 day ago

I can reproduce this in an empty project. Here is what I got from VS profiler:

Image

99178 change the implementation of OS_Windows::delay_usec, previously it use Sleep, now it use a loop wait. Tested locally, this (df3367f334) is the first commit it appear.

Profile Data: busywait.zip

Lippanon commented 1 day ago

Copying relevant snippet from the PR:

"Still, this is the most precision we can possibly get here.

The problem with this is that spin-locking eats up CPU time and does absolutely nothing. This time could have been better used by other threads that actually have usefull work to do. Or if no other threads are running, the CPU core could have been switched into low-power mode in order to save power and battery life on mobile systems."

https://blog.bearcats.nl/accurate-sleep-function/

This trade-off of CPU usage for precision is likely expected, but perhaps it would be sensible (if possible) to have a project setting to toggle implementations?

caioraphael1 commented 1 day ago

This trade-off of CPU usage for precision is likely expected, but perhaps it would be sensible (if possible) to have a project setting to toggle implementations?

I don't know if the change in this PR is vital, but at least I don't think this should be enabled by default. Opening 2 simple projects editor and running both is making me go to 100% CPU consumption. I can't actually work on any of those projects, it's extremely laggy.

YYF233333 commented 1 day ago

Doing a quick search and I understand the situation. Win 10 has Waitable Timer which is efficient and precise, but we need to support Win 7, so a loop with YieldProcessor() is used, hoping it will release CPU. (seems not work on my computer)

I'd suggest use Waitable Timer for Win 10 and above, use busy wait only for Win 7.

YYF233333 commented 1 day ago

Doing a quick search and I understand the situation. Win 10 has Waitable Timer which is efficient and precise, but we need to support Win 7, so a loop with YieldProcessor() is used, hoping it will release CPU. (seems not work on my computer)

I'd suggest use Waitable Timer for Win 10 and above, use busy wait only for Win 7.

@caioraphael1 In case you need it, this is master branch with #99178 reverted: https://drive.google.com/file/d/1fjvpnflQ-taHSkgLCZX86xfz9bQzbICv/view?usp=sharing

caioraphael1 commented 1 day ago

@YYF233333 Thank you!!! CPU consumption is now at 25-35%, under the same conditions.

clayjohn commented 1 day ago

@YYF233333 The results are very interesting. In theory the change in https://github.com/godotengine/godot/pull/99178 should never spin with YieldProcessor() for more than 1.2 ms per frame.

Although, now that I am looking at the usages of delay_usec(), it seems like the audio driver often tries to delay by exactly 1ms. With this change, instead of sleeping for 1 ms, we spin for 1 ms because the threshold is 1.2 ms now.

I guess the proper solution will be to lower the threshold so that we actually sleep in such cases

What do you think @mrsaturnsan?

YYF233333 commented 1 day ago

In theory the change in #99178 should never spin with YieldProcessor() for more than 1.2 ms per frame.

Actually, no more than 2.02 ms (you need to sleep at least 1 ms).

~And I think the threshold is problematic. The precision of Sleep() is about 15.6 ms by default, means Sleep(x) actually sleep between x ms ~ x + 15.6 ms. So if you want to sleep 15ms or shorter, you cannot afford even once Sleep(). That tolerance should be 15000(us).~

Ok, I see that timeBeginPeriod(1), my bad.

mrsaturnsan commented 1 day ago

If precise sleep by default is causing too much spinning on the CPU, my change can be reverted and the precise sleeping can be moved to the FPS limiting code. Reason being that changing the threshold will throw off the frame-limiting and give inconsistent FPS again.

I think that should be a good balance?

Alternatively, we can add an imprecise and precise sleep function. Places that don't need accuracy can use the imprecise sleep.

mrsaturnsan commented 1 day ago

Actually, it might be interesting to revisit where delay_usec is getting called from.

@clayjohn What do you think about making a precise_delay_usec and a delay_usec function?

mrsaturnsan commented 1 day ago

Although, I'm not sure why you are seeing such high CPU consumption. Does your architecture not support yielding?

mrsaturnsan commented 1 day ago

Regarding YieldProcessor:

Signals to the processor to give resources to threads that are waiting for them. This macro is only effective on processors that support technology allowing multiple threads running on a single processor, such as Intel's Hyperthreading technology.

https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-yieldprocessor

mrsaturnsan commented 23 hours ago

Created a PR for fixing the unintended CPU usage cost: https://github.com/godotengine/godot/pull/99656

stesproject commented 13 hours ago

I add that the problem only occurs from dev5. I used dev4 until the other day and there wasn't this huge CPU consumption with the previous build.

YYF233333 commented 12 hours ago

By the way, @mrsaturnsan I'd like to ask what's the original purpose of #99178? It doesn't link to an issue, and I haven't ran into problem with framerate, looks a bit odd to me. Why we need so stable frametime?

Calinou commented 10 hours ago

Why we need so stable frametime?

Unstable frametimes result in jittery motion. VRR displays can attenuate this somewhat but you get VRR flicker when frametimes are unstable, due to the gamma curve constantly changing ever so slightly as the monitor adjusts its refresh rate.

Ideally, you should strive for the frametime graph to be a perfectly flat line, no matter whether it's limited by V-Sync or a FPS limiter.

Low-latency approaches like NVIDIA Reflex or NULL[^1] intentionally introduce some amount of microstuttering to reduce latency as much as possible, but this is a compromise that's best suited to VRR displays (which not everyone has) and competitive games.

[^1]: NVIDIA Ultra Low Latency mode (affects Direct3D 11 games when the low-latency mode is set to "Ultra" in NVIDIA Control Panel).

mrsaturnsan commented 9 hours ago

@YYF233333 Frame-rate locking was not stable. For example, locking to 30FPS on a 60Hz+ display would give an unstable frame-rate and cause jittering. A smooth locked 30FPS is a much better experience than an unsmooth 30FPS.

The previous sleep logic was not accurate, and was miss the refresh interval leading to the jittery frame-pacing.

YYF233333 commented 5 hours ago

I guess the proper solution will be to lower the threshold so that we actually sleep in such cases

Back with test results, run 10k times for each case, Period is the delay specified, and Tolerance is that param in the code.

Worst case seems to have no difference, for P99 you can see that max error is 1ms - tolerance.

Image Image

mrsaturnsan commented 4 hours ago

@YYF233333 Can you confirm if the issue is fixed with my PR?

YYF233333 commented 4 hours ago

@YYF233333 Can you confirm if the issue is fixed with my PR?

Yes, #99656 fix this.