High (100%) CPU usage on Linux/Nvidia, even with vsync enabled

64kramsystem commented 3 years ago

Seems to be the same issue as #257, but leaving here for tracking purposes in case somebody searches.

On my system, any Macroquad program causes CPU to run at 100%+. This is unique to Macroquad - all the other engines I've tried don't have this problem (Amethist/Tetra/Ggez).

I'm on Ubuntu 20.04, and vsync is enabled.

ar37-rs commented 3 years ago

for temporary solution, put main thread to sleep for a while after next_frame: next_frame().await; std:: thread ::sleep(Duration::from_millis(10));

64kramsystem commented 2 years ago

(Cross-posted from #257, as this is the correct location).

Got this on Linux.

The backtrace shows that glFinish() ends with the nvidia driver calling sched_yield(), presumably in a busy-wait loop. (To confirm, with strace I'm counting 60k sched_yield syscalls per second.)

As suggested in #257 (comment) this seems to be a nvidia driver issue. A web search finds related issues in a couple of projects (e.g. gnome), but I don't understand what the solution was.

With the information at hand (thanks @martinxyz), and after some digging, I think it's an overlook ("bug-ish" sort of thing) in MQ.

I've checked the glFinish() invocation history, and it is used for telemetry:

commit 05e69c49c5e8b5281f7f6645882168628e45b89f
Author: Fedor Logachev <not.fl3@gmail.com>
Date:   Thu Jan 7 12:11:04 2021 -0600

    telemetry: More internal metrics

diff --git a/profiler/src/lib.rs b/profiler/src/lib.rs
index 0317101..4286e55 100644
--- a/profiler/src/lib.rs
+++ b/profiler/src/lib.rs
@@ -188,14 +188,20 @@ fn zone_ui(ui: &mut Ui, zone: &Zone, n: usize) {
                     }
                 }

-                ui.label(
-                    None,
-                    &format!(
-                        "Full frame time: {:.3}ms {:.1}(1/t)",
-                        get_frame_time(),
-                        (1.0 / get_frame_time())
-                    ),
-                );
+                if let Some(frame) = state
+                    .selected_frame
+                    .as_ref()
+                    .or_else(|| state.frames_buffer.get(0))
+                {
+                    ui.label(
+                        None,
+                        &format!(
+                            "Full frame time: {:.3}ms {:.1}(1/t)",
+                            frame.full_frame_time,
+                            (1.0 / frame.full_frame_time)
+                        ),
+                    );
+                }

                 if state.paused {
                     if ui.button(None, "resume") {
diff --git a/src/lib.rs b/src/lib.rs
index 486e01c..4e3bd7f 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -247,6 +247,8 @@ fn draw(&mut self) {
             let _z = telemetry::ZoneGuard::new("Event::draw");

             if let Some(future) = unsafe { MAIN_FUTURE.as_mut() } {
+                let _z = telemetry::ZoneGuard::new("Main loop");
+
                 get_context().begin_frame();

                 if exec::resume(future) {
@@ -263,6 +265,16 @@ fn draw(&mut self) {

             get_context().frame_time = date::now() - get_context().last_frame_time;
             get_context().last_frame_time = date::now();
+
+            #[cfg(any(target_arch = "wasm32", target_os = "linux"))]
+            {
+                let _z = telemetry::ZoneGuard::new("glFinish/glFLush");
+
+                unsafe {
+                    miniquad::gl::glFlush();
+                    miniquad::gl::glFinish();
+                }
+            }
         }

         telemetry::reset();

I've had a look around, and for example, Intel discourages using glFinish() in its OpenGL optimization guide.

Potentially, this could be a misuse of the OpenGL APIs, but I'm no expert in this.

Regardless, at the very least, telemetry should be preset according to the profile (e.g. disabled in release mode, and enabled in debug mode), or anyway, it should be configurable, and if enabled in release mode, it should be very clearly documented that it can potentially hurt performance.

martinxyz commented 2 years ago

Out of curiosity, I switched to the nouveau driver (open source) and the CPU load didn't disappear.

The traceback was still going through the same glFinish() call, though usually ending at poll() (which would be good) and almost never at sched_yield(). Not 100% trusting this. Anyway, strace still showed ~8'000 sched_yield calls per second.

When I comment out the glFlush(); glFinish(); lines added by the telemetry commit (see previous comment by @64kramsystem) then the sched_yield() syscalls go away.

My "Ctrl-C" tracebacks in gdb now go from miniquad::graphics::Context::clear() through dri_image_drawable_get_buffers() and end up in a poll() syscall.

CPU load also seems lower, but I'm having trouble measuring it reliably. Load average and top disagree, and it seems to depend on the window size. A larger window gives lower(?!) CPU load.

In case it matters, all my tests were on Gnome3 and Xorg (not Wayland).

mhcerri commented 1 year ago

@not-fl3, what exactly are you trying to achieve with this piece of code bellow? Is there any reason it is only used for linux and wasm32?

@@ -263,6 +265,16 @@ impl EventHandlerFree for Stage {

             get_context().frame_time = date::now() - get_context().last_frame_time;
             get_context().last_frame_time = date::now();
+
+            #[cfg(any(target_arch = "wasm32", target_os = "linux"))]
+            {
+                let _z = telemetry::ZoneGuard::new("glFinish/glFLush");
+
+                unsafe {
+                    miniquad::gl::glFlush();
+                    miniquad::gl::glFinish();
+                }
+            }
         }

         telemetry::reset();

glFinish() is known for causing busy waits with Nvidia drivers. If that is really needed, do you think it would be possible to only call it when the profiler is enabled?

Thanks for the awesome project!

DissolveDZ commented 4 months ago

this is still an active issue, I might do a pull request soon if i get around to it

not-fl3 / macroquad

High (100%) CPU usage on Linux/Nvidia, even with vsync enabled #275