kovidgoyal / kitty

Cross-platform, fast, feature-rich, GPU based terminal
https://sw.kovidgoyal.net/kitty/
GNU General Public License v3.0

Randomly goes to 100% CPU #426

Closed olynch closed 4 years ago

olynch commented 6 years ago

Every so often, kitty starts taking up 100% CPU and I have to close all kitty windows and restart to get it back to normal. I have no idea what causes it, I can run some diagnostics next time it happens if you tell me what to do.

kovidgoyal commented 6 years ago

There are instructions in the README on how to do performance profiling. If you run kitty like that, you should be able to get some info on which functions are using the CPU. Also post the output of

kitty --debug-config

Note that you need kitty 0.8.3 for that.

josswright commented 6 years ago

Just noting that I came here to report a similar issue. I'll go and read the README with instructions and see if I can profile it.

I have noticed that it seems to happen for me when I have kitty 'hidden' under the i3 window manager by moving it to the scratchpad. (https://i3wm.org/docs/userguide.html#_scratchpad). I haven't noticed yet whether it happens just if kitty is on another non-visible workspace. This is happening relatively frequently, but I haven't been able to trigger the behaviour on-demand.

For reference, when I bring the kitty window back from the scratchpad in i3 the CPU usage drops again.

I'll see if I can profile it, but maybe this might help someone else track it down as well.

inducer commented 6 years ago

Also seeing this--in my situation, it's while the screensaver is on with Gnome 3.

Backtrace while the 100% CPU is happening:

#0  0x00007f173e71396f in __GI___select (nfds=5, readfds=0x7ffff83682a0, writefds=0x0, exceptfds=0x0, timeout=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00007f17382d0af5 in glfwWaitEvents () from /home/andreas/pack/kitty/kitty/glfw-x11.so
#2  0x00007f173db7c00c in main_loop.lto_priv () from /home/andreas/pack/kitty/kitty/fast_data_types.so
#3  0x00000000004c5352 in _PyCFunction_FastCallKeywords ()
#4  0x000000000054ffe4 in ?? ()
#5  0x00000000005546cf in _PyEval_EvalFrameDefault ()
#6  0x000000000054f0e8 in ?? ()
#7  0x0000000000550116 in ?? ()
#8  0x00000000005546cf in _PyEval_EvalFrameDefault ()
#9  0x000000000054f0e8 in ?? ()
#10 0x0000000000550116 in ?? ()
#11 0x00000000005546cf in _PyEval_EvalFrameDefault ()
#12 0x000000000054f0e8 in ?? ()
#13 0x0000000000550116 in ?? ()
#14 0x00000000005546cf in _PyEval_EvalFrameDefault ()
#15 0x000000000054f0e8 in ?? ()
#16 0x0000000000550116 in ?? ()
#17 0x00000000005546cf in _PyEval_EvalFrameDefault ()
#18 0x000000000054fbe1 in ?? ()
#19 0x0000000000550b93 in PyEval_EvalCode ()
#20 0x000000000054dbfa in ?? ()
#21 0x00000000004c53cd in _PyCFunction_FastCallKeywords ()
#22 0x000000000054ffe4 in ?? ()
#23 0x00000000005546cf in _PyEval_EvalFrameDefault ()
#24 0x000000000054fbe1 in ?? ()
#25 0x000000000054fe6d in ?? ()
#26 0x00000000005546cf in _PyEval_EvalFrameDefault ()
#27 0x000000000054fbe1 in ?? ()
#28 0x0000000000550c0e in PyEval_EvalCodeEx ()
#29 0x000000000048c0d6 in ?? ()
#30 0x0000000000459eee in PyObject_Call ()
#31 0x00000000004410a7 in ?? ()
#32 0x00000000004421b3 in Py_Main ()
#33 0x0000000000421ff4 in main ()
kovidgoyal commented 6 years ago

According to that backtrace, kitty is waiting in a select call with no timeout, so I don't see how it could be using 100% CPU.

inducer commented 6 years ago

Thread 2 (by gdb's numbering) for completeness, looks similar:

#0  0x00007f173e7115d9 in __GI___poll (fds=0x7f173e2935e0 <fds>, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f173db7a6d3 in io_loop () from /home/andreas/pack/kitty/kitty/fast_data_types.so
#2  0x00007f173f5cc5aa in start_thread (arg=0x7f1733741700) at pthread_create.c:463
#3  0x00007f173e71bcbf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I agree that it's mystifying why this would eat 100% CPU, unless one of the FDs perpetually had data... I have this process sitting in GDB. If there's something you'd like me to try, let me know.

kovidgoyal commented 6 years ago

It should be easy enough to check whether the select/poll is actually waiting or not. Simply stick a printf() into main_loop() and io_loop() in child-monitor.c.

kovidgoyal commented 6 years ago

Something like:

printf("[%f] main_loop tick\n", montonic());

Arnaudv6 commented 6 years ago

My experience is that of @josswright: I am using openbox (with compton) and kitty with transparency. kitty often uses 100% of a thread (not 100% of all cores, thank goodness). This happens only when the window is hidden (xdotool windowunmap $wid), and not always/reproducibly even then. When I show the window again, kitty drops its CPU usage back to normal.

kovidgoyal commented 6 years ago

The only way I can see that happening is if your window manager is sending continuous events to kitty when hidden. Somebody that can reproduce this will need to debug it to find out why the window manager is doing that.

josswright commented 6 years ago

I'm afraid that I'm not in a position to debug right now, but it seems worth mentioning that I'm not using transparency.

katef commented 6 years ago

I regularly see this too. Using herbstluftwm (a tiling window manager) on Xorg, no transparency, no compositing. I typically run several terminals at once. What I've noticed is that all instances simultaneously jump to 100%/n CPU each.

If I leave my machine alone for long enough (even without a screensaver coming on), this almost always happens. Sometimes a terminal is visible on the screen at the time, but most of them aren't visible. I don't know if the visible terminal is also at 100% CPU, but I suspect not.

They stay that way, until I do something to snap them out of it. Switching desktops or sometimes even just pressing shift (!) seems to do it, or just moving the mouse. Not always, though. So it'd seem reasonable that X events are involved somehow.

shoober420 commented 6 years ago

This may be similar to why all my cores go to 100% when I close numerous kitty processes with my window manager (#608 and #609). I spawn several kitty processes with my window manager keybind; I'm not just spawning a window or tab within kitty, each is an entirely new kitty process. When I close these processes with my window manager keybind, the windows disappear but remain running in the background. So maybe, since they are "hidden" windows, it bugs out and causes my CPU cores to spike to 100%.

equaeghe commented 6 years ago

Also happening with sway (wayland clone of i3wm) for kitty 0.10.1. My laptop's fans start to activate shortly after I lock the screen (with swaylock).

kovidgoyal commented 5 years ago

@shoober420 I highly doubt it. Closing windows does not hide them, it destroys them. Your issues are almost certainly caused by buggy graphics drivers.

@equaeghe You should upgrade to 0.11 it includes a fix I made for a glfw bug that might be relevant: https://github.com/glfw/glfw/issues/1281

shoober420 commented 5 years ago

The reason I say they are "hidden" is because although the windows close, the processes associated with them keep running instead of closing along with them. I have an NVIDIA GeForce GTX 760 using the latest proprietary drivers and X.Org 1.20 with bspwm.

kovidgoyal commented 5 years ago

There is no code in kitty to hide windows, only destroy them.

shoober420 commented 5 years ago

Yes, but when the window manager attempts to destroy the window, the process associated with that specific window remains open. So although the window is destroyed, the process for it is not destroyed along with it, making it "hidden" in a sense. You could also say it's "invisible" or "minimized".

kovidgoyal commented 5 years ago

All those terms have very specific meanings, none of which can possibly apply to kitty. In the bug reports you posted, I pointed you to where in the code you can check what is happening. Do that; speculating without it is pointless.

afranchuk commented 5 years ago

I've seen this occur too, after leaving my workstation with xscreensaver up. When I come back and log in, X is going crazy slow and I can't interact with kitty. I have to (eventually) send a SIGKILL to kitty. Then X calms down and the system becomes responsive again. What's worse, though, is that somehow the X/kitty interaction stays in this state, so if I try to launch a new kitty process when none are running, things just go to 100% and I can't interact for a number of seconds. Then the kitty process is very unresponsive. This doesn't always happen; I've had it happen 3 times now, only after being away for an hour or more. Rebooting the machine fixes the issue. I haven't tried explicitly restarting X, but that would likely fix it too.

shoober420 commented 5 years ago

For me, when those dormant kitty processes are running in the background without a visible window present, killing X doesn't kill the kitty processes. Since there are multiple instances of the process running, you have to find the PID of the current kitty process to close the correct one without destroying the process that has the visible window, so it's a pain in the ass.

It's best to just quit X, kill all kitty processes outside of X at the terminal with htop, then launch X again.

inducer commented 5 years ago

Here's an odd observation I made last night. My computer had a few Kittys (kitties?) running, I was not at the keyboard, but the screensaver was not yet on. I was ssh'd into the machine and could see the kitties going 100% cpu in htop. I was using a Bluetooth keyboard at the time that can switch between multiple devices. Switching to the computer in question (from another device) stopped the kitties from using 100% CPU. But the second I switched the Bluetooth keyboard back to my other device, they started going full clip again.

I don't know if that observation helps, but I figured I would report it.

freed00m commented 5 years ago

I see 100% throttle on kitty processes when I unplug my USB keyboard.

tormodatt commented 5 years ago

kitty uses 99%+ of CPU whenever I use the diff kitten. When I quit the diff, it goes straight back to normal. I can trigger it on demand.

I'm running on OS X, kitty 0.12.3

kovidgoyal commented 5 years ago

I cannot reproduce with current kitty; update, your version is pretty old.

tormodatt commented 5 years ago

sorry, I didn't realise that brew outdated only displayed non-casks.

I still have the same issue on 0.13.2 though.

both on kitty +kitten diff and git difftool --no-symlinks --dir-dif

The diff command creates a second kitty process, which is the one that hogs the CPU. This process is killed when exiting the diff.

Tried with both fish shell and bash shell, and with both an empty diff.conf and kitty.conf. Still the same issue.

kovidgoyal commented 5 years ago

Then it is likely something macOS-specific in the diff kitten, not kitty generally; open a separate bug report for it.

disruptek commented 5 years ago

I'm running kitty-HEAD on sway-HEAD against mesa-18.3.5 on Linux. I can reliably reproduce the issue by toggling dpms with

swaymsg "output * dpms off"
swaymsg "output * dpms on"

All kitty processes consume 100% of a core and their windows vanish from the compositor. Here are some logs from sway:

2019-03-20 16:09:26 - [sway-9999/sway/commands.c:253] Handling command 'output * dpms off'
2019-03-20 16:09:26 - [wlroots-9999/backend/drm/legacy.c:21] HDMI-A-1: Failed to page flip: Invalid argument
2019-03-20 16:09:26 - [wlroots-9999/backend/drm/legacy.c:21] HDMI-A-1: Failed to page flip: Invalid argument
2019-03-20 16:09:26 - [sway-9999/sway/ipc-server.c:528] IPC Client 69 disconnected
2019-03-20 16:09:26 - [sway-9999/sway/commands.c:253] Handling command 'output * dpms on'
2019-03-20 16:09:26 - [wlroots-9999/backend/drm/drm.c:807] Skipping pageflip on output 'HDMI-A-1'
2019-03-20 16:09:26 - [sway-9999/sway/ipc-server.c:528] IPC Client 69 disconnected
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1075] Scanning DRM connectors
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1233] 'HDMI-A-1' disconnected
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1439] Emitting destruction signal for 'HDMI-A-1'
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1075] Scanning DRM connectors
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1174] 'HDMI-A-1' connected
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1181] Physical size: 1100x620
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1192] Detected modes:
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   3840x2160@30000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   3840x2160@29970
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   3840x2160@25000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   3840x2160@24000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   3840x2160@23976
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1920x1080@60000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1920x1080@59940
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1920x1080@50000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1920x1080@24000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1920x1080@23976
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1440x900@59887
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1280x960@60000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1280x720@60000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1280x720@59940
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1280x720@50000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1024x768@75029
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1024x768@70069
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   1024x768@60004
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   800x600@75000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   800x600@72188
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   800x600@60317
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   720x576@50000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   720x480@60000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   720x480@59940
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   640x480@75000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   640x480@72809
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   640x480@60000
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   640x480@59940
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1213]   720x400@70082
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:1288] Requesting modeset for 'HDMI-A-1'
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:542] Modesetting 'HDMI-A-1' with '3840x2160@30000 mHz'
2019-03-20 16:09:27 - [wlroots-9999/backend/drm/drm.c:807] Skipping pageflip on output 'HDMI-A-1'
error in client communication (pid 22941)
[destroyed object]: error 6: size overflow for plane 0
error in client communication (pid 3275)
[destroyed object]: error 6: size overflow for plane 0
error in client communication (pid 3363)
[destroyed object]: error 6: size overflow for plane 0

If I throw a sleep between the dpms off/on, then I can sometimes keep the active kitty, but it's not reliable. What does seem to be reliable is that the pids mentioned correspond to the kitty processes that vanish from the compositor, and the number of skipped/failed page flips corresponds similarly.

I will dig a little deeper, but if there's something in particular to look at, feel free to chime in and point me in the right direction. And, thank you for kitty. :heart:

disruptek commented 5 years ago

(gdb) backtrace 8
#0  0x00007ffff7314ed8 in __pthread_mutex_unlock_usercnt ()
   from /lib64/libpthread.so.0
#1  0x00007fffeae8ee85 in wl_display_prepare_read_queue ()
   from /usr/lib64/libwayland-client.so.0
#2  0x00007fffeb0b116d in handleEvents (timeout=<optimized out>)
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/glfw/wl_window.c:849
#3  _glfwPlatformWaitEvents ()
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/glfw/wl_window.c:1343
#4  _glfwPlatformRunMainLoop (data=0x7fffebf10870,
    callback=0x7fffee7630f0 <process_global_state>)
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/glfw/main_loop.h:36
#5  glfwRunMainLoop (
    callback=0x7fffee7630f0 <process_global_state>,
    data=0x7fffebf10870)
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/glfw/init.c:328
#6  0x00007fffee761dca in run_main_loop (
    cb=0x7fffee7630f0 <process_global_state>,
    cb_data=0x7fffebf10870)
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/kitty/glfw.c:1143
#7  main_loop (self=0x7fffebf10870, a=<optimized out>)
    at /usr/src/debug/x11-terms/kitty-9999/kitty-9999/kitty/child-monitor.c:937

disruptek commented 5 years ago

Looking at the code, it seems most likely that the event dispatch is failing, but since we ignore the return value from processing the queue, we just loop infinitely at the top of handleEvents(). We should probably act on an error there in any event, so to speak.

(edit) for that matter, we should check the return value of the prepare_read call; a -1 is probably not cause to keep looping.
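
For context, this is roughly the canonical Wayland read/dispatch pattern with the error checks being discussed added in; the function name and structure are a sketch, not the actual glfw/wl_window.c code:

#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <wayland-client.h>

/* Sketch: wait for and dispatch Wayland events, bailing out instead of
 * spinning if dispatching the already-queued events keeps failing. */
static int wait_for_events(struct wl_display *display, int timeout_ms) {
    /* wl_display_prepare_read() returns -1/EAGAIN until the queue is
     * empty, so dispatch whatever is already queued first. */
    while (wl_display_prepare_read(display) != 0) {
        if (wl_display_dispatch_pending(display) == -1) {
            /* A dispatch failure here is fatal for the connection;
             * looping again would just burn CPU. */
            fprintf(stderr, "dispatch_pending failed: %s\n", strerror(errno));
            return -1;
        }
    }
    wl_display_flush(display);

    struct pollfd pfd = { .fd = wl_display_get_fd(display), .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) > 0)
        wl_display_read_events(display);
    else
        wl_display_cancel_read(display);

    return wl_display_dispatch_pending(display);
}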

kovidgoyal commented 5 years ago

I cannot reproduce with sway-1.0, kitty HEAD and mesa 18.3.4 on Intel graphics. I don't see any problems with the top of handleEvents(). According to the Wayland docs, wl_display_prepare_read will return -1 iff errno is set to EAGAIN; otherwise it will return 0. There is even an example of recommended usage that does what handleEvents() does.

https://people.freedesktop.org/~whot/wayland-doxygen/wayland/Client/classwl__display.html#a40039c1169b153269a3dc0796a54ddb0

disruptek commented 5 years ago

You're right, of course, but here's the thing -- unless we're talking about the one infinite loop in Cupertino, no matter where they are or why, infinite loops basically turn your software into a space heater. That may be fine in March, but summer is coming, and I hate the way my legs look in shorts.

I probably should have searched for the Wayland API docs instead of just reading the code, so thanks for that link. Did you notice the following line preceding the code example?

A correct usage would be (with most error checking left out):

We are getting an error result inside the loop, but we do nothing to check the result and/or act on it. I understand the intention of looping on EAGAIN, but do you understand how badly I want to wear pants?

Can we compromise somehow? For the sake of pants? Pants, man. THINK OF THE PANTS!

kovidgoyal commented 5 years ago

Umm, there is no other return value or errno to check for. Go read the source code of wl_display_prepare_read_queue: https://github.com/wayland-project/wayland/blob/master/src/wayland-client.c#L1645

Once again, that function only ever sets errno to EAGAIN if and only if it also returns -1.

disruptek commented 5 years ago

There are two possible return values, right? -1 and 0. Hence the existence of this loop; otherwise, it would be by design infinite. But merely by capping the number of times we call this loop, we may stave off shorts. And believe me, we both want that.

With regard to the body of the loop, have you verified that the only possible return value is -1? I haven't, but it seems a damned shame to waste a big, beautiful -- and they ARE beautiful, aren't they, fellas? -- int on a series of functions that may as well be typed a mincing and homely void.

martinetd commented 5 years ago

@disruptek instead of arguing for the sake of arguing, try looking at what wl_display_dispatch_pending() returns when you get that busy loop, and log errors or something.

I agree that if that function fails we probably should break out of the loop (the way the prepare function works, it'll just never return 0 if nothing consumes events because of some error), but without knowing what error you're getting we can't really tell if it's a normal error or what happened. (As a second step, some logs with WAYLAND_DEBUG=client set might also be helpful if it happens quickly enough, but first please confirm the problem.)
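
For anyone who can reproduce the busy loop and wants to capture exactly that, a tiny logging wrapper might look like this; it is illustrative only, not part of the kitty or glfw sources:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <wayland-client.h>

/* Illustrative wrapper: call this in place of a bare
 * wl_display_dispatch_pending() to log why it failed. */
static int dispatch_pending_logged(struct wl_display *display) {
    int ret = wl_display_dispatch_pending(display);
    if (ret < 0)
        fprintf(stderr, "dispatch_pending -> %d, errno=%d (%s), display error=%d\n",
                ret, errno, strerror(errno), wl_display_get_error(display));
    return ret;
}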

disruptek commented 5 years ago

I did try client logging and I did try looking through the code. I get the sense that it returns -1 in the case of an error, but that error may be the result of many (30+?) other function calls. I haven't looked at Wayland/X code in years, while there are many here who can probably tell exactly what's going on at a glance.

In any event, it doesn't even matter what's going on. The fact is that it's effectively an infinite loop because we are too lazy to ever, ever, leave a loop. It's mere icing that we don't test the return value of a function which is clearly failing. I care enough about the environment that I make a point of closing all my terminal windows whenever I leave my console.

I'm not sure what a "normal" error is, but if an infinite loop is a "normal" error, well, then I feel like something has been lost. A piece of my soul. The children I never knew I wanted. Amanda from 9th grade. Now, she, I knew I wanted.

I will log some return values, but can we at least agree that infinite loops are bad? I feel like that point was glossed over a bit.

PS. How many megabytes of -1 would you like? They come in all sizes.

kovidgoyal commented 5 years ago

I don't see how breaking the loop if wl_display_dispatch_pending() fails will do any good. The only errors that could prevent dispatch of already read events are fatal in Wayland; they mean the display can no longer be used. So breaking out of the loop would just mean the main loop will tick and we will be back to square one: instead of looping in handleEvents, it would loop in the main loop. But sure, I can have the loop check for a display error and forcibly close the window if one occurs. So instead of an infinite loop you get an application quit. That won't actually solve whatever error you are having, though.
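
The shape of that check, roughly; close_window stands in for whatever callback the application uses to tear down its window and is hypothetical:

#include <wayland-client.h>

/* Sketch of the proposed behaviour: after a failed dispatch, check whether
 * the display connection itself is dead and, if so, ask the caller to close
 * the window rather than keep retrying forever. */
static void handle_dispatch_failure(struct wl_display *display,
                                    void (*close_window)(void)) {
    if (wl_display_get_error(display) != 0) {
        /* The connection is unusable; no amount of retrying will help. */
        close_window();
    }
}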

disruptek commented 5 years ago

That would be awesome. All I wanted. Thanks again.

kovidgoyal commented 5 years ago

Done.

disruptek commented 5 years ago

Works great, and it should solve the problems of many in the thread. FWIW, I found some kernel logs that shed some light:

Mar 22 12:04:09 olive kernel: snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD data byte 7
Mar 22 12:04:10 olive kernel: nouveau 0000:01:00.0: gr: DATA_ERROR 0000000f [VIEWPORT_ID_NEEDS_GP] ch 3 [017f574000 kitty[24858]] subc 0 class c197 mthd 19d0 data 0000003c
Mar 22 12:04:10 olive kernel: nouveau 0000:01:00.0: gr: DATA_ERROR 000000ac [] ch 3 [017f574000 kitty[24858]] subc 0 class c197 mthd 0d78 data 00000004

So I'm thinking the issue is something to do with a sleep/wake cycle in the nvidia card.

Arnaudv6 commented 5 years ago

It's been a while (>6 months) since I hit this bug, but I hit it with an Intel GPU.

disruptek commented 4 years ago

FWIW, I haven't had kitty terminate upon suspend/resume since switching to the 0.14.2 release of 2019-06-09. Though my nouveau driver hadn't changed, I did update mesa to 19.1.0 on 2019-06-12. I'm now on a linux-5.1 branch with mesa 19.1.1, and kitty 0.14.2 remains stable.

Having your terminals continue to run even after you suspend/resume is one of those "features" that you really take for granted until it's gone. My colleagues feel similarly about my pants, so we all thank you!

If you don't care about tracking the open glfw bug here, then I think this issue could be closed. It conflates what may be 3-4 different issues on different OSes, none of which seem to remain with the current version; we would want fresh error reporting in any event.

kovidgoyal commented 4 years ago

Yeah, I agree, this issue is a bit too unfocused.