oliversturm opened this issue 3 years ago.
06:22:30.208 [INFO] [wlr] [wayland] error marshalling arguments for keymap: dup failed: Too many open files
This indicates that something is exhausting the maximum number of file descriptors. The modesets failing on amdgpu are also worrying, but shouldn't cause your applications to close. The FD exhaustion is probably what causes your apps to close.
Okay, thanks - I'll look into that. However, I'm confused - these errors appear in the middle of that large block of output that is suddenly triggered out of the blue, and which involves disabling and re-enabling outputs and input devices, rearranging workspaces and other stuff. So all this stuff happens on purpose, for a reason, while the machine is just sitting there? That seems unexpected to me.
So...
✦ ➜ sysctl fs.file-nr
fs.file-nr = 22400 0 9223372036854775807
✦ ➜ sysctl fs.file-max
fs.file-max = 9223372036854775807
✦ ➜ ulimit -Hn
524288
✦ ➜ ulimit -Sn
1024
I'm not sure what's up with that enormous number for file-max. Is that normal? I don't set this myself, and when I did in the past, I don't remember seeing numbers this large. Everything else looks quite normal to me, unless we're running into the per-process soft limit here? I might try changing this.
I found a rather old Xwayland issue that resulted in a similar error message. But I have no reason to suspect that this bug somehow still exists.
I'm not sure what's up with that enormous number for file-max. Is that normal?
Systemd bumped the default for this a while back: http://0pointer.net/blog/file-descriptor-limits.html
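The wording of the log line helps narrow down which limit is involved: on Linux with glibc, "Too many open files" is the message for EMFILE, which is the per-process RLIMIT_NOFILE limit, while hitting the system-wide fs.file-max ceiling produces ENFILE, "Too many open files in system". So the huge fs.file-max value is unlikely to be the limit being hit. As a minimal sketch, the system-wide counters can be read like this (the function name system_file_handles is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: show system-wide file handle usage (the ENFILE budget).
# /proc/sys/fs/file-nr holds three fields: allocated handles, free handles
# (always 0 since Linux 2.6) and the maximum, i.e. fs.file-max.
system_file_handles() {
    read -r allocated _free max < /proc/sys/fs/file-nr
    printf 'allocated=%s max=%s\n' "$allocated" "$max"
}
```

With the fs.file-nr values shown above, this would print allocated=22400 max=9223372036854775807. That is nowhere near the system-wide ceiling, which is consistent with the per-process soft limit of 1024 being the one that runs out.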
Update: I tried raising the soft limit to 100k or so, but this made no difference and the same issue happened again. Then I updated Sway on Saturday, so sway -v now says sway version 1.6-3bf99198 (Jun 5 2021, branch 'master'), and since then I haven't had any trouble. I'll keep the issue open a little while in case it comes back, but it's possible that this problem has been fixed recently.
I have been experiencing exactly the same problem pretty much since I started using sway sometime last year. It also happens with the sway 1.6 update on my box.
Would it help to provide a debug log from my side as well? Or any other information which might be helpful?
When I come back the next morning, usually 10-12 hours later, everything is good - nothing crashed. This to me is very strange - I would expect the same thing to happen again, if it was related (perhaps) to the monitors being switched off, or something similar.
For me, it happens super randomly. Sometimes everything is fine for days and sometimes it happens quite soon after a reboot. However, I only notice it after the screen was locked or an attempt was made to lock it. (Today it happened again, but it seemed like the auto lock was not properly triggered. The screens were blanked (dpms off), but there was no lock screen. Pressing a key immediately turned the screens on and brought me back to the (broken) desktop: almost all windows had disappeared, the Alacritty terminals were still there but not responding to user input, and waybar was unresponsive.)
Every window opened from then on behaves as expected. Killing and restarting waybar also more or less brings the system back to a working state.
Other observations: most windows disappeared, including ALL Xwayland clients. However, xdg_shell clients were also killed (Firefox, Thunderbird, Nemo, etc.). Some windows seem to survive, e.g. MellowPlayer and Alacritty, BUT they do not respond to user input, nor do they update their visual state. At least MellowPlayer still seems to be active (playing music, controllable via remote).
It would be super nice to figure out and fix this problem. It is highly annoying. I am scared to lock my screen while working, but leaving everything on draws a lot more power. :( Fixing this would greatly improve my sway experience... Not sure if that helps at all. Just my observations.
I had the same issue (at least I think so), but after removing the dpms-related lines from my swaylock exec, it no longer happens.
I'm leaving the issue open since others have the same problem now. However, unfortunately I can't contribute more at this point because I don't see the problem anymore. I see that my last comment is from 16 days ago, and my machine now has an uptime of 15 days and almost 20 hours... somehow this problem was apparently fixed for me with the update I installed on June 7th. Hope it doesn't come back.
@emersion I'm not sure I agree with the subject change you applied. It was an interesting idea to focus on the message relating to open files, but as far as I can tell this did not turn out to be the actual cause of the problem.
I still think this is the cause of the issue. We're probably leaking FDs somewhere at some point, or a client causes Sway to keep too many FDs opened.
I raised my ulimits:
[le@p5750]: ~>$ ulimit -n -S
10240
[le@p5750]: ~>$ ulimit -n -H
5242880
Since then, the problem did not occur anymore. However, I also think it is too early to say it is fixed. It sometimes happens very rarely.
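Limits are inherited from the parent process, so raising them only helps sway if it happens in the process that launches sway. A minimal sketch for the case where sway is started from a login shell on a TTY (an assumption; a display manager or a systemd unit sets limits its own way). The function name raise_nofile_soft_limit is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: raise this shell's soft open-files limit (RLIMIT_NOFILE)
# up to its hard limit. An unprivileged process may raise its soft limit up to
# the hard limit but not beyond; processes started afterwards inherit it.
raise_nofile_soft_limit() {
    hard=$(ulimit -Hn)
    ulimit -Sn "$hard"
}
```

For example, this could be called just before exec sway in a login script. Every client launched from sway then inherits the raised soft limit as well; the systemd blog post linked earlier in the thread argues for keeping the default soft limit at 1024 because some programs still use select().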
Regarding "We're probably leaking FDs somewhere at some point, or a client causes Sway to keep too many FDs opened": is there any way to track it down easily? What can I do at the moment everything crashes?
If someone can still reproduce this, running the following command while Sway is out of FDs could help figure out what kind of FDs cause the error:
ls -l /proc/<pid of sway>/fd
But since killing clients apparently helps, not sure it's possible to get a useful trace.
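A long listing is easier to read when the targets are grouped by kind (sockets, pipes, DRM nodes, shared memory and so on) and counted. A sketch, assuming it runs as the same user as sway; fd_summary is a hypothetical name and the categories are just the ones that seem relevant to this thread:

```shell
#!/bin/sh
# Hypothetical helper: count a process's open FDs grouped by what they point to.
# readlink on /proc/<pid>/fd/<n> yields e.g. "socket:[1234]", "pipe:[5678]",
# "anon_inode:[eventfd]", "/dev/dri/card0" or "/memfd:... (deleted)".
fd_summary() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # The FD may be closed between listing and reading; skip it if so.
        target=$(readlink "$fd") || continue
        case $target in
            socket:*) echo socket ;;
            pipe:*) echo pipe ;;
            anon_inode:*) echo "$target" ;;
            /dev/dri/*) echo "$target" ;;
            /dev/shm/*) echo /dev/shm ;;
            /memfd:*) echo memfd ;;
            *) echo other ;;
        esac
    done | sort | uniq -c | sort -rn
}
```

Usage would be something like fd_summary "$(pidof sway)". DRM nodes are kept as full paths, so several FDs for /dev/dri/card0 show up as one counted line.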
It seems like the clients are being killed or kill themselves. At least as soon as I am back in control of the computer, there are almost no more applications running. :(
Yes, Sway kills clients if it runs out of FDs.
Out of curiosity I just had a look in /proc/<pid>/fd, and I'm reminded of the "card0" reference in my original log file, because I notice that there are 11 different FDs for /dev/dri/card0. Perhaps this is normal, but I thought it couldn't hurt to mention it. I'll check this again if I see the same problem.
These are probably used by Mesa, nothing to worry about.
Yes, Sway kills clients if it runs out of FDs.
Maybe we can add an extra debug log before / after killing to track what was killed and who had all those FDs open?
Yeah. Not that simple because libwayland does the killing internally.
I've had exactly the same problem as https://github.com/swaywm/sway/issues/6310#issuecomment-863216985 for a couple of months, except that I don't use swaylock, i.e. I only have the dpms commands:
exec swayidle -w \
timeout 900 'swaymsg "output * dpms off"' \
resume 'swaymsg "output * dpms on"'
For now I have raised my ulimits too, and I'm keeping an eye on it.
Update: 1 month without issues!
It's been stable for me since about a month or so.
I've been experiencing similar, seemingly random client terminations over the last week.
I did see the file descriptor errors in my logs, but these have disappeared since I added this to both my /etc/systemd/system.conf and /etc/systemd/user.conf
[Manager]
DefaultLimitNOFILE=2048:1048576
However, I'm still seeing terminations; I suppose they weren't directly caused by the file descriptor limits after all. I'm also still seeing the same errors as in the original report:
In log files from the Firefox runs, I can see that these processes died with these messages:
Gdk-Message: 19:29:31.588: Error reading events from display: Connection reset by peer
and
Exiting due to channel error.
Or similar errors:
firefox-bin: Error reading events from display: Broken pipe
sway: Exiting due to channel error
Interestingly enough, I don't get this on my gaming PC (AMD GPU) but do get this a few times a day on my work laptop (Dell Precision 5550, Intel GPU) but this could be related to the different workloads
I am on intel as well if that matters. Hybrid laptop with Nvidia, but that's probably unrelated.
Please try the patch from this response and report back: https://github.com/swaywm/sway/issues/5757#issuecomment-882028355
Weird, despite having bumped my file descriptor limits per https://github.com/swaywm/sway/issues/6310#issuecomment-880200606 I did just now have this happen again
I had Firefox, Zoom, Discord, Slack open (on my Intel machine), was in a Zoom meeting, and suddenly everything except Firefox disappeared
Found module libpng16.so.16 with build-id: 2dc0bce07f199bf983c07a05fb95a6f4af83a9b3
Found module libbz2.so.1.0 with build-id: 919597c477c9b2cb9cdbb7745ed6494ac0e6da60
Found module libGLX.so.0 with build-id: 46506ec29217449396b9ca7fb3fcb434587a1325
Found module libGLdispatch.so.0 with build-id: 71de43b4607206a0e6d5d3e2d8e9b61dfcf23770
Found module libgssapi_krb5.so.2 with build-id: 9be9d3348399b72b76161a64e6d9fd760b77163a
Found module libexpat.so.1 with build-id: 8850138eae6d9d4d43c5c4b2ac48393bc4279037
Found module libwayland-server.so.0 with build-id: 232648aebf61c9b61940ac383dbe27984deea6b2
Found module libpthread.so.0 with build-id: 07c8f95b4f3251d08550217ad8a1f31066229996
Found module libffi.so.7 with build-id: de60e99f39569d11d09160bbdcd486cedc87d2b6
Found module libfreetype.so.6 with build-id: 3131a701435f4d87afeab159b4aa57c4d151ffc3
Found module libfontenc.so.1 with build-id: 5a11f1fb8c3f2714be9eb6697318f20e301e1d2f
Found module libz.so.1 with build-id: 81bf6e728a6d6f5b105b0f8b25f6c614ce10452a
Found module ld-linux-x86-64.so.2 with build-id: 040cc3dd10461562f177df39e3be2f3704258c3c
Found module libc.so.6 with build-id: 4b406737057708c0e4c642345a703c47a61c73dc
Found module libGL.so.1 with build-id: c293b92f10cbc9574c45ff4e4c123fec01ab6b78
Found module libXau.so.6 with build-id: 1c67764663e07bec24d8951e5fd93f4d165979ff
Found module libtirpc.so.3 with build-id: 5bef2adfdee3df283f593b3e2d37b6dac405256a
Found module libnettle.so.8 with build-id: 9a878e513c02007598fcf1e2e286c2203f13536e
Found module libdl.so.2 with build-id: 5abc547e7b0949f89f3c0e21ab0c8331a7440a8a
Found module libxshmfence.so.1 with build-id: 8876d9ccf620858795724ca24b9e567585a77cec
Found module libm.so.6 with build-id: 2b8fd1f869ecab4e0b55e92f2f151897f6818acf
Found module libgbm.so.1 with build-id: adebd56990ed9194896090a8e52c2d0603ad7004
Found module libepoxy.so.0 with build-id: 90da22e0a8d12c6b90fb00d95a23cc657b599334
Found module libdrm.so.2 with build-id: 3aeff5403ca8d7589eabc05752eb613937f454a1
Found module libwayland-client.so.0 with build-id: 58038363d7ea1fd5e6532f6e5f90b1a3ce09388a
Found module libXfont2.so.2 with build-id: 5c679bc136a02ad17f90beac87befcb1f2470984
Found module libpixman-1.so.0 with build-id: 341f793dcada3a48a306a793d265a517e3f2e7d6
Found module Xwayland with build-id: a6f73b250105720e3d1f9aab91973aa51c211ecf
Stack trace of thread 2652:
#0 0x00007fecbcd98d22 raise (libc.so.6 + 0x3cd22)
#1 0x00007fecbcd82862 abort (libc.so.6 + 0x26862)
#2 0x000056216d69f74b n/a (Xwayland + 0x15c74b)
#3 0x000056216d69fafd n/a (Xwayland + 0x15cafd)
#4 0x000056216d57313d n/a (Xwayland + 0x3013d)
#5 0x00007fecbd2defda n/a (libwayland-client.so.0 + 0xafda)
#6 0x00007fecbd2da1f8 n/a (libwayland-client.so.0 + 0x61f8)
#7 0x00007fecbcc5eacd n/a (libffi.so.7 + 0x6acd)
#8 0x00007fecbcc5e03a n/a (libffi.so.7 + 0x603a)
#9 0x00007fecbd2ddfe4 n/a (libwayland-client.so.0 + 0x9fe4)
#10 0x00007fecbd2da563 n/a (libwayland-client.so.0 + 0x6563)
#11 0x00007fecbd2dbc7f wl_display_dispatch_queue_pending (libwayland-client.so.0 + 0x7c7f)
#12 0x000056216d57974d n/a (Xwayland + 0x3674d)
#13 0x000056216d69f141 n/a (Xwayland + 0x15c141)
#14 0x000056216d5df1e0 n/a (Xwayland + 0x9c1e0)
#15 0x000056216d570fea n/a (Xwayland + 0x2dfea)
#16 0x00007fecbcd83b25 __libc_start_main (libc.so.6 + 0x27b25)
#17 0x000056216d57278e n/a (Xwayland + 0x2f78e)
Stack trace of thread 2657:
#0 0x00007fecbcc4a8ca __futex_abstimed_wait_common64 (libpthread.so.0 + 0x158ca)
#1 0x00007fecbcc44270 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xf270)
#2 0x00007fecbab0852c n/a (iris_dri.so + 0x1c652c)
#3 0x00007fecbab02588 n/a (iris_dri.so + 0x1c0588)
#4 0x00007fecbcc3e259 start_thread (libpthread.so.0 + 0x9259)
#5 0x00007fecbce5a5e3 __clone (libc.so.6 + 0xfe5e3)
Stack trace of thread 2661:
#0 0x00007fecbcc4a8ca __futex_abstimed_wait_common64 (libpthread.so.0 + 0x158ca)
#1 0x00007fecbcc44270 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xf270)
#2 0x00007fecbab0852c n/a (iris_dri.so + 0x1c652c)
#3 0x00007fecbab02588 n/a (iris_dri.so + 0x1c0588)
#4 0x00007fecbcc3e259 start_thread (libpthread.so.0 + 0x9259)
#5 0x00007fecbce5a5e3 __clone (libc.so.6 + 0xfe5e3)
Stack trace of thread 2660:
#0 0x00007fecbcc4a8ca __futex_abstimed_wait_common64 (libpthread.so.0 + 0x158ca)
#1 0x00007fecbcc44270 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xf270)
Jul 21 13:02:43 tag-50853 sway[1327]: 03:31:07.520 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:43 tag-50853 sway[1327]: 03:31:07.521 [ERROR] [wlr] [xwayland/selection/incoming.c:466] convert selection failed
Jul 21 13:02:44 tag-50853 sway[1327]: 03:31:08.481 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:45 tag-50853 sway[1327]: 03:31:09.042 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:45 tag-50853 sway[1327]: 03:31:09.043 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:45 tag-50853 sway[1327]: 03:31:09.610 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:45 tag-50853 sway[1327]: 03:31:09.610 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:02:51 tag-50853 sway[1482]: 03:31:14.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:02:51 tag-50853 sway[1482]: 03:31:14.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:03 tag-50853 sway[1482]: 03:31:26.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:03 tag-50853 sway[1482]: 03:31:26.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:09 tag-50853 sway[1482]: 03:31:32.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:09 tag-50853 sway[1482]: 03:31:32.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:15 tag-50853 sway[1482]: 03:31:38.410 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:15 tag-50853 sway[1482]: 03:31:38.411 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:19 tag-50853 sway[1482]: 03:31:41.700 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:19 tag-50853 sway[1482]: 03:31:41.700 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:21 tag-50853 sway[1482]: 03:31:44.406 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:21 tag-50853 sway[1482]: 03:31:44.407 [ERROR] [swaybar/tray/item.c:127] :1.7/org/ayatana/NotificationItem/nm_applet IconPixmap: No such property “IconPixmap”
Jul 21 13:03:52 tag-50853 sway[1327]: 03:32:16.796 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:03:52 tag-50853 sway[1327]: 03:32:16.796 [ERROR] [wlr] [xwayland/selection/outgoing.c:285] pipe() failed: Too many open files
Jul 21 13:03:52 tag-50853 sway[1327]: 03:32:16.796 [ERROR] [wlr] [xwayland/selection/incoming.c:466] convert selection failed
Jul 21 13:03:52 tag-50853 sway[2652]: (EE)
Jul 21 13:03:52 tag-50853 sway[2652]: Fatal server error:
Jul 21 13:03:52 tag-50853 sway[2652]: (EE) wl_display@1: error 1: invalid arguments for zwp_linux_buffer_params_v1@40.add
Jul 21 13:03:52 tag-50853 sway[2652]: (EE)
Jul 21 13:03:52 tag-50853 audit[2652]: ANOM_ABEND auid=60224 uid=60224 gid=60224 ses=1 pid=2652 comm="Xwayland" exe="/usr/bin/Xwayland" sig=6 res=1
Jul 21 13:03:52 tag-50853 kernel: audit: type=1701 audit(1626836632.886:330): auid=60224 uid=60224 gid=60224 ses=1 pid=2652 comm="Xwayland" exe="/usr/bin/Xwayland" sig=6 res=1
Jul 21 13:03:52 tag-50853 systemd[1]: Created slice Slice /system/systemd-coredump.
sway.log from this session: https://gist.github.com/jokeyrhyme/d086e97c084c0ba658b0235a82fbbbc0
@jokeyrhyme you may also need to tune your /etc/security/limits.conf
@vvrein I did previously, per https://github.com/swaywm/sway/issues/6310#issuecomment-880200606, the numbers are double the systemd defaults
Note (per that message) that I still get similar symptoms even where there are no log messages indicating the file descriptor limit is being reached
I'll double the numbers again and see if that helps :shrug:
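Since the symptoms persist after raising the limits, it may be worth confirming which limit the running sway process actually received: edits to system.conf, user.conf or limits.conf only affect processes started after the change takes effect. A minimal sketch that reads /proc/<pid>/limits; nofile_limits is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print the soft and hard "Max open files" limits of a
# running process. The matching /proc/<pid>/limits line looks like:
# Max open files            1024                 524288               files
nofile_limits() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/$1/limits"
}
```

For example, nofile_limits "$(pidof sway)" should report the doubled values if the new configuration reached sway; if it still shows soft=1024, the change never applied to the session sway runs in.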
Started running the following for a while:
]$ max=0; while true; do ls -l /proc/$(pidof sway)/fd > /tmp/sway-fds.txt; cur=$(wc -l /tmp/sway-fds.txt | awk '{print $1}'); if [ $cur -gt $max ]; then max=$cur; date; echo "Max FDs opened increased to $max"; cp /tmp/sway-fds.txt /tmp/sway-fds-max.txt; fi; sleep 1; done
<snip>
Mon Sep 13 03:47:59 AM CDT 2021
Max FDs opened increased to 287
Tue Sep 14 02:03:42 AM CDT 2021
Max FDs opened increased to 724
List of file descriptors and Sway debug log (lines truncated to 120 chars): https://gist.github.com/lae/850e82a0a9354c0d31795b0307fcaa99
Sudden increase there, and I guess everything either crashed or became unusable immediately afterwards. swayidle must have activated there, since I came back to a frozen lock screen as well (and had to kill it externally). It does seem like it runs rampant with files in /dev/shm.
Does this help? I can try some other things if needed, if there's anything I can do soon. (I don't think raising ulimits is a long term solution, so hopefully we can resolve this.)
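A variation on the monitoring loop above, as a sketch. It counts directory entries instead of ls -l lines (ls -l prints a "total" header line, so the one-liner reports one more than the actual number of FDs), stops when the process exits, and also reports how many FDs point into /dev/shm. The function names and the /tmp output path are illustrative:

```shell
#!/bin/sh
# Hypothetical helpers for tracking the peak FD count of a process.

# Number of open FDs: "ls -A" prints one entry per line when piped and,
# unlike "ls -l", has no "total" header line.
count_fds() {
    ls -A "/proc/$1/fd" 2>/dev/null | wc -l
}

# Number of open FDs whose target lies under /dev/shm.
count_shm_fds() {
    for fd in /proc/"$1"/fd/*; do
        readlink "$fd"
    done 2>/dev/null | grep -c '^/dev/shm/'
}

# Poll once per second until the process exits; whenever the FD count reaches
# a new maximum, print it and save a full listing to /tmp/sway-fds-max.txt.
watch_fd_peak() {
    pid=$1
    max=0
    while kill -0 "$pid" 2>/dev/null; do
        cur=$(count_fds "$pid")
        if [ "$cur" -gt "$max" ]; then
            max=$cur
            echo "$(date): max FDs opened increased to $max ($(count_shm_fds "$pid") in /dev/shm)"
            ls -l "/proc/$pid/fd" > /tmp/sway-fds-max.txt
        fi
        sleep 1
    done
}
```

Usage would be watch_fd_peak "$(pidof sway)"; the saved listing then shows the FDs at the highest point reached before a crash.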
I don't think raising ulimits is a long term solution, so hopefully we can resolve this
It seems to be at least a long-term workaround. I haven't had a single hiccup in about 3 months now. I do agree, however, that the root cause should be fixed (if that hasn't happened already).
As a recent data point, I have wlroots 0.14.1-2 (Arch Linux) and I believe this just happened to me this morning: I was running Alacritty, Slack and Firefox, jumped into a Zoom call for a few minutes, and Firefox disappeared a few minutes later.
Definitely not complaining, but I was curious about the wlroots patch notes and (at first) thought there was a chance that work might have an impact here: https://github.com/swaywm/wlroots/releases/tag/0.14.1
Could this be the same as #6642? If so, a fix has been merged, maybe try latest master?
Sway Version:
swaymsg -t get_version says sway version 4.19.1 (2021-02-01)
sway -v says 1.6-85291411 (Apr 23 2021, branch 'master')
Debug Log: gist link
Configuration File: gist link
Description:
This has happened to me several times - there is a pattern here, though I'm not sure exactly what it is.
Gdk-Message: 19:29:31.588: Error reading events from display: Connection reset by peer
and
Exiting due to channel error.
Since my debug log ran for several hours, I stripped it down and annotated it - I basically just cut out the large part in the middle where I was working without trouble for a few hours.
I'm not sure what to try at this point, suggestions are welcome. I started seeing this problem a while ago, but recently it has happened every evening. I'll make a test tonight by leaving the monitors on, in case that makes a difference...