Open firecat53 opened 8 years ago
It might be possible that matrix.lua is the cause of this problem, but that's not normal behaviour, and not something I can easily reproduce on my end. It might also be other scripts interfering, or other system problems such as DNS/network issues. The default polling timeout for the script is 30 seconds.
One way you can investigate more of what matrix and other scripts are doing with hooks is to run the command /debug set core 2. This will print every call to HTTP and any other hook from any script. Maybe you can use that and look in the core buffer to see what causes the behaviour you are seeing.
The polling already runs in the background using WeeChat's API mechanisms for doing HTTP requests, but since WeeChat is single-threaded, anything that uses a lot of CPU can cause soft lockups.
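For context, WeeChat scripts do non-blocking HTTP through the hook_process API: the request runs in a child process and a callback fires with the output, so only the callback itself (e.g. parsing a huge /sync response) runs on the main thread. Here is a minimal sketch of that callback pattern; the "url:" command syntax and the callback signature follow WeeChat's scripting API, but the stub weechat module and the poll_cb body are illustrative only:

```python
# Sketch of WeeChat's asynchronous HTTP pattern. The real API runs the
# request in a child process (the fork/exit_group visible in the strace
# below); only the callback runs on WeeChat's single main thread.

class FakeWeechat:
    """Stand-in for the real 'weechat' module so the sketch runs standalone."""
    WEECHAT_HOOK_PROCESS_ERROR = -1

    def hook_process(self, command, timeout_ms, callback_name, data):
        # The real API forks and calls back later; here we invoke the
        # callback immediately with a canned /sync-style response.
        cb = globals()[callback_name]
        return cb(data, command, 0, '{"next_batch": "s2257165"}', "")

weechat = FakeWeechat()
results = []

def poll_cb(data, command, return_code, out, err):
    # This part runs on the main thread: heavy JSON parsing here is what
    # can stall the UI even though the HTTP transfer itself was async.
    if return_code == weechat.WEECHAT_HOOK_PROCESS_ERROR:
        return 0
    results.append(out)
    return 0

weechat.hook_process("url:https://matrix.org/_matrix/client/r0/sync",
                     30 * 1000, "poll_cb", "")
```

The point is that "background HTTP" only covers the transfer; any CPU-heavy work in the callback still blocks typing.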
Hope this helps.
Looks like it's a function of the large number of people in the main #matrix:matrix.org room.
I turned on debug both for weechat and for the matrix script but didn't really see anything interesting in either output that corresponded with the typing lag.
Thus my theory that it has to do with the large number of participants in the room. So you can probably close this unless there's a way to work around that :)
Thanks, Scott
Thanks for researching this. I think the nicklist issues you are seeing might be slowness in WeeChat itself, as matrix.lua is just sending nicks to add to WeeChat. And the way typing notifications and presence work is by committing changes to the nicklist using the WeeChat API, so it might be some slowness in that API when the nicklist is huge.
Personally I don't keep the nicklist enabled all the time; I just toggle it on and off with a keybinding (/key bind meta-n /bar toggle nicklist). You could also make a bar condition or trigger that disables showing the nicklist for huge rooms.
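As an untested sketch of the bar-condition idea: WeeChat's bar "conditions" option accepts an evaluated expression, and assuming ${buffer.nicklist_nicks_count} is available in that context, something like this would hide the nicklist for rooms over a chosen size:

```
/set weechat.bar.nicklist.conditions "${nicklist} && ${buffer.nicklist_nicks_count} < 500"
```

The threshold of 500 is arbitrary; pick whatever your terminal handles comfortably.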
@flashcode might be interested in this slowness issue, if we could find out exactly what is causing it.
I've been investigating this recently; not sure if it's the same issue, but it seems similar enough. In my case, WeeChat begins pegging a CPU core, and eventually a WeeChat child process terminates (you can see it as defunct in ps output). The WeeChat UI freezes when this happens.
Will grab the output I was seeing when I get back to the machine I was testing it on, but do you know of any logs I could pull to help diagnose this?
EDIT: Oh wait, debug logging was mentioned earlier in this issue. I'll give that a try.
Didn't see anything particularly notable in the weechat debug logs, but here's the output from strace just before it freezes:
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] write(10, "\25\3\3\0\32(l\323\353\27\253\305\200\274S\3359\322\313\371\37\304nS\341\27\302\350\243\350t", 31) = 31
[pid 30103] close(10) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] write(1, "{\"next_batch\":\"s2257165_6884504_"..., 290) = 290
[pid 30103] exit_group(0) = ?
[pid 30103] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30103, si_uid=1000, si_status=0, si_utime=2, si_stime=2} ---
fcntl(6, F_GETFD) = 0
fcntl(0, F_GETFD) = 0
fcntl(10, F_GETFD) = 0
fcntl(13, F_GETFD) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 4, 99) = 2 ([{fd=10, revents=POLLIN|POLLHUP}, {fd=13, revents=POLLHUP}])
read(10, "HTTP/1.1 200 OK\r\nDate: Thu, 13 O"..., 8191) = 616
read(13, "", 8191) = 0
fcntl(6, F_GETFD) = 0
fcntl(0, F_GETFD) = 0
fcntl(10, F_GETFD) = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=10, events=POLLIN}], 3, 98) = 1 ([{fd=10, revents=POLLHUP}])
read(10, "", 8191) = 0
fcntl(6, F_GETFD) = 0
fcntl(0, F_GETFD) = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}], 2, 97) = 0 (Timeout)
wait4(30103, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 30103
pipe([11, 12]) = 0
pipe([14, 15]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f08415baa10) = 30106
close(12) = 0
close(15) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
kill(30103, SIGKILL) = -1 ESRCH (No such process)
wait4(30103, NULL, 0, NULL) = -1 ECHILD (No child processes)
close(10) = 0
close(13) = 0
rt_sigaction(SIGTSTP, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, {0x7f0840940300, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, 8) = 0
poll([{fd=0, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=0, events=POLLIN}], 1, 0) = 0 (Timeout)
rt_sigaction(SIGTSTP, {0x7f0840940300, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, NULL, 8) = 0
Note that it hangs here for about 10 seconds before continuing:
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30103, si_uid=1000, si_status=0, si_utime=2, si_stime=2} ---
...so the parent process doesn't call wait4() for quite some time, leaving the child as a zombie.
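The zombie window described here is easy to reproduce outside WeeChat: a child that has exited stays in the process table as a zombie ("defunct" in ps, state Z in /proc) until the parent calls wait4() (waitpid in Python). A small Linux-only sketch, mirroring the fork/exit_group/wait4 sequence in the strace:

```python
import os
import time

pid = os.fork()
if pid == 0:
    # Child: exit immediately, like the HTTP helper process in the trace.
    os._exit(0)

# Parent: until we reap the child it remains a zombie. Give it a moment
# to exit, then read its state from /proc/<pid>/stat (field after the
# parenthesized comm name).
time.sleep(0.1)
with open("/proc/%d/stat" % pid) as f:
    state = f.read().rsplit(")", 1)[1].split()[0]

# This is the wait4() call from the strace; only now is the zombie reaped.
reaped_pid, status = os.waitpid(pid, 0)
```

If the parent stalls before waitpid (as in the 10-second hang above), the zombie simply lingers; it costs a process-table slot but no CPU.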
I only experience this when logging in with an account that is in a lot of very large rooms. On another account, it looks similar, but wait4() is called immediately after the child process exits:
[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=12, events=POLLIN}, {fd=14, events=POLLIN}], 4, 36 <unfinished ...>
[pid 29600] <... write resumed> ) = 386
[pid 29600] exit_group(0) = ?
[pid 29353] <... poll resumed> ) = 1 ([{fd=14, revents=POLLHUP}])
[pid 29353] read(14, "", 8191) = 0
[pid 29353] fcntl(6, F_GETFD) = 0
[pid 29353] fcntl(0, F_GETFD) = 0
[pid 29353] fcntl(12, F_GETFD) = 0
[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=12, events=POLLIN}], 3, 22) = 1 ([{fd=12, revents=POLLHUP}])
[pid 29353] read(12, "", 8191) = 0
[pid 29353] fcntl(6, F_GETFD) = 0
[pid 29353] fcntl(0, F_GETFD) = 0
[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}], 2, 22 <unfinished ...>
[pid 29600] +++ exited with 0 +++
<... poll resumed> ) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29600, si_uid=1000, si_status=0, si_utime=6, si_stime=2} ---
restart_syscall(<... resuming interrupted poll ...>) = 0
wait4(29600, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 29600
pipe([13, 15]) = 0
pipe([16, 17]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f08415baa10) = 29602
close(15) = 0
close(17) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
kill(29600, SIGKILL) = -1 ESRCH (No such process)
wait4(29600, NULL, 0, NULL) = -1 ECHILD (No child processes)
close(12) = 0
I'm also seeing the lag issue, and looking at top(1) showed WeeChat periodically keeping CPU usage above 90%. Unloading the matrix script fixed it.
Noticing it too. High CPU load and random WeeChat freezes with matrix connected and synced (just loading matrix.lua doesn't cause problems).
The Matrix protocol makes an extreme number of HTTP requests, especially for large rooms. WeeChat was not designed to handle this as smoothly as it could be.
I have introduced a new setting:
/set plugins.var.lua.matrix.presence_filter on
This setting will prevent Synapse from sending presence events and ephemeral events like typing notifications and read receipts. This will greatly limit the number of HTTP requests. Please try it and see if it helps.
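For reference, this kind of setting maps onto the Matrix client-server filtering API: a filter passed to /sync can exclude presence and ephemeral room events server-side. A sketch of the sort of filter JSON involved (the exact filter matrix.lua sends may differ):

```python
import json
import urllib.parse

# Filter excluding presence events and ephemeral room events
# (m.typing / m.receipt), per the Matrix client-server filtering API.
sync_filter = {
    "presence": {"not_types": ["m.presence"]},
    "room": {"ephemeral": {"not_types": ["m.typing", "m.receipt"]}},
}

# A filter can be passed inline as the 'filter' query parameter on /sync.
query = urllib.parse.urlencode({"filter": json.dumps(sync_filter)})
url = "https://matrix.org/_matrix/client/r0/sync?" + query
```

With typing and receipt events filtered out, /sync long-polls return far less often in busy rooms, which is why the setting reduces the request rate.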
The plan for the future is to use a (web)socket transport instead of regular HTTP, which WeeChat supports better. But there is no official transport like that yet, only a proof-of-concept proxy.
I pulled the changes, started WeeChat, and ran /script load matrix.lua. The UI still freezes periodically after:
matrix: Connecting to homeserver URL: https://matrix.org/
Executing /set plugins.var.lua.matrix.presence_filter on doesn't change things; I still cannot type smoothly, and it looks like it just never connects (or never gives any notification of connection status).
This happens as of commit baab3b7748934ba6bbb4a8c84dd0536fb19a0ae5
BTW, it looks like plugins.var.lua.matrix.typing_notices is forcibly set to on when loading matrix.lua; can that be?
Can confirm, this doesn't appear to have fixed the issue for me either. It is a bit better, but I still get UI freezes every few seconds.
I can confirm this as well using 1.4 as packaged in Ubuntu 16.04, even when setting the new plugins.var.lua.matrix.presence_filter to on.
EDIT: I restarted Weechat and it now seems OK. Maybe it was caused by some first time fetching?
I did /set plugins.var.lua.matrix.presence_filter on and /set plugins.var.lua.matrix.typing_notices off, which didn't seem to help. Then I did /lua reload matrix, and now the delays seem gone (I still see an occasional delay, probably because of the other requests still happening?). It seems a reload (or WeeChat restart) is needed for these settings to take effect?
It seems a reload (or weechat restart) is needed to let these settings take effect?
Exactly, the majority of settings are only loaded at script startup.
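WeeChat does offer a way for scripts to pick up option changes without a reload, via hook_config. The hook name and the "plugins.var.lua.matrix.*" mask below are from WeeChat's scripting API; everything else is a stubbed, illustrative sketch of the startup-read pattern plus a live-update hook:

```python
# Why a reload is needed: the script reads its options once at startup.
# WeeChat's hook_config API lets a script react to changes live instead.

options = {"plugins.var.lua.matrix.presence_filter": "off"}
hooks = []
settings = {}

def hook_config(mask, callback, data):
    # Real API: weechat.hook_config("plugins.var.lua.matrix.*", "config_cb", "")
    hooks.append((mask, callback))

def config_cb(data, option, value):
    # Re-read the changed option instead of waiting for a script reload.
    settings[option.rsplit(".", 1)[-1]] = value
    return 0

def set_option(option, value):
    # Stand-in for the /set command: stores the value and fires hooks.
    options[option] = value
    for mask, cb in hooks:
        if option.startswith(mask.rstrip("*")):
            cb("", option, value)

# Startup read (what the script does today) plus a live-update hook.
settings["presence_filter"] = options["plugins.var.lua.matrix.presence_filter"]
hook_config("plugins.var.lua.matrix.*", config_cb, "")

set_option("plugins.var.lua.matrix.presence_filter", "on")
```

With a hook like this wired up, /set would take effect immediately and no /lua reload matrix would be needed.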
So I just dusted off the plugin, fixed some issues that meant I couldn't connect (https://github.com/torhve/weechat-matrix-protocol-script/pull/105) and tried it on WeeChat 2.0. This bug seems worse than ever and makes it entirely unusable, with WeeChat pausing for ~20s every ~30s or so at 100% CPU, locking the whole UI. I'm now trying to restart with presence_filter on and typing_notices off, but I assume that whenever /sync returns due to some background room receiving traffic, I'm going to see a similar delay.
What I don't understand is why WeeChat chews 100% CPU whilst this is happening, if it's just the threading model of the HTTP library being stupid. Also, critically, it used to be fine back in 2015, when I used WeeChat as my default client. Then some WeeChat upgrade (perhaps 1.3 to 1.4?) broke it all. I do not think it was correlated with my account getting particularly large (although I may have exceeded some threshold).
@torhve can you remind us why you think the number of HTTP reqs causes this behaviour?
(I've now followed https://github.com/torhve/weechat-matrix-protocol-script/issues/30#issuecomment-348124061 and disabled presence and typing notifications, and have yet to see any freezes.)
@ara4n this actually looks like the stalls are an issue in Synapse, much like that riot-web search issue you and I were debugging earlier this week. I've got two terminal windows open next to this browser: one with a mosh session watching htop on my homeserver, and the other running WeeChat with this script locally. The 100%-CPU lockups while stalling seem to be WeeChat-related, but the stalls themselves match 1:1 with usage spikes on my homeserver, both the initial-load stalls and the activity-related stalls (typing notifications, room loads, etc.). I'm the only human user on my homeserver for now, so the only reason it ever spikes is when I'm doing something odd on a client.
I can confirm the stalls go away with the setting tweaks mentioned in https://github.com/torhve/weechat-matrix-protocol-script/issues/30#issuecomment-348124061 .
hm. I think the fact that Synapse's CPU spikes coincide with WeeChat stalling is a symptom rather than a cause. If WeeChat ends up tight-looping whilst it waits for an HTTP response from the server for some reason, then you'd expect the server CPU to shoot up whilst it calculates the response (which will inevitably hit 100% as it works out a particularly big, time-consuming response). However, the fact that the client freezes is WeeChat managing to block the UI whilst waiting for an HTTP request to complete, rather than it actually being Synapse's fault that it's having to think hard about the response.
There is a short pause every 10 seconds or so when typing in any WeeChat buffer while the matrix.lua script is active. I'm guessing it's a polling interval? Is there any way to background that poll so it doesn't affect the user's typing?
Hope I explained that right! Thanks,
Scott
Edit: sorry... WeeChat version 1.5 (Arch Linux)