torhve / weechat-matrix-protocol-script

A WeeChat script in Lua that implements the matrix.org chat protocol

Weechat 'pauses' typing about every 10 seconds #30

Status: Open · firecat53 opened this issue 8 years ago

firecat53 commented 8 years ago

There is a short pause every 10 seconds or so when typing in any Weechat buffer while the matrix.lua script is active. I'm guessing it's a polling interval? Is there any way to background that poll so it doesn't affect the user's typing?

Hope I explained that right! Thanks,

Scott

Edit: sorry...Weechat version 1.5 (archlinux)

torhve commented 8 years ago

It's possible that matrix.lua is the cause of this problem, but it's not normal behaviour, and it's not something I can easily reproduce on my end. It might also be other scripts interfering, or other system problems like DNS or network issues. The script's default polling timeout is 30 seconds.

One way to investigate what matrix and other scripts are doing with hooks is to run the command /debug set core 2. This prints every HTTP call and every other hook call from any script. You can then watch the core buffer to see what causes the behaviour you are seeing.

The polling already runs in the background using WeeChat's API mechanism for HTTP requests, but since WeeChat is single-threaded, anything that uses CPU can cause soft lockups.

Hope this helps.
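For reference, the non-blocking polling described above goes through WeeChat's hook_process API. A minimal Lua sketch (the URL, callback name, and printed messages are illustrative, not the script's actual code):

function poll_cb(data, command, return_code, out, err)
    -- The callback runs on WeeChat's single main thread; parsing a
    -- large response here is what can stall the UI.
    if return_code == weechat.WEECHAT_HOOK_PROCESS_ERROR then
        weechat.print('', 'polling error: ' .. err)
    elseif return_code >= 0 then
        weechat.print('', 'poll finished, received ' .. #out .. ' bytes')
    end
    return weechat.WEECHAT_RC_OK
end

-- The HTTP request itself runs in a forked child process (curl), so
-- waiting for the response does not block the UI; 30000 ms matches
-- the 30-second default polling timeout mentioned above.
weechat.hook_process_hashtable(
    'url:https://matrix.org/_matrix/client/r0/sync',
    { timeout = '30000' }, 30000, 'poll_cb', '')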

firecat53 commented 8 years ago

Looks like it's a function of the large number of people in the main #matrix:matrix.org room.

I turned on debugging both for WeeChat and for the matrix script, but didn't see anything interesting in either output that corresponded with the typing lag.

  1. With all other scripts unloaded and buffers closed, I loaded the matrix.lua script.
  2. With debug enabled, I could see the very slow download of the participants list; it never finished (it kept starting over), and I never experienced the typing lag.
  3. With debug disabled, the list downloaded completely, but then I'd get the short lag every 10 seconds when typing.
  4. With a different room open (only 2 participants) and the #matrix:matrix.org room closed, I didn't see the lag at all, with debug enabled or disabled.

Thus my theory that it has to do with the large number of participants in the room. So you can probably close this unless there's a way to work around that :)

Thanks, Scott

torhve commented 8 years ago

Thanks for researching this. I think the nicklist issues you are seeing might be slowness in WeeChat itself, as matrix.lua just sends nicks for WeeChat to add. Typing notifications and presence work by committing changes to the nicklist through the WeeChat API, so there may be some slowness in that API when the nicklist is huge.

Personally I don't keep the nicklist enabled all the time; I just toggle it on and off with a keybinding ( /key bind meta-n /bar toggle nicklist ). You could also create a bar condition or trigger that disables the nicklist for huge rooms, as sketched below.

@flashcode might be interested in this slowness issue, if we could find out exactly what is causing it.
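To illustrate the kind of nicklist traffic involved, here is a rough Lua sketch of a presence update going through the WeeChat nicklist API (illustrative only, not the script's actual code):

local function update_presence(buffer, name, online)
    -- Every presence or typing update means a search + remove + add
    -- against the nicklist; with thousands of members per room, the
    -- cost of these API calls adds up quickly.
    local nick = weechat.nicklist_search_nick(buffer, '', name)
    if nick ~= '' then
        weechat.nicklist_remove_nick(buffer, nick)
    end
    local color = online and 'default' or 'weechat.color.nicklist_away'
    weechat.nicklist_add_nick(buffer, '', name, color, '', '', 1)
end

And a bar condition along these lines (untested, assuming the buffer hdata exposes nicklist_nicks_count; the threshold is arbitrary) could hide the nicklist only for huge rooms:

 /set weechat.bar.nicklist.conditions "${nicklist} && ${buffer.nicklist_nicks_count} < 500"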

jfrederickson commented 7 years ago

I've been investigating this recently; not sure if it's the same issue, but it seems similar enough. In my case, WeeChat starts pegging a CPU core, and eventually a WeeChat child process terminates (you can see it as defunct in ps output). The WeeChat UI freezes when this happens.

Will grab the output I was seeing when I get back to the machine I was testing it on, but do you know of any logs I could pull to help diagnose this?

EDIT: Oh wait, debug logging was mentioned earlier in this issue. I'll give that a try.

jfrederickson commented 7 years ago

Didn't see anything particularly notable in the weechat debug logs, but here's the output from strace just before it freezes:

[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] write(10, "\25\3\3\0\32(l\323\353\27\253\305\200\274S\3359\322\313\371\37\304nS\341\27\302\350\243\350t", 31) = 31
[pid 30103] close(10)                   = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f084070a080}, NULL, 8) = 0
[pid 30103] write(1, "{\"next_batch\":\"s2257165_6884504_"..., 290) = 290
[pid 30103] exit_group(0)               = ?
[pid 30103] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30103, si_uid=1000, si_status=0, si_utime=2, si_stime=2} ---
fcntl(6, F_GETFD)                       = 0
fcntl(0, F_GETFD)                       = 0
fcntl(10, F_GETFD)                      = 0
fcntl(13, F_GETFD)                      = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 4, 99) = 2 ([{fd=10, revents=POLLIN|POLLHUP}, {fd=13, revents=POLLHUP}])
read(10, "HTTP/1.1 200 OK\r\nDate: Thu, 13 O"..., 8191) = 616
read(13, "", 8191)                      = 0
fcntl(6, F_GETFD)                       = 0
fcntl(0, F_GETFD)                       = 0
fcntl(10, F_GETFD)                      = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=10, events=POLLIN}], 3, 98) = 1 ([{fd=10, revents=POLLHUP}])
read(10, "", 8191)                      = 0
fcntl(6, F_GETFD)                       = 0
fcntl(0, F_GETFD)                       = 0
poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}], 2, 97) = 0 (Timeout)
wait4(30103, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 30103
pipe([11, 12])                          = 0
pipe([14, 15])                          = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f08415baa10) = 30106
close(12)                               = 0
close(15)                               = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
kill(30103, SIGKILL)                    = -1 ESRCH (No such process)
wait4(30103, NULL, 0, NULL)             = -1 ECHILD (No child processes)
close(10)                               = 0
close(13)                               = 0
rt_sigaction(SIGTSTP, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, {0x7f0840940300, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, 8) = 0
poll([{fd=0, events=POLLIN}], 1, 0)     = 0 (Timeout)
poll([{fd=0, events=POLLIN}], 1, 0)     = 0 (Timeout)
rt_sigaction(SIGTSTP, {0x7f0840940300, [], SA_RESTORER|SA_RESTART, 0x7f084070a080}, NULL, 8) = 0

Note that it hangs here for about 10 seconds before continuing:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=30103, si_uid=1000, si_status=0, si_utime=2, si_stime=2} ---

...so the parent process doesn't call wait4() for quite some time, leaving the child as a zombie.

I only experience this when logging in with an account that is in a lot of very large rooms. On another account, it looks similar, but wait4() is called immediately after the child process exits:

[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=12, events=POLLIN}, {fd=14, events=POLLIN}], 4, 36 <unfinished ...>
[pid 29600] <... write resumed> )       = 386
[pid 29600] exit_group(0)               = ?
[pid 29353] <... poll resumed> )        = 1 ([{fd=14, revents=POLLHUP}])
[pid 29353] read(14, "", 8191)          = 0
[pid 29353] fcntl(6, F_GETFD)           = 0
[pid 29353] fcntl(0, F_GETFD)           = 0
[pid 29353] fcntl(12, F_GETFD)          = 0
[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}, {fd=12, events=POLLIN}], 3, 22) = 1 ([{fd=12, revents=POLLHUP}])
[pid 29353] read(12, "", 8191)          = 0
[pid 29353] fcntl(6, F_GETFD)           = 0
[pid 29353] fcntl(0, F_GETFD)           = 0
[pid 29353] poll([{fd=6, events=POLLIN}, {fd=0, events=POLLIN}], 2, 22 <unfinished ...>
[pid 29600] +++ exited with 0 +++
<... poll resumed> )                    = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29600, si_uid=1000, si_status=0, si_utime=6, si_stime=2} ---
restart_syscall(<... resuming interrupted poll ...>) = 0
wait4(29600, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 29600
pipe([13, 15])                          = 0
pipe([16, 17])                          = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f08415baa10) = 29602
close(15)                               = 0
close(17)                               = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3545, ...}) = 0
kill(29600, SIGKILL)                    = -1 ESRCH (No such process)
wait4(29600, NULL, 0, NULL)             = -1 ECHILD (No child processes)
close(12)                               = 0
strk commented 7 years ago

I'm also seeing the lag issue; watching top(1) showed WeeChat periodically keeping CPU usage above 90%. Unloading the matrix script fixed it.

aliencq commented 7 years ago

Noticing it too. High CPU load and random WeeChat freezes with matrix connected and synced (just loading matrix.lua doesn't cause problems).

torhve commented 7 years ago

The Matrix protocol makes an extreme number of HTTP requests, especially for large rooms. WeeChat was not designed to handle this as smoothly as it could have been.

I have introduced a new setting:

/set plugins.var.lua.matrix.presence_filter on

This setting prevents Synapse from sending presence events and ephemeral events, such as typing notifications and read receipts. It will greatly reduce the number of HTTP requests. Please try it and see if it helps.
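For context, this filtering happens server-side via the filter parameter of the Matrix /sync endpoint. A sketch of the kind of filter involved, following the Matrix client-server filter format (the exact filter the script sends may differ):

-- Empty type lists tell the homeserver to omit presence events and
-- per-room ephemeral events (typing notifications, read receipts).
local filter = [[{"presence":{"types":[]},"room":{"ephemeral":{"types":[]}}}]]
-- URL-encoded and appended to the sync request, e.g.
-- /_matrix/client/r0/sync?filter=<url-encoded filter>&timeout=30000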

The plan for the future is to use a (web)socket transport instead of regular HTTP, which WeeChat supports better, but there is no official transport of that kind yet, only a proof-of-concept proxy.

strk commented 7 years ago

I pulled the changes, started WeeChat, and ran /script load matrix.lua. The UI still freezes periodically after:

 matrix: Connecting to homeserver URL: https://matrix.org/

Executing /set plugins.var.lua.matrix.presence_filter on doesn't change things; I still cannot type smoothly, and it looks like the script just never connects (or never reports its connection status).

This happens as of commit baab3b7748934ba6bbb4a8c84dd0536fb19a0ae5

strk commented 7 years ago

BTW, it looks like plugins.var.lua.matrix.typing_notices is forcibly set to on when matrix.lua loads. Can that be?

jfrederickson commented 7 years ago

Can confirm, this doesn't appear to have fixed the issue for me either. It is a bit better, but I still get UI freezes every few seconds.

Apteryks commented 6 years ago

I can confirm this as well, using WeeChat 1.4 as packaged in Ubuntu 16.04, even when setting the new plugins.var.lua.matrix.presence_filter to on.

EDIT: I restarted WeeChat and it now seems OK. Maybe it was caused by some first-time fetching?

matthijskooijman commented 6 years ago

> EDIT: I restarted WeeChat and it now seems OK. Maybe it was caused by some first-time fetching?

I did /set plugins.var.lua.matrix.presence_filter on and /set plugins.var.lua.matrix.typing_notices off, which didn't seem to help. Then I did /lua reload matrix, and now the delays seem gone (I still see an occasional delay, probably because of the other requests still happening). It seems a reload (or WeeChat restart) is needed for these settings to take effect?

erdnaxeli commented 6 years ago

> It seems a reload (or WeeChat restart) is needed for these settings to take effect?

Exactly, most settings are only read at script startup.
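That matches the common WeeChat script pattern of caching option values in locals at load time, roughly like this (an illustrative sketch, not the script's actual code):

-- Read once when the script loads; a later /set changes the stored
-- option but not this cached value, hence the need for a reload.
local presence_filter = weechat.config_get_plugin('presence_filter') == 'on'

-- A script that wanted live updates would hook the options instead:
-- weechat.hook_config('plugins.var.lua.matrix.*', 'config_cb', '')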

ara4n commented 6 years ago

So I just dusted off the plugin, fixed some issues that meant I couldn't connect (https://github.com/torhve/weechat-matrix-protocol-script/pull/105), and tried it on WeeChat 2.0. This bug seems worse than ever and makes the script entirely unusable, with WeeChat pausing for ~20s every ~30s or so at 100% CPU, locking the whole UI. I'm now trying to restart with presence_filter on and typing_notices off, but I assume that whenever /sync returns because some background room received traffic, I'm going to see a similar delay.

What I don't understand is why WeeChat chews 100% CPU whilst this is happening, if it's just the threading model of the HTTP library being stupid. Also, critically, it used to be fine back in 2015, and I used WeeChat as my default client. Then there was some upgrade in WeeChat (perhaps 1.3 to 1.4?) and it all broke. I don't think it was correlated with my account getting particularly large (although I may have exceeded some threshold).

@torhve can you remind us why you think the number of HTTP reqs causes this behaviour?

ara4n commented 6 years ago

(I've now followed https://github.com/torhve/weechat-matrix-protocol-script/issues/30#issuecomment-348124061 and disabled presence and typing notifications, and have yet to see any freezes.)

tidux commented 6 years ago

@ara4n it actually looks like the stalls are an issue in Synapse, much like that riot-web search issue you and I were debugging earlier this week. I've got two terminal windows open next to this browser: one with a mosh session watching htop on my homeserver, and the other running WeeChat with this script locally. The 100%-CPU lockup while stalling seems to be WeeChat-related, but the stalls themselves match 1:1 with usage spikes on my homeserver, both the initial-load stalls and the activity-related stalls (typing notifications, room loads, etc.). I'm the only human user on my homeserver for now, so the only reason it ever spikes is when I'm doing something odd on a client.

I can confirm the stalls go away with the setting tweaks mentioned in https://github.com/torhve/weechat-matrix-protocol-script/issues/30#issuecomment-348124061 .

ara4n commented 6 years ago

hm. I think the fact that Synapse's CPU spikes coincide with WeeChat stalling is a symptom rather than a cause. If WeeChat ends up tight-looping while it waits for an HTTP response from the server for some reason, then you'd expect the server's CPU to shoot up while it calculates the response (which will inevitably hit 100% as it works out a particularly big, time-consuming response). However, the client freezing is WeeChat managing to block the UI whilst waiting for an HTTP request to complete, rather than it actually being Synapse's fault that it has to think hard about the response.