marcelinojung closed this issue 2 years ago
Gist of the config?
Do you really have so much traffic?
Traffic was normal. Currently, three ejabberd servers with the same config are being tested in development. All are x86, and this warning and issue occurred only on ARM. There is nothing in the config that should be particularly problematic. CPU and memory usage are very normal at 2-10%, and there are about 20 test users per server. The ARM server in question is an AWS t4g.medium (2 ARM CPUs / 4 GB memory), newly added to our AWS EC2 group. On the existing t3a.medium (2 x86 CPUs / 4 GB memory) server, 20,000 users have been online simultaneously for about 3 years, using about 10% CPU and 60% memory.
It seemed that the messages continued to accumulate, as shown below.
2021-08-06 06:24:07.254735+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 875678 messages queued by 5 process(es) (2%) from the following applications: kernel (874254 msgs), ejabberd (1424 msgs); the top processes are:
Looking at the previous log, about 60,000 msgs accumulated in 4 hours:
2021-08-06 02:15:25.119921+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 813766 messages queued by 5 process(es) (2%) from the following applications: kernel (812511 msgs), ejabberd (1255 msgs); the top processes are:
Is there a way to check the msg queue directly?
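As a note (not from the original thread): the per-process message queues can be inspected from an Erlang shell attached to the running node, for example via `ejabberdctl debug`. This is a generic Erlang sketch using only standard BIFs (`erlang:processes/0`, `erlang:process_info/2`), not an ejabberd-specific API:

```erlang
%% List the 10 processes with the largest message queues.
%% Run from a shell attached to the ejabberd node (e.g. `ejabberdctl debug`).
Queues = [case erlang:process_info(P, message_queue_len) of
              {message_queue_len, N} -> {P, N};
              undefined -> {P, 0}        % process died while we were scanning
          end || P <- erlang:processes()],
Top10 = lists:sublist(lists:reverse(lists:keysort(2, Queues)), 10).

%% Inspect one suspicious pid in more detail, e.g. <0.79.0>:
erlang:process_info(pid(0, 79, 0),
                    [message_queue_len, memory, current_function, initial_call]).
```

This shows the same `msgs` / `memory` figures that `ejabberd_system_monitor` reports, so it can be used to watch a queue grow between the warning logs.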
Still waiting for the ejabberd.yml config...
ejabberd.yml
After restarting the ejabberd server (8/21), this issue recurred. Calculating the rate at which the messages accumulate, it appears to have reproduced at roughly the expected time. Online users were 20-30 (registered users are about 200). Every client sends an iq once or twice every 10 seconds and re-logs in every 5 minutes. It looks like the messages are accumulating in Mnesia, but I don't know how to verify that yet. I can't find the same issue in the logs of the multiple ejabberd servers still running on x86. Currently, I can only suspect an issue with the AWS ARM core, but I cannot be sure because I have no ejabberd running on other ARM hardware. I think I will be able to take the next step once I find solid evidence or a solution. I would appreciate your help.
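As an aside (assuming the suspicion above is that a Mnesia table, rather than a process mailbox, is growing), table record counts and memory can be listed from an attached shell with the standard Mnesia API:

```erlang
%% Record count and memory (in words) of every Mnesia table on this node.
%% Run from a shell attached to the ejabberd node (e.g. `ejabberdctl debug`).
[{Tab,
  mnesia:table_info(Tab, size),      % number of records
  mnesia:table_info(Tab, memory)}    % memory used, in words
 || Tab <- mnesia:system_info(tables)].
```

Note that the monitor output above attributes the queued messages to process mailboxes (mostly the `kernel` application's io process), so a growing Mnesia table would be a separate symptom.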
2021-09-27 23:04:51.487243+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 396569 messages queued by 26 process(es) (9%) from the following applications: kernel (392597 msgs), ejabberd (3972 msgs); the top processes are:
<0.79.0>: msgs = 392392, memory = 75794620, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
<0.300.0>: msgs = 3082, memory = 960916, initial_call = ejabberd_listener:init/3, current_function = gen:do_call/4, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
<0.29348.5>: msgs = 279, memory = 426828, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [ejabberd_c2s_sup,ejabberd_sup,<0.119.0>], application = ejabberd
logger_std_h_default: msgs = 205, memory = 77484, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
** <0.29945.5>: msgs = 162, memory = 284884, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [ejabberd_c2s_sup,ejabberd_sup,<0.119.0>], application = ejabberd
One of the clues you got is that it only happens in the ARM machine.
Erlang version: Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 11.1.7
I guess that was a version close to Erlang/OTP 23.3. I wonder if the problem is still present if using the latest stable 24.3, or even 25-rc1...
This (or something similar) also happens on a regular Intel x64 (Xeon) machine, with Erlang/OTP 22 [erts-10.4.4] [source] [64-bit] and ejabberd 23.04. The machine just suddenly loads up on messages.
2024-02-20 22:55:00.738664+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 1967163 messages queued by 10 process(es) (2%) from the following applications: ejabberd (1967163 msgs); the top processes are:
<0.5005.3>: msgs = 819617, memory = 6331179968, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5004.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
<0.4927.3>: msgs = 694653, memory = 4190023404, initial_call = xmpp_stream_in:init/1, current_function = p1_server:collect_messages/3, ancestors = [<0.4926.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
<0.5029.3>: msgs = 452852, memory = 3192091880, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5028.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
<0.4989.3>: msgs = 30, memory = 690304, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.4988.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4909.3>: msgs = 3, memory = 3101465796, initial_call = ejabberd_bosh:init/1, current_function = mnesia_lib:db_get/3, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-20 22:55:00.739087+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 196716 message(s) each
2024-02-20 22:56:21.989115+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.5028.3> consumes more than 5% of OS memory (msgs = 0, memory = 6431197880, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
The only other possibly relevant information is that the log is filled with: Closing c2s session for ... Stream reset by peer.
2024-02-20 22:47:52.086816+00:00 [info] <0.4826.3>@ejabberd_c2s:process_auth_result/3:272 (tcp|<0.4826.3>) Accepted c2s PLAIN authentication for xxxx@yyyyy by external backend from 31.193.168.202
2024-02-20 22:47:52.087641+00:00 [info] <0.4826.3>@ejabberd_c2s:bind/2:450 (tcp|<0.4826.3>) Opened c2s session for rxxxx@yyyyy/zzzz
2024-02-20 22:47:52.103140+00:00 [info] <0.4826.3>@ejabberd_c2s:process_terminated/2:295 (tcp|<0.4826.3>) Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer
Could it be that those "Stream reset by peer" events are actually accumulating as undelivered messages? If so, how do we bypass that, given it is definitely not something that can be handled on the client side, as this is a known issue? (Can we forgo those messages and have them purged somehow?)
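For what it's worth, a blunt workaround sketch, assuming that killing the affected session processes is acceptable: it does by hand roughly what `ejabberd_system_monitor`'s `do_kill` already does automatically (the `Killed 3 process(es)` line above), terminating any process whose mailbox exceeds a chosen threshold. The threshold value here is arbitrary; the affected sessions are dropped and clients must reconnect.

```erlang
%% Kill every process whose message queue exceeds Threshold messages.
%% Run from a shell attached to the ejabberd node (e.g. `ejabberdctl debug`).
%% WARNING: this forcibly drops the affected sessions.
Threshold = 100000,  % arbitrary; tune to your own traffic
[exit(P, kill)
 || P <- erlang:processes(),
    case erlang:process_info(P, message_queue_len) of
        {message_queue_len, N} -> N > Threshold;
        undefined -> false      % process already gone
    end].
```

This clears the symptom, not the cause; if the BOSH sessions immediately start accumulating again, the underlying delivery problem is still there.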
Environment
Errors from error.log/crash.log
No errors
Bug description
I installed ejabberd on an ARM core for the first time and ran it as a test with about 20 online users. After a few weeks, I received the 'messages queue overload' warning log below. There is no error log. When a client tried to log in 10 times, a timeout occurred 1 or 2 times. (The relevant logs are also written with considerable delay.) After restarting ejabberd, the same log and issue have not occurred so far. However, I think the same issue will recur once messages pile up again.
I've never seen such a delay or warning message when running ejabberd on x86, so it seems that Erlang on the ARM core does not handle the message queue properly. However, I could not find the exact cause. Can you give me any information or ideas to help find it?
2021-08-06 06:18:19.304120+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 874579 messages queued by 5 process(es) (2%) from the following applications: kernel (873157 msgs), ejabberd (1422 msgs); the top processes are:
<0.79.0>: msgs = 872499, memory = 148083156, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
<0.362.0>: msgs = 1409, memory = 320988, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
logger_std_h_default: msgs = 658, memory = 135024, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
<0.363.0>: msgs = 9, memory = 3628, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
** <0.361.0>: msgs = 4, memory = 3188, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
2021-08-06 06:20:19.856403+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 874592 messages queued by 5 process(es) (2%) from the following applications: kernel (873170 msgs), ejabberd (1422 msgs); the top processes are:
<0.79.0>: msgs = 873004, memory = 148127596, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
<0.362.0>: msgs = 1409, memory = 266812, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
logger_std_h_default: msgs = 166, memory = 61444, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
<0.363.0>: msgs = 9, memory = 3628, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
** <0.361.0>: msgs = 4, memory = 3188, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd