processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/ejabberd/

The system is overloaded with messages queued #3665

Closed. marcelinojung closed this issue 2 years ago.

marcelinojung commented 3 years ago

Environment

Errors from error.log/crash.log

No errors

Bug description

I installed ejabberd on an ARM machine for the first time and ran it as a test with about 20 online users. After a few weeks, I received the 'messages queued' overload warning shown below. There is no error log. When a client tried to log in 10 times, a timeout occurred once or twice (the relevant logs are also written with considerable delay). After restarting ejabberd, the warning and the issue have not occurred again so far, but I expect the same issue will recur once messages pile up again.

I have never seen such a delay or warning when running ejabberd on x86, so it seems that Erlang on the ARM machine is not handling the message queue properly, but I could not find the exact cause. Can you give me any information or ideas for tracking it down?


```
2021-08-06 06:18:19.304120+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 874579 messages queued by 5 process(es) (2%) from the following applications: kernel (873157 msgs), ejabberd (1422 msgs); the top processes are:
  ** <0.79.0>: msgs = 872499, memory = 148083156, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
  ** <0.362.0>: msgs = 1409, memory = 320988, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
  ** logger_std_h_default: msgs = 658, memory = 135024, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
  ** <0.363.0>: msgs = 9, memory = 3628, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
  ** <0.361.0>: msgs = 4, memory = 3188, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd

2021-08-06 06:20:19.856403+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 874592 messages queued by 5 process(es) (2%) from the following applications: kernel (873170 msgs), ejabberd (1422 msgs); the top processes are:
  ** <0.79.0>: msgs = 873004, memory = 148127596, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
  ** <0.362.0>: msgs = 1409, memory = 266812, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
  ** logger_std_h_default: msgs = 166, memory = 61444, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
  ** <0.363.0>: msgs = 9, memory = 3628, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
  ** <0.361.0>: msgs = 4, memory = 3188, initial_call = ejabberd_listener:init/3, current_function = prim_inet:accept0/3, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
```


licaon-kter commented 3 years ago

Gist of the config?

Do you really have so much traffic?

marcelinojung commented 3 years ago

Traffic was normal. Currently, three ejabberd servers with the same config are being tested in development; the others are x86, and this warning and issue occurred only on the ARM one. There is nothing in the config that looks particularly problematic. CPU and memory usage are normal at 2-10%, with about 20 test users per server. The ARM server in question is an AWS t4g.medium (2 ARM vCPUs / 4 GB memory), an instance type newly added to EC2. On an existing t3a.medium (2 x86 vCPUs / 4 GB memory) server, 20,000 users have been online simultaneously for about 3 years, using about 10% CPU and 60% memory.

It seems the messages continued to accumulate, as shown below.

2021-08-06 06:24:07.254735+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 875678 messages queued by 5 process(es) (2%) from the following applications: kernel (874254 msgs), ejabberd (1424 msgs); the top processes are:

Looking at an earlier log entry, roughly 60,000 messages accumulated over about 4 hours:

2021-08-06 02:15:25.119921+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 813766 messages queued by 5 process(es) (2%) from the following applications: kernel (812511 msgs), ejabberd (1255 msgs); the top processes are:

Is there a way to check the msg queue directly?
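
One generic way to do that is to query `erlang:process_info/2` from an attached Erlang shell (e.g. `ejabberdctl debug`). Below is a minimal sketch using only standard Erlang calls; the pid in the second expression is just the `<0.79.0>` reported in the warnings above.

```erlang
%% List the five processes with the longest message queues.
Queues = [{P, N} || P <- erlang:processes(),
                    {message_queue_len, N} <- [erlang:process_info(P, message_queue_len)]],
lists:sublist(lists:reverse(lists:keysort(2, Queues)), 5).

%% Then inspect one suspicious pid in more detail, e.g. <0.79.0> from the log above.
erlang:process_info(pid(0, 79, 0), [message_queue_len, memory, current_function, initial_call]).
```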

licaon-kter commented 3 years ago

Still waiting for ejabberd .yml config...

marcelinojung commented 3 years ago

ejabberd.yml

```yaml
hosts:
  - "HOST NAME"

host_config:
  "HOST NAME":
    auth_method: sql
    sql_type: mysql
    sql_server: "AWS RDS URL"
    sql_database: "DATABASE NAME"
    sql_username: "ID"
    sql_password: "PASSWORD"
    sql_port: "PORT"

loglevel: debug
log_rotate_size: 10485760
log_rotate_count: 100
new_sql_schema: true
resource_conflict: closeold
negotiation_timeout: 60
cache_size: 40000

certfiles:
  - ~/DIRECTORY/fullchain.pem
  - ~/DIRECTORY/private.key

listen:
  - port: 5222
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
  - port: 5223
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls: true
    starttls_required: true
    tls_compression: false
    ciphers: "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS"
  - port: 5443
    ip: "::"
    module: ejabberd_http
    tls: true
    request_handlers:
      /admin: ejabberd_web_admin
      ##/api: mod_http_api
      /bosh: mod_bosh
      /mqtt: mod_mqtt
      /captcha: ejabberd_captcha
      /upload: mod_http_upload
      /ws: ejabberd_http_ws
  - port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /api: mod_http_api
  - port: 3478
    ip: "::"
    transport: udp
    module: ejabberd_stun
    use_turn: true
  - port: 1883
    ip: "::"
    module: mod_mqtt
    backlog: 1000
    tls: true

s2s_use_starttls: optional

acl:
  admin:
    user:
      - "admin": "HOST NAME"
  local:
    user_regexp: ""
  loopback:
    ip:
      - 127.0.0.0/8
      - ::1/128

access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback

api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      access:
        allow:
          - acl: loopback
          - acl: admin
      oauth:
        scope: "ejabberd:admin"
        access:
          allow:
            - acl: loopback
            - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who: all
    what:
      - status
      - connected_users_number
      - unregister
      - register

modules:
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  mod_fail2ban: {}
  mod_http_api: {}
  mod_last:
    db_type: sql
  mod_mam:
    db_type: sql
    assume_mam_usage: true
    default: always
  mod_mqtt:
    db_type: sql
  mod_offline:
    db_type: sql
    access_max_user_messages: max_user_offline_messages
  mod_ping:
    send_pings: true
    ping_interval: 1 min
    ping_ack_timeout: 32
    timeout_action: kill
  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    ip_access: trusted_network
  mod_roster:
    db_type: sql
    versioning: true
  mod_shared_roster: {}
  mod_stream_mgmt:
    queue_type: ram
    max_ack_queue: 5000
    resend_on_timeout: if_offline
  mod_stun_disco: {}
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false
```

marcelinojung commented 3 years ago

After restarting the ejabberd server (8/21), the issue recurred. Calculating the rate at which messages accumulate, it reappeared at about the expected time. Online users were 20~30 (registered users are about 200). Every client sends an iq once or twice every 10 seconds and re-logs in every 5 minutes. I suspect the messages are accumulating in Mnesia, but I don't know how to check that yet. I can't find the same issue in the logs of the multiple ejabberd servers still running on x86. At the moment I can only suspect an issue with the AWS ARM cores, but I cannot be sure because I have no ejabberd running on other ARM hardware. I will be able to take the next step once I find solid evidence or a solution, so I would appreciate any help.
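
Regarding checking whether anything is actually piling up in Mnesia: a minimal sketch using the standard Mnesia API from an attached shell (e.g. `ejabberdctl debug`); which tables exist depends on the modules configured on this deployment.

```erlang
%% List Mnesia tables with their record counts and memory usage (in words),
%% largest tables first.
Sizes = [{T, mnesia:table_info(T, size), mnesia:table_info(T, memory)}
         || T <- mnesia:system_info(tables)],
lists:reverse(lists:keysort(2, Sizes)).
```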


```
2021-09-27 23:04:51.487243+00:00 [warning] <0.88.0>@ejabberd_system_monitor:handle_overload/2:138 The system is overloaded with 396569 messages queued by 26 process(es) (9%) from the following applications: kernel (392597 msgs), ejabberd (3972 msgs); the top processes are:
  ** <0.79.0>: msgs = 392392, memory = 75794620, initial_call = erlang:apply/2, current_function = io:execute_request/2, ancestors = [], application = kernel
  ** <0.300.0>: msgs = 3082, memory = 960916, initial_call = ejabberd_listener:init/3, current_function = gen:do_call/4, ancestors = [ejabberd_listener,ejabberd_sup,<0.119.0>], application = ejabberd
  ** <0.29348.5>: msgs = 279, memory = 426828, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [ejabberd_c2s_sup,ejabberd_sup,<0.119.0>], application = ejabberd
  ** logger_std_h_default: msgs = 205, memory = 77484, initial_call = logger_olp:init/1, current_function = logger_std_h:file_ctrl_call/2, ancestors = [logger_sup,kernel_sup,<0.47.0>], application = kernel
  ** <0.29945.5>: msgs = 162, memory = 284884, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [ejabberd_c2s_sup,ejabberd_sup,<0.119.0>], application = ejabberd
```

badlop commented 2 years ago

One of the clues you got is that it only happens in the ARM machine.

Erlang version: Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 11.1.7

I guess that was a version close to Erlang/OTP 23.3. I wonder if the problem is still present if using the latest stable 24.3, or even 25-rc1...
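
For what it's worth, the exact OTP release and ERTS version of the running node can be checked from an attached shell; a small sketch:

```erlang
%% Print the OTP release (e.g. "23") and ERTS version (e.g. "11.1.7") of the live node.
io:format("OTP ~s, ERTS ~s~n",
          [erlang:system_info(otp_release), erlang:system_info(version)]).
```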

nstiac commented 8 months ago

This (or something similar) also happens on a regular Intel x64 (Xeon) machine, with Erlang/OTP 22 [erts-10.4.4] [source] [64-bit] and ejabberd 23.04. The machine just suddenly loads up on messages.

```
2024-02-20 22:55:00.738664+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 1967163 messages queued by 10 process(es) (2%) from the following applications: ejabberd (1967163 msgs); the top processes are:
  ** <0.5005.3>: msgs = 819617, memory = 6331179968, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5004.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
  ** <0.4927.3>: msgs = 694653, memory = 4190023404, initial_call = xmpp_stream_in:init/1, current_function = p1_server:collect_messages/3, ancestors = [<0.4926.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
  ** <0.5029.3>: msgs = 452852, memory = 3192091880, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5028.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
  ** <0.4989.3>: msgs = 30, memory = 690304, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.4988.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
  ** <0.4909.3>: msgs = 3, memory = 3101465796, initial_call = ejabberd_bosh:init/1, current_function = mnesia_lib:db_get/3, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-20 22:55:00.739087+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 196716 message(s) each

2024-02-20 22:56:21.989115+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.5028.3> consumes more than 5% of OS memory (msgs = 0, memory = 6431197880, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
```

The only other possibly relevant information is that the log is filled with: Closing c2s session for ... Stream reset by peer.

```
2024-02-20 22:47:52.086816+00:00 [info] <0.4826.3>@ejabberd_c2s:process_auth_result/3:272 (tcp|<0.4826.3>) Accepted c2s PLAIN authentication for xxxx@yyyyy by external backend from 31.193.168.202
2024-02-20 22:47:52.087641+00:00 [info] <0.4826.3>@ejabberd_c2s:bind/2:450 (tcp|<0.4826.3>) Opened c2s session for rxxxx@yyyyy/zzzz
2024-02-20 22:47:52.103140+00:00 [info] <0.4826.3>@ejabberd_c2s:process_terminated/2:295 (tcp|<0.4826.3>) Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer
```

Could it be that those "Stream reset by peer" sessions are actually accumulating as undelivered messages? If so, how can we work around that, given that it is definitely not something that can be handled on the client side since this is a known issue? Can we forgo those messages and have them purged somehow?
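
As the do_kill line above shows, ejabberd_system_monitor already kills the worst offenders automatically. If a specific runaway process has to be dealt with by hand, here is a blunt sketch from an attached shell (e.g. `ejabberdctl debug`); the pid below is just the `<0.5005.3>` quoted in the warning, and killing it drops that BOSH session.

```erlang
%% Inspect and then forcibly terminate one runaway process.
Pid = pid(0, 5005, 3),
erlang:process_info(Pid, [message_queue_len, memory, current_function]),
exit(Pid, kill).
```

Whether the queued BOSH messages can simply be discarded is a separate question; this only removes the process that is holding them.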