processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

HUGE MEM SPIKES / ejabberd_system_monitor:handle_overload : The system is overloaded with #4161

Closed nstiac closed 2 months ago

nstiac commented 4 months ago

Environment

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml

```yaml
loglevel: 4
shaper:
  normal:
    rate: 3000
    burst_size: 6000
  fast: 100000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: xxxx
    100: all
  c2s_shaper:
    none: xxxx
    normal: all
  s2s_shaper: fast
auth_method: sql
host_config:
  "xxxxxx":
    extauth_program: "xxxxxxxxxx"
    auth_method: [external]
    auth_use_cache: false
sql_pool_size: 6
sql_keepalive_interval: 150
listen:
  - port: xxxx
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
  - port: xxxx
    module: ejabberd_http
    tls: true
modules:
  mod_avatar:
  mod_bosh:
    max_inactivity: 150
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state:
  mod_mam:
  mod_offline:
  mod_ping: {}
  mod_pubsub:
  mod_push: {}
  mod_push_keepalive: {}
  mod_roster:
  mod_shared_roster:
  mod_stats: {}
  mod_stream_mgmt:
```

ps.- Obvious omissions in the config for security purposes.

Errors from error.log/crash.log

Nine hours after a dawn restart, with a relatively small internal userbase that mainly uses pubsub as a "logged in or not" status indicator:

2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 1072045 messages queued by 7 process(es) (1%) from the following applications: ejabberd (1072045 msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each

Bug description

After upgrading our server to an absurd amount of memory, we have been debugging server performance and noticed some HUGE spikes in memory usage (30 GB+ spikes in resident memory, 60 GB+ virtual) coming from beam.smp/ejabberd, whereas its average resident size stays steadily below ~400 MB before and after the spikes.

PID          TID      MINFLT      MAJFLT      VSTEXT      VSLIBS       VDATA      VSTACK      LOCKSZ       VSIZE        RSIZE       PSIZE       VGROW       RGROW      SWAPSZ      RUID          EUID            MEM      CMD
1862            -      7672e4           0        3.6M        9.6M       58.0G      132.0K        0.0K       62.9G        36.7G          0B       50.4G       31.1G          0B      ejabberd      ejabberd        39%      beam.smp
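(For reference, a minimal way to check from `ejabberdctl debug` whether that growth actually sits inside the Erlang VM; these are standard OTP calls only, nothing ejabberd-specific, and the output values are illustrative:)

```erlang
%% Run inside `ejabberdctl debug` (plain OTP calls).

%% Overall VM memory breakdown: total, processes, binary, ets, ...
erlang:memory().

%% The five processes holding the most memory, as {Bytes, Pid} pairs.
%% The inner generator pattern-skips processes that died mid-scan
%% (process_info/2 returns 'undefined' for dead processes).
lists:sublist(
  lists:reverse(lists:sort(
    [{Bytes, P} || P <- erlang:processes(),
                   {memory, Bytes} <- [erlang:process_info(P, memory)]])),
  5).
```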

So, after looking at some logs we came upon these:

2024-02-20 22:55:00.738664+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 1967163 messages queued by 10 process(es) (2%) from the following applications: ejabberd (1967163 msgs); the top processes are:
** <0.5005.3>: msgs = 819617, memory = 6331179968, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5004.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4927.3>: msgs = 694653, memory = 4190023404, initial_call = xmpp_stream_in:init/1, current_function = p1_server:collect_messages/3, ancestors = [<0.4926.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.5029.3>: msgs = 452852, memory = 3192091880, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5028.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4989.3>: msgs = 30, memory = 690304, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.4988.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4909.3>: msgs = 3, memory = 3101465796, initial_call = ejabberd_bosh:init/1, current_function = mnesia_lib:db_get/3, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-20 22:55:00.739087+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 196716 message(s) each

2024-02-20 22:56:21.989115+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.5028.3> consumes more than 5% of OS memory (msgs = 0, memory = 6431197880, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)

The only other possibly relevant information from that log is that it was filled with: Closing c2s session for ... Stream reset by peer.

2024-02-20 22:47:52.086816+00:00 [info] <0.4826.3>@ejabberd_c2s:process_auth_result/3:272 (tcp|<0.4826.3>) Accepted c2s PLAIN authentication for xxxx@yyyyy by external backend from customip
2024-02-20 22:47:52.087641+00:00 [info] <0.4826.3>@ejabberd_c2s:bind/2:450 (tcp|<0.4826.3>) Opened c2s session for rxxxx@yyyyy/zzzz
2024-02-20 22:47:52.103140+00:00 [info] <0.4826.3>@ejabberd_c2s:process_terminated/2:295 (tcp|<0.4826.3>) Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer

(Could it be that those "stream reset by peer" are actually accumulating as undelivered messages? If so, how to bypass that given it is definetely not something that can be handled on the client side as this is a known issue :: can we forego those messages and have'm purged somehow?)
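(A minimal sketch of what a manual purge could look like from `ejabberdctl debug`; the pid below is a placeholder taken from the overload warnings above, and killing a session process disconnects whatever client it serves:)

```erlang
%% Run inside `ejabberdctl debug`. Substitute a pid reported by the
%% overload warning for the placeholder pid(0,22333,0).

%% How many messages are stuck in the queue, without dumping them all:
erlang:process_info(pid(0,22333,0), message_queue_len).

%% Brutal cleanup: kill the process, dropping its whole queue. This is
%% essentially what ejabberd_system_monitor's do_kill already does.
exit(pid(0,22333,0), kill).

%% A gentler route, if available in your build, is the kick_session
%% admin command, e.g.:
%%   ejabberdctl kick_session user host resource "overloaded queue"
```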


In order to clean this up a bit, we cleared our logs and restarted the server yesterday at dawn to obtain a cleaner, shorter log. Here is a copy of the past 12 hours (as you can see, only a few lines show; it was dawn after all, yet the problem still appears):

FULL LOG (not too big, I promise):

```
2024-02-21 04:13:03.210833+00:00 [info] <0.110.0>@ejabberd_config:load/1:82 Loading configuration from /......../ejabberd.yml
2024-02-21 04:13:03.235022+00:00 [notice] <0.110.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'warning'
2024-02-21 05:58:55.592409+00:00 [warning] <0.1755.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.1754.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from 79.116.194.11: Invalid username or password
2024-02-21 06:00:50.374004+00:00 [warning] <0.1775.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.1775.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from customip: Invalid username or password
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
2024-02-21 11:17:55.749593+00:00 [warning] <0.10425.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.10425.0>) Failed c2s HHHHHHHH authentication for r2c4e_06bb92905@customname from customip: Invalid username or password
2024-02-21 11:41:17.019494+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 33137 messages queued by 3 process(es) (1%) from the following applications: ejabberd (33137 msgs); the top processes are:
** <0.11261.0>: msgs = 33134, memory = 233377736, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11260.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11196.0>: msgs = 2, memory = 34592, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11195.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11195.0>: msgs = 1, memory = 1038675412, initial_call = ejabberd_bosh:init/1, current_function = lists:keysearch/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 11:41:17.019824+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 11046 message(s) each
2024-02-21 12:36:39.426190+00:00 [warning] <0.14383.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.14382.0>) Failed c2s HHHHHHHH authentication for r2c2a_8ddc25571@customname from 92.9.103.72: Invalid username or password
2024-02-21 12:59:25.546390+00:00 [warning] <0.15533.0> unexpected event from XML driver: {xmlstreamelement, {xmlel,<<"presence">>, [{<<"xmlns">>,<<"jabber:client">>}], []}}; xmlstreamstart was expected
2024-02-21 13:35:12.294509+00:00 [warning] <0.17141.0> unexpected event from XML driver: {xmlstreamelement, {xmlel,<<"presence">>, [{<<"xmlns">>,<<"jabber:client">>}], []}}; xmlstreamstart was expected
2024-02-21 14:18:45.365339+00:00 [warning] <0.18747.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.18746.0>) Failed c2s HHHHHHHH authentication for r2ce4_20b991a74@customname from 178.139.173.139: Invalid username or password
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
2024-02-21 14:29:25.458126+00:00 [warning] <0.19138.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.19138.0>) Failed c2s HHHHHHHH authentication for r2c4e_48518f402@customname from customip: Invalid username or password
2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 1072045 messages queued by 7 process(es) (1%) from the following applications: ejabberd (1072045 msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each
2024-02-21 15:42:07.958236+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22412.0> consumes more than 5% of OS memory (msgs = 762077, memory = 5504763272, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
2024-02-21 15:42:07.968733+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with 762293 messages queued by 3 process(es) (1%) from the following applications: ejabberd (762293 msgs); the top processes are:
** <0.22412.0>: msgs = 762288, memory = 5505483600, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584555140, initial_call = ejabberd_bosh:init/1, current_function = mnesia:dirty_index_read/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22342.0>: msgs = 2, memory = 3721758692, initial_call = ejabberd_bosh:init/1, current_function = mnesia:val/1, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 15:42:07.968996+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 254098 message(s) each
2024-02-21 15:45:08.022471+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
2024-02-21 15:50:10.641543+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
```

More info :

Please notice 👀 Where do 1+ MILLION messages come from? How can I see a sample of these "messages"? (Could it be that those "stream reset by peer" events and failed authentications are actually accumulating as undelivered messages? If so, can we forgo those messages and have them purged somehow?)

SQL shows nothing in "last" or "archive", nor any other relevant numbers.

Webadmin shows:

```
Statistics of ejabberd@localhost

Uptime:                  47402.294   (started 2024-02-21 04:13:02)
CPU Time:                1520.227
Online Users:            63
Transactions Committed:  10
Transactions Aborted:    6
Transactions Restarted:  0
Transactions Logged:     4

Database Tables at ejabberd@localhost

Name                 Elements  Memory
archive_msg          0         43,712
archive_prefs        0         43,712
bosh                 56        10,952
bytestream           0         2,440
caps_features        0         43,712
ejabberd_commands    82        189,376
last_activity        0         43,712
mod_register_ip      0         2,440
motd                 0         43,712
motd_users           0         43,712
mqtt_pub             0         43,712
mqtt_session         0         2,440
mqtt_sub             0         1,112
muc_online_room      0         1,112
muc_registered       0         2,440
muc_room             0         2,440
oauth_client         0         2,440
oauth_token          0         43,712
offline_msg          0         43,712
passwd               0         43,712
privacy              0         43,712
private_storage      0         43,712
pubsub_index         0         2,440
pubsub_item          0         43,712
pubsub_last_item     0         2,440
pubsub_node          0         2,440
pubsub_orphan        0         2,440
pubsub_state         0         1,112
push_session         0         43,712
reg_users_counter    0         2,440
roster               0         43,712
roster_version       0         43,712
route                6         3,512
route_multicast      0         2,440
s2s                  0         2,440
schema               43        45,856
session              56        33,920
session_counter      0         2,440
sr_group             0         2,440
sr_user              0         2,440
temporarily_blocked  0         2,440
vcard                0         43,712
vcard_search         0         2,440
```
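(The same per-table numbers can also be pulled straight from the debug console; a quick sketch using plain mnesia calls, nothing ejabberd-specific:)

```erlang
%% Run inside `ejabberdctl debug`: every mnesia table with its element
%% count and memory footprint (memory is reported in words, not bytes).
[{T,
  mnesia:table_info(T, size),
  mnesia:table_info(T, memory)}
 || T <- mnesia:system_info(tables)].
```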

Any help would be greatly appreciated on debugging what is going on and why is this happening.

ps.- Also, there is no correlation with failed/closed sessions or any burst in network activity.

licaon-kter commented 4 months ago

Where does mod_bosh_r2webusers come from?

prefiks commented 4 months ago

I guess r2webusers is a host that they use.

First thing: is your client able to use websocket instead of bosh? If so, please change it to use websocket; bosh is slower and uses significantly more resources. (See the config sketch below.)
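A minimal sketch of what enabling websocket could look like in ejabberd.yml, assuming a recent ejabberd with `request_handlers` support; the port and paths are placeholders, not taken from the issue's config:

```yaml
listen:
  - port: 5443              # placeholder; reuse your existing ejabberd_http port
    module: ejabberd_http
    tls: true
    request_handlers:
      /bosh: mod_bosh       # keep BOSH available during the migration
      /ws: ejabberd_http_ws # websocket endpoint for clients to switch to
```

Clients would then connect to wss://yourhost:5443/ws instead of the BOSH URL.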

To diagnose this you could use the ejabberd debug console. Once it is started, you can see the processes with big queues by executing p1_prof:q(). in it (the dot at the end is significant, don't skip it). This lists the 10 processes with the biggest queues. To see what messages are queued, copy the pid(..., ..., ...) from that command's output (it will have three numbers) and execute erlang:process_info(pid(..., ..., ...), messages). with that pid.
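For reference, a session along those lines might look like this; the pid numbers are placeholders, use ones from your own p1_prof:q(). output:

```erlang
%% Attach to the running node first (from the ejabberd installation):
%%   ejabberdctl debug

%% Inside the Erlang shell: show the 10 processes with the biggest queues.
p1_prof:q().

%% Say the output lists pid(0,22412,0) with a large queue. Check the
%% queue length first, since dumping a million messages is expensive:
erlang:process_info(pid(0,22412,0), message_queue_len).

%% Then dump the queued messages themselves for inspection:
erlang:process_info(pid(0,22412,0), messages).
```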

I am guessing those messages will be XMPP packets that need to be delivered to the client, but the bosh connection is just too slow to deliver them at the rate they arrive.