processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/
Other
6.03k stars 1.5k forks source link

ejabber crashes after a few minutes #4054

Open jaymzh opened 1 year ago

jaymzh commented 1 year ago

This seems to be roughly identical to https://github.com/processone/ejabberd/issues/3457 - but that's 2 years old, so re-opening it seemed bad form.

After the recent upgrade to 23.01, my ejabberd crashes, seemingly in due to the memsup system crashing.

Environment

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml

```yaml loglevel: 2 log_rotate_size: infinity hosts: - "ipom.com" certfiles: - "/etc/letsencrypt/live/www.ipom.com/fullchain.pem" - "/etc/letsencrypt/live/www.ipom.com/privkey.pem" acme: auto: false define_macro: 'TLS_CIPHERS': "HIGH:!aNULL:!eNULL:!3DES:@STRENGTH" 'TLS_OPTIONS': - "no_sslv3" - "no_tlsv1" - "no_tlsv1_1" - "cipher_server_preference" - "no_compression" c2s_ciphers: 'TLS_CIPHERS' s2s_ciphers: 'TLS_CIPHERS' c2s_protocol_options: 'TLS_OPTIONS' s2s_protocol_options: 'TLS_OPTIONS' listen: - port: 5222 ip: "::" module: ejabberd_c2s max_stanza_size: 262144 shaper: c2s_shaper access: c2s starttls: true starttls_required: true protocol_options: 'TLS_OPTIONS' - port: 5269 ip: "::" module: ejabberd_s2s_in disable_sasl_mechanisms: - "digest-md5" - "X-OAUTH2" s2s_use_starttls: required auth_method: internal resource_conflict: closeold auth_password_format: scram shaper: normal: rate: 3000 burst_size: 20000 fast: 200000 shaper_rules: max_user_sessions: 10 max_user_offline_messages: 5000: admin 100: all c2s_shaper: none: admin normal: all s2s_shaper: fast max_fsm_queue: 1000 acl: local: user_regexp: "" loopback: ip: - "127.0.0.0/8" access_rules: local: allow: local c2s: deny: blocked allow: all announce: allow: admin configure: allow: admin muc_create: allow: local pubsub_createnode: allow: local trusted_network: allow: loopback api_permissions: "console commands": from: - ejabberd_ctl who: all what: "*" "admin access": who: access: allow: - acl: loopback - acl: admin oauth: scope: "ejabberd:admin" access: allow: - acl: loopback - acl: admin what: - "*" - "!stop" - "!start" "public commands": who: ip: 127.0.0.1/8 what: - status - connected_users_number language: "en" modules: mod_adhoc: {} mod_admin_extra: {} mod_announce: # recommends mod_adhoc access: announce mod_blocking: {} # requires mod_privacy mod_bosh: {} mod_caps: {} mod_carboncopy: {} mod_configure: {} # requires mod_adhoc mod_disco: {} mod_fail2ban: {} mod_http_api: {} mod_last: {} mod_mqtt: {} mod_muc: access: - allow access_admin: - allow: admin access_create: muc_create access_persistent: muc_create access_mam: - allow default_room_options: mam: true mod_muc_admin: {} mod_offline: access_max_user_messages: max_user_offline_messages mod_ping: {} mod_privacy: {} mod_private: {} mod_pubsub: access_createnode: pubsub_createnode ignore_pep_from_offline: true last_item_cache: false plugins: - "flat" - "pep" # pep requires mod_caps force_node_config: "eu.siacs.conversations.axolotl.*": access_model: open storage:bookmarks: access_model: whitelist mod_push: {} mod_push_keepalive: {} mod_register: welcome_message: subject: "Welcome!" body: |- Hi. Welcome to this XMPP server. ip_access: trusted_network access: register mod_roster: versioning: true mod_s2s_dialback: {} mod_shared_roster: {} mod_sic: {} mod_stream_mgmt: resend_on_timeout: if_offline mod_stun_disco: {} mod_stats: {} mod_time: {} mod_vcard: search: false mod_vcard_xupdate: {} mod_version: {} ```

Errors from error.log/crash.log

2023-06-28 21:22:00.228760+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2122316002.1753022465.257150>],dropped}
2023-06-28 22:39:48.010084+00:00 [error] <0.160.0> External eimp process (pid=379215) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010359+00:00 [error] <0.162.0> External eimp process (pid=379216) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.010574+00:00 [error] <0.156.0> External eimp process (pid=379213) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:48.013799+00:00 [error] <0.158.0> External eimp process (pid=379214) has terminated unexpectedly, restarting in a few seconds
2023-06-28 22:39:56.499998+00:00 [error] <0.122.0>     application_master: shutdown_error
    ejabberd_app: {prep_stop,[[]]}
    error_info: {timeout,
                    {gen_server,call,
                        [ejabberd_hooks,
                         {delete,config_reloaded,global,ejabberd_listener,
                             config_reloaded,50}]}}
2023-06-28 22:40:21.604890+00:00 [error] <0.503.0>@gen_server:error_info/8:1402 ** Generic server memsup terminating 
** Last message in was {'EXIT',<0.504.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
                         {items,{"Memory Usage",
                                 [{"Allocated",7658790912},
                                  {"Total",8311390208}]}},
                         {items,{"Worst Memory User",
                                 [{"Pid",<0.181.0>},{"Memory",3101864}]}}]
** Reason for termination ==
** {port_died,normal}

2023-06-28 22:40:21.606676+00:00 [error] <0.503.0>@proc_lib:crash_report/4:539 CRASH REPORT:
  crasher:
    initial call: memsup:init/1
    pid: <0.503.0>
    registered_name: memsup
    exception exit: {port_died,normal}
      in function  gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
    ancestors: [os_mon_sup,<0.501.0>]
    message_queue_len: 1
    messages: [time_to_collect]
    links: [<0.502.0>]
    dictionary: [{system_memory_high_watermark,set}]
    trap_exit: true
    status: running
    heap_size: 6772
    stack_size: 28
    reductions: 26790
  neighbours:

2023-06-28 22:40:21.611440+00:00 [error] <0.502.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
    supervisor: {local,os_mon_sup}
    errorContext: child_terminated
    reason: {port_died,normal}
    offender: [{pid,<0.503.0>},
               {id,memsup},
               {mfargs,{memsup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]

2023-06-28 22:40:23.634790+00:00 [error] <0.88.0>@supervisor:do_terminate/2:905 SUPERVISOR REPORT:
    supervisor: {local,sasl_safe_sup}
    errorContext: shutdown_error
    reason: killed
    offender: [{pid,<0.89.0>},
               {id,alarm_handler},
               {mfargs,{alarm_handler,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]
badlop commented 1 year ago

crashes after a few minutes

I guess this error reproduces quite easily: you start ejabberd, and it soon crashes itself. Does it wait always the same time to crash (5 minutes, 15 minutes, ...?), or it seems random?

When this happens, is the CPU or RAM usage too high, or are they reasonably low?

Those questions hopefully will indicate if the crash is provoked by some script/configuration, or if it's an event triggered by the server circumstances.

jaymzh commented 1 year ago

Sometimes it's 30 seconds. Sometimes it's a minute or two. Sometimes 5 or more minutes. However, it's USUALLY 1 minute or so. Here's some data:

2023-06-30T16:07:11.554359+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:26:47.646615+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:28:09.117252+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:29:15.369275+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:30:36.581464+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:31:27.860877+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:32:49.032843+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:33:55.414993+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:35:01.470237+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:36:12.992789+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:42:25.841659+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:43:31.990521+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:45:03.705825+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:46:13.981039+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:20.277436+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:47:56.758296+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:06.612181+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:54:57.758462+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:04.118489+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:56:55.469798+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:58:32.802496+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T16:59:39.079769+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:00.454025+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:01:51.818705+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:02:28.562413+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:03:41.236860+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:04:47.763861+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).
2023-06-30T17:05:53.794003+00:00 virt systemd[1]: Started ejabberd.service - robust, scalable and extensible realtime platform (XMPP server + MQTT broker + SIP service).

You can see there 19 minutes between the first two, then just over a minute, then roughly every minute for a while, then 7 minutes...

I can't find any indications of running out of RAM or CPU. Load is between 0.03 and 0.08. There's 3G of ram free. There's no OOM killer events.

csrado commented 1 year ago

Hello I see very same behaviour my site

Environment 2core / 8gb ram Ubuntu 22.04.2 LTS

2:135 Cannot ping watchdog: {badmatch,{error,eacces}}

2023-08-01 06:17:52.706699+00:00 [error] <0.347.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}
2023-08-01 06:18:04.010466+00:00 [notice] <0.66.0> SIGTERM received - shutting down

2023-08-01 06:18:04.010748+00:00 [error] <0.155.0> External eimp process (pid=32894) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.010871+00:00 [error] <0.153.0> External eimp process (pid=32893) has terminated unexpectedly, restarting in a few seconds
2023-08-01 06:18:04.021968+00:00 [error] <0.347.0>@ejabberd_systemd:handle_cast/2:122 Cannot send STOPPING=1 notification: {badmatch,{error,eacces}}
2023-08-01 06:18:04.022624+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at 10.254.208.140:5222 for ejabberd_c2s
2023-08-01 06:18:04.023243+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5223 for ejabberd_c2s
2023-08-01 06:18:04.023854+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5269 for ejabberd_s2s_in
2023-08-01 06:18:04.024326+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:5270 for ejabberd_s2s_in
2023-08-01 06:18:04.024799+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5443 for ejabberd_http
2023-08-01 06:18:04.025289+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TLS connections at [::]:5280 for ejabberd_http
2023-08-01 06:18:04.021529+00:00 [error] <0.490.0>@gen_server:error_info/7:949 ** Generic server memsup terminating 
** Last message in was {'EXIT',<0.491.0>,{port_died,normal}}
** When Server state == [{data,[{"Timeout",60000}]},
                         {items,{"Memory Usage",
                                 [{"Allocated",2591793152},
                                  {"Total",8323547136}]}},
                         {items,{"Worst Memory User",
                                 [{"Pid",<0.184.0>},{"Memory",55128}]}}]
** Reason for termination ==
** {port_died,normal}

2023-08-01 06:18:04.025770+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting UDP connections at [::]:3479 for ejabberd_stun
2023-08-01 06:18:04.026101+00:00 [info] <0.120.0>@ejabberd_listener:stop_listener/3:402 Stop accepting TCP connections at [::]:1883 for mod_mqtt
2023-08-01 06:18:04.028122+00:00 [error] <0.490.0>@proc_lib:crash_report/4:525 CRASH REPORT:
  crasher:
    initial call: memsup:init/1
    pid: <0.490.0>
    registered_name: memsup
    exception exit: {port_died,normal}
      in function  gen_server:handle_common_reply/8 (gen_server.erl, line 811)
    ancestors: [os_mon_sup,<0.488.0>]
    message_queue_len: 0
    messages: []
    links: [<0.489.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 29
    reductions: 78671
  neighbours:

2023-08-01 06:18:04.041527+00:00 [error] <0.489.0>@supervisor:do_restart/3:736 SUPERVISOR REPORT:
    supervisor: {local,os_mon_sup}
    errorContext: child_terminated
    reason: {port_died,normal}
    offender: [{pid,<0.490.0>},
               {id,memsup},
               {mfargs,{memsup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,2000},
               {child_type,worker}]

2023-08-01 06:18:04.078386+00:00 [info] <0.120.0>@ejabberd_app:stop/1:111 ejabberd 21.12-1 is stopped in the node ejabberd@localhost
badlop commented 1 year ago

Cannot ping watchdog: {badmatch,{error,eacces}}

Your log starts with that line, that previous cases didn't have. In the erlang documentation it says:

eacces: Missing search or write permissions for the parent directories of Dir.

jaymzh commented 1 year ago

@badlop - any thoughts on the original issue? The recurring crashes are really frustrating.

prefiks commented 1 year ago

Disable watchdog in ejabberd.service, it's probably systemd killing it because ejabberd not being able to communicate with watchdog.

csrado commented 1 year ago

I tried to play around ... install, remove, different version. Once was okay, error not appeared again. When I tried to reproduce again (installed 23.04) meaning setup all environment after install again just failed and getting same error.

when started via ejabberdctl start as root all is okay - no error.

When starting as service or systemctl

systemctl start ejabberd

getting

2023-08-03 16:03:25.269923+00:00 [error] <0.437.0>@ejabberd_systemd:handle_info/2:135 Cannot ping watchdog: {badmatch,{error,eacces}}

some permission issue, trying to identify but cannot see ...

csrado commented 1 year ago

Also see when use systemctl start ejabberd it just hangs, not release prompt

when do ctrl-c will get prompt

root@ub01-ds4:/etc/systemd/system# ps -ef | grep ejabberd root 56685 8641 0 16:00 pts/3 00:00:00 tail -f ejabberd.log ejabberd 56738 1 0 16:02 ? 00:00:00 /bin/sh /opt/ejabberd-23.04/bin/ejabberdctl foreground ejabberd 56749 56738 12 16:02 ? 00:00:18 /opt/ejabberd-23.04/erts-13.2/bin/beam.smp -K true -P 250000 -- -root /opt/ejabberd-23.04 -bindir /opt/ejabberd-23.04/erts-13.2/bin -progname erl -- -home /opt/ejabberd -- -sname ejabberd@localhost -boot ../releases/23.4.0/start_clean -boot_var RELEASE_LIB ../lib -mnesia dir "/opt/ejabberd/database/ejabberd@localhost" -s ejabberd -noshell -noinput ejabberd 56762 1 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/epmd -daemon ejabberd 56766 56749 0 16:02 ? 00:00:00 erl_child_setup 65536 ejabberd 56785 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56786 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/eimp-1.0.22/priv/bin/eimp ejabberd 56787 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/epam-1.0.12/priv/bin/epam ejabberd 56788 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56789 56788 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/erts-13.2/bin/inet_gethost 4 ejabberd 56790 56766 0 16:02 ? 00:00:00 /opt/ejabberd-23.04/lib/os_mon-2.8.1/priv/bin/memsup postgres 56792 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41238) idle postgres 56793 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41242) idle postgres 56794 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41250) idle postgres 56795 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41262) idle postgres 56796 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41268) idle postgres 56797 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41280) idle postgres 56798 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41294) idle postgres 56799 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41310) idle postgres 56800 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41328) idle postgres 56801 37580 0 16:02 ? 00:00:00 postgres: 14/main: ejabberd ejabberd 127.0.0.1(41326) idle root 56849 9138 0 16:04 pts/7 00:00:00 grep --color=auto ejabberd

jaymzh commented 1 year ago

To be clear here, the ORIGINAL issue is, I think, unrelated to the watchdog. I no indication the watchdog is unreachable in my logs.

jaymzh commented 1 year ago

@weiss that Connor had to do with the watchdog, and while @csrado output was the watchdog, mine (the original bug report) doesn't mention the watchdog. Do you think this commit fixes my/the original bug?

weiss commented 1 year ago

@jaymzh ah you're right, that's most probably unrelated (and @csrado's issue sounds like yet another thing, related to the watchdog but not to my fix).

I'll reopen your issue, sorry.

prefiks commented 1 year ago

It looks like this is also probably also from being killed by systemd (all those ports dying looks like external apps run as services getting killed, and because it looks like that those are all the drivers that we start, i would guess that systemd killing process spawned by ejabberd.service are most likely what did that).

weiss commented 1 year ago

@jaymzh, if you'd like to test that fix, you could make a backup of the file

/usr/lib/x86_64-linux-gnu/ejabberd-23.01-1/ebin/ejabberd_systemd.beam

and then replace it with this one. Restart ejabberd and see whether you can still reproduce the problem. (It's built for Erlang/OTP 25.x, so it won't necessarily work for others.)

jaymzh commented 1 year ago

OK, did that. I'll let you know how it goes...

jaymzh commented 1 year ago

It's still crashing.

BUT... I just realized, error.log has nothing of use in it - the logs I copied and pasted above, while from that date, don't seem to be the actual reason it's crashing, as that day was the only day it was logged. error.log only has this after that:

2023-06-30 18:11:51.850714+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2524952854.1901395970.103039>],dropped}
2023-07-01 22:37:47.048831+00:00 [error] <0.365.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.4180380941.2434859010.238508>],dropped}
2023-07-04 21:11:23.334292+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2021114025.138215425.185961>],dropped}
2023-07-05 20:04:17.233415+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1227170254.3035430913.28121>],dropped}
2023-07-05 21:43:06.903055+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.3898408711.634716163.122272>],dropped}
2023-07-14 15:36:32.890722+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.1932254270.3659857923.247972>],dropped}
2023-07-27 22:00:42.434301+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2250007605.1785790465.240268>],dropped}
2023-08-08 20:51:28.751275+00:00 [error] <0.368.0>@ejabberd_systemd:handle_info/2:139 Got unexpected info: {[alias|#Ref<0.2106956634.3650682881.157422>],dropped}

But that doesn't seem to be the issue since it fails every few minutes and that's only every few days.

In ejabberd.log, things appear to be progressing normally... logs about local routes then suddenly it's starting again:

2023-08-09 17:17:46.540490+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:324 Local route:
...
2023-08-09 17:17:46.540539+00:00 [debug] <0.539.0>@ejabberd_s2s:start_connection/3:362 Finding connection for {<<"ipom.com">>,<<"gmail.com">>}
2023-08-09 17:17:46.540567+00:00 [debug] <0.539.0>@ejabberd_s2s:choose_pid/2:412 Using ejabberd_s2s_out <0.542.0>
2023-08-09 17:17:46.540592+00:00 [debug] <0.539.0>@ejabberd_s2s:route/1:329 Sending to process <0.542.0>
2023-08-09 17:18:19.529779+00:00 [info] <0.122.0>@ejabberd_config:load/1:82 Loading configuration from /etc/ejabberd/ejabberd.yml
2023-08-09 17:18:19.539887+00:00 [notice] <0.122.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'debug'
2023-08-09 17:18:19.718257+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_mam is recommended for module mod_muc but is not found in the config
2023-08-09 17:18:19.724383+00:00 [warning] <0.122.0>@gen_mod:warn_soft_dep_fail/2:582 Module mod_client_state is recommended for module mod_push_keepalive but is not found in the config
2023-08-09 17:18:19.730664+00:00 [info] <0.122.0>@ejabberd_config:load/1:89 Configuration loaded successfully
2023-08-09 17:18:19.763154+00:00 [debug] <0.73.0>@supervisor:report_progress/2:1557 PROGRESS REPORT:
...
prefiks commented 1 year ago

Could you also check syslog/kernlog to see if there is no mentions on beam process?

weiss commented 1 year ago

And the journalctl -u ejabberd output should include a line such as the following if it's related to the watchdog:

ejabberd.service: Watchdog timeout (limit 30s)!
jaymzh commented 1 year ago

Oh yeah, in syslog:

2023-08-09T17:26:36.040824+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:27:42.290679+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:05.541971+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:29:56.790673+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:31:18.042185+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:32:09.290852+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:33:30.540562+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!
2023-08-09T17:57:30.041086+00:00 virt systemd[1]: ejabberd.service: Watchdog timeout (limit 30s)!

That isn't in the journal though, fwiw. I think that's just because I have debug enabled so the journal has cycled around.

weiss commented 1 year ago

Do you get output from grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log? (That's just to double-check the patch is in effect.)

jaymzh commented 1 year ago

Yup:

# grep 'ejabberd_systemd:maybe_start_timer' ejabberd.log | tail -n 10
2023-08-09 17:25:20.903814+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:26:42.195441+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:27:48.360714+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:29:11.645013+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:30:02.955898+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:31:24.152766+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:32:15.387876+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:33:36.599136+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 17:57:36.100613+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
2023-08-09 18:11:00.430281+00:00 [info] <0.365.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
weiss commented 1 year ago

Thanks, could you grep for ejabberd_systemd (with loglevel set to debug) and show all output between two crashes?

jaymzh commented 1 year ago

in ejabberd.log?

2023-08-09 18:42:50.883885+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:42:51.200821+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:43:05.884344+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:20.885429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:43:57.141885+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:43:57.143461+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:43:57.143534+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:43:57.469138+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:44:12.144167+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:44:48.312132+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:44:48.313653+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:44:48.313731+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:44:48.600797+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:45:03.314257+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:18.315235+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:45:33.316238+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:09.602756+00:00 [debug] <0.368.0>@ejabberd_systemd:init/1:81 Got NOTIFY_SOCKET: /run/systemd/notify
2023-08-09 18:46:09.604392+00:00 [debug] <0.368.0>@ejabberd_systemd:get_watchdog_interval/0:155 Watchdog interval: 15000000 microseconds
2023-08-09 18:46:09.604471+00:00 [info] <0.368.0>@ejabberd_systemd:maybe_start_timer/1:164 Watchdog notifications enabled
              {id,ejabberd_systemd},
              {mfargs,{ejabberd_systemd,start_link,[]}},
2023-08-09 18:46:09.898492+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: READY=1
2023-08-09 18:46:24.605393+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:39.606416+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:46:54.607429+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
2023-08-09 18:47:14.288452+00:00 [debug] <0.368.0>@ejabberd_systemd:notify/2:181 Notifying systemd: WATCHDOG=1
weiss commented 1 year ago

Okay, thank you. So at some point ejabberd simply stops pinging the watchdog indeed.

As @prefiks said above, you can just disable the systemd watchdog for the moment to make the issue go away.

It seems you can reproduce the issue more easily than others, so I might come back with more debugging wishes, but for the next two weeks I'm on vacation.

jaymzh commented 1 year ago

Ah missed that. Disabled the watchdog with an override file. Thanks!

weiss commented 1 year ago

Did you run an older version of ejabberd with a similar load before, without running into that issue?

jaymzh commented 1 year ago

My 'load' is 1 client ;)

And yeah, it was fine for the last 10 years belly the upgrade.

weiss commented 11 months ago

My 'load' is 1 client ;)

And the OS is bored as well?

Either way, seems you can reproduce the issue better than others, who run into it only very occasionally. Do you happen to be up to some debugging, such as me giving you another patched ejabberd_systemd.beam file for testing, and/or running some calls within an ejabberdctl debug shell while reproducing the issue? If so, you could ping me by email (or XMPP): holger@jabber.fu-berlin.de (and we could then figure out some way to chat directly).

jaymzh commented 11 months ago

Yeah happy to - dropped in an updated ejabberd_systemd.beam from you above already, so happy to do more. I'll reach out.