processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

ejabberdctl restart sometimes fails #4137

Open nandlab opened 6 months ago

nandlab commented 6 months ago

Happy New Year!

Environment

Errors from error.log/crash.log

No errors

ejabberd.log

2023-12-30 03:23:04.611133+01:00 [info] <0.2280.0>@ejabberd_config:load/1:82 Loading configuration from /opt/ejabberd/conf/ejabberd.yml
2023-12-30 03:23:05.926536+01:00 [info] <0.2280.0>@ejabberd_config:load/1:89 Configuration loaded successfully
2023-12-30 03:23:06.333227+01:00 [info] <0.2537.0>@ejabberd_systemd:init/1:94 Got no NOTIFY_SOCKET, notifications disabled
2023-12-30 03:23:06.664343+01:00 [info] <0.2536.0>@gen_mod:start_modules/0:141 Loading modules for zauber.dynv6.net
2023-12-30 03:23:06.887456+01:00 [info] <0.2631.0>@mod_stun_disco:parse_listener/1:616 Going to offer STUN/TURN service: a.b.c.d:3478 (udp)
2023-12-30 03:23:07.338042+01:00 [info] <0.2689.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while

Bug description

After running sudo ejabberdctl restart, ejabberd wrote the log shown above and crashed. The server was unavailable for the rest of that day. There were no error messages, and I could not find any information about what went wrong.

Later, after a manual sudo service ejabberd start, ejabberd started normally and wrote the following lines to the log:

[...]
2024-01-01 07:52:48.997734+01:00 [info] <0.591.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while
2024-01-01 07:52:49.288450+01:00 [info] <0.179.0>@ejabberd_cluster_mnesia:wait_for_sync/1:123 Waiting for Mnesia synchronization to complete
2024-01-01 07:52:49.351047+01:00 [warning] <0.472.0>@ejabberd_pkix:log_warnings/1:393 Invalid certificate in /etc/letsencrypt/live/zauber.dynv6.net/fullchain.pem: at line 58: certificate is signed by unknown CA
2024-01-01 07:52:49.460733+01:00 [info] <0.179.0>@ejabberd_app:start/2:63 ejabberd 23.10.0 is started in the node ejabberd@localhost in 3.41s

Why would sudo ejabberdctl restart bring the server down? What could make ejabberd hang or crash while "Building MQTT cache"?
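For anyone hitting the same silent failure, a few generic checks may help narrow it down. This is only a sketch: the paths follow the binary-installer layout from this report (/opt/ejabberd), and the crash-dump location shown is an assumption.

```sh
# Check whether the node is really down, look for an Erlang crash dump,
# and try starting attached to the terminal to see where it hangs.
sudo ejabberdctl status
ls -l /opt/ejabberd/erl_crash.dump 2>/dev/null   # assumed dump location; adjust to your installation
sudo ejabberdctl foreground                      # run ejabberd in the foreground with console output
```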

licaon-kter commented 6 months ago

Is this with loglevel: info ?

Can you try this fix https://github.com/processone/ejabberd/issues/4109#issuecomment-1779209127 ?

nandlab commented 6 months ago

Yes, I had loglevel: info. I have now changed it to loglevel: debug.
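For reference, the log level can also be raised without editing the file and restarting; a small sketch using standard ejabberdctl commands (the running node must be reachable):

```sh
sudo ejabberdctl set_loglevel debug   # raise verbosity on the running node
# or, after changing loglevel in ejabberd.yml:
sudo ejabberdctl reload_config
```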

nandlab commented 6 months ago

I am closing this as a duplicate for now. If the issue still persists after the new commits, I will reopen it.

nandlab commented 6 months ago

I set EJABBERD_OPTS="" in ejabberdctl.cfg, but ejabberd still crashed on startup.
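For context, ejabberdctl.cfg is a shell fragment sourced by the ejabberdctl script, so this change amounts to a plain variable assignment. A quick way to confirm the override is in place and retest (sketch; the conf path is the one from this report's installer layout):

```sh
grep -n '^EJABBERD_OPTS' /opt/ejabberd/conf/ejabberdctl.cfg   # confirm the override is present
sudo ejabberdctl restart                                      # try to reproduce with the changed options
sleep 20 && sudo ejabberdctl status                           # did the node come back up?
```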

ejabberd.yml

```yaml
include_config_file:
  /var/www/html/ipa.yml:
    # 'SELF_IPV4A'
    # 'SELF_IPV6A'
    allow_only:
      - "define_macro"
  /opt/ejabberd/conf/ejabberd-secret-credentials.yml:
    # 'SQL_PASSWORD'
    allow_only:
      - "define_macro"

hosts:
  - zauber.dynv6.net

hide_sensitive_log_data: true

loglevel: debug

sql_type: mysql
sql_server: "localhost"
sql_database: "ejabberd"
sql_username: "ejabberd"
sql_password: 'SQL_PASSWORD'
sql_prepared_statements: true

auth_method: [sql]
default_db: sql
# update_sql_schema: true

certfiles:
  - /etc/letsencrypt/live/zauber.dynv6.net/fullchain.pem
  - /etc/letsencrypt/live/zauber.dynv6.net/privkey.pem

listen:
  - port: 5222
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls_required: true
  - port: 5223
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    tls: true
  - port: 5269
    ip: "::"
    module: ejabberd_s2s_in
    max_stanza_size: 524288
    shaper: s2s_shaper
  - port: 5443
    ip: "::"
    module: ejabberd_http
    tls: true
    request_handlers:
      /admin: ejabberd_web_admin
      /api: mod_http_api
      /bosh: mod_bosh
      /captcha: ejabberd_captcha
      /upload: mod_http_upload
      /ws: ejabberd_http_ws
  - port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /admin: ejabberd_web_admin
      # /.well-known/acme-challenge: ejabberd_acme
  - port: 3478
    ip: "::"
    transport: udp
    module: ejabberd_stun
    use_turn: true
    ## The server's public IPv4 address:
    turn_ipv4_address: 'SELF_IPV4A'
    ## The server's public IPv6 address:
    # turn_ipv6_address: "2001:db8::3"
  - port: 1883
    ip: "::"
    module: mod_mqtt
    backlog: 1000

acme:
  auto: false

s2s_use_starttls: optional

captcha_cmd: /usr/local/bin/captcha.sh
captcha_url: auto

acl:
  admin:
    user: admin1@zauber.dynv6.net
  local:
    user_regexp: ""
  loopback:
    ip:
      - 127.0.0.0/8
      - ::1/128

access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback

api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      access:
        allow:
          - acl: loopback
          - acl: admin
      oauth:
        scope: "ejabberd:admin"
        access:
          allow:
            - acl: loopback
            - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who:
      ip: 127.0.0.1/8
    what:
      - status
      - connected_users_number

shaper:
  normal:
    rate: 3000
    burst_size: 20000
  fast: 100000

shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    100: all
  c2s_shaper:
    none: admin
    normal: all
  s2s_shaper: fast

modules:
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce
  mod_avatar: {}
  mod_blocking: {}
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  mod_fail2ban: {}
  mod_http_api: {}
  mod_http_upload:
    put_url: https://@HOST@:5443/upload
    custom_headers:
      "Access-Control-Allow-Origin": "https://@HOST@"
      "Access-Control-Allow-Methods": "GET,HEAD,PUT,OPTIONS"
      "Access-Control-Allow-Headers": "Content-Type"
  mod_last: {}
  mod_mam:
    db_type: sql
    assume_mam_usage: true
    default: always
  mod_mqtt: {}
  mod_muc:
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: muc_create
    access_persistent: muc_create
    access_mam:
      - allow
    default_room_options:
      mam: true
  mod_muc_admin: {}
  mod_offline:
    access_max_user_messages: max_user_offline_messages
  mod_ping: {}
  mod_privacy: {}
  mod_private: {}
  mod_proxy65:
    access: local
    max_connections: 5
  mod_pubsub:
    access_createnode: pubsub_createnode
    plugins:
      - flat
      - pep
    force_node_config:
      ## Avoid buggy clients to make their bookmarks public
      storage:bookmarks:
        access_model: whitelist
  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    captcha_protected: true
  mod_roster:
    versioning: true
  mod_s2s_dialback: {}
  mod_shared_roster: {}
  mod_stream_mgmt:
    resend_on_timeout: if_offline
  mod_stun_disco: {}
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false
```
ejabberd.log ends with

```
2024-01-04 03:17:15.185020+01:00 [info] <0.1870.0>@mod_mqtt:init_topic_cache/2:644 Building MQTT cache for zauber.dynv6.net, this may take a while
2024-01-04 03:17:15.186906+01:00 [debug] <0.1790.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,ejabberd_gen_mod_sup}
    started: [{pid,<0.1870.0>}, {id,'mod_mqtt_zauber.dynv6.net'}, {mfargs,{gen_server,start_link, [{local,'mod_mqtt_zauber.dynv6.net'}, mod_mqtt, [<<"zauber.dynv6.net">>, #{cache_size => 1000,max_queue => 5000, cache_missed => true,db_type => sql, queue_type => ram, cache_life_time => 3600000, use_cache => true,ram_db_type => mnesia, max_topic_depth => 8, match_retained_limit => 1000, session_expiry => 300000, access_publish => [], access_subscribe => [], max_topic_aliases => 100}], []]}}, {restart_type,transient}, {significant,false}, {shutdown,60000}, {child_type,worker}]
2024-01-04 03:17:15.187762+01:00 [debug] <0.1717.0>@gen_mod:start_module/4:164 Loading mod_proxy65 at zauber.dynv6.net
2024-01-04 03:17:15.208536+01:00 [debug] <0.1743.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,ejabberd_listener}
    started: [{pid,<0.1874.0>}, {id,{7777,{192,168,178,55},tcp}}, {mfargs, {ejabberd_listener,start, [{7777,{192,168,178,55},tcp}, mod_proxy65_stream, #{access => local,name => <<"SOCKS5 Bytestreams">>, port => 7777,recbuf => 65536,send_timeout => 15000, sndbuf => 65536,hosts => [], host => <<"proxy.zauber.dynv6.net">>, ip => {192,168,178,55}, supervisor => true,hostname => undefined, backlog => 128,unix_socket => #{},shaper => none, transport => tcp,max_connections => 5, ram_db_type => mnesia,accept_interval => 0, use_proxy_protocol => false, server_host => <<"zauber.dynv6.net">>, vcard => undefined,auth_type => anonymous}]}}, {restart_type,transient}, {significant,false}, {shutdown,brutal_kill}, {child_type,worker}]
2024-01-04 03:17:15.209713+01:00 [debug] <0.1758.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,ejabberd_backend_sup}
    started: [{pid,<0.1875.0>}, {id,mod_proxy65_mnesia}, {mfargs,{mod_proxy65_mnesia,start_link,[]}}, {restart_type,transient}, {significant,false}, {shutdown,5000}, {child_type,worker}]
2024-01-04 03:17:15.210396+01:00 [debug] <0.1877.0>@ejabberd_router:register_route/4:174 Route registered: proxy.zauber.dynv6.net
2024-01-04 03:17:15.210786+01:00 [debug] <0.1876.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,'ejabberd_mod_proxy65_zauber.dynv6.net'}
    started: [{pid,<0.1877.0>}, {id,mod_proxy65_service}, {mfargs, {mod_proxy65_service,start_link,[<<"zauber.dynv6.net">>]}}, {restart_type,transient}, {significant,false}, {shutdown,5000}, {child_type,worker}]
2024-01-04 03:17:15.211253+01:00 [debug] <0.1790.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,ejabberd_gen_mod_sup}
    started: [{pid,<0.1876.0>}, {id,'ejabberd_mod_proxy65_zauber.dynv6.net'}, {mfargs,{mod_proxy65,start_link,[<<"zauber.dynv6.net">>]}}, {restart_type,transient}, {significant,false}, {shutdown,infinity}, {child_type,supervisor}]
2024-01-04 03:17:15.211649+01:00 [debug] <0.1717.0>@supervisor:report_progress/2:1565 PROGRESS REPORT:
    supervisor: {local,ejabberd_sup}
    started: [{pid,<0.1790.0>}, {id,ejabberd_gen_mod_sup}, {mfargs,{gen_mod,start_link,[]}}, {restart_type,permanent}, {significant,false}, {shutdown,infinity}, {child_type,supervisor}]
```
nandlab commented 6 months ago

I noticed this only happens sometimes with ejabberdctl restart. ejabberd starts fine with service ejabberd restart, which internally runs ejabberdctl stop followed by ejabberdctl start.
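Until the underlying cause is found, a possible workaround building on that observation (just a sketch, not an official fix) is to emulate restart as an explicit stop followed by start:

```sh
#!/bin/sh
# Hypothetical restart wrapper: do explicitly what "service ejabberd restart" does.
set -e
sudo ejabberdctl stop
sudo ejabberdctl stopped    # block until the node has actually gone down
sudo ejabberdctl start
sudo ejabberdctl started    # block until the node reports it is up
sudo ejabberdctl status
```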

badlop commented 5 months ago

I can reproduce the problem (or at least what I think is the same problem) in one of every two or three restarts.

The problem appears when using any binary installer: master branch, 23.10, 23.04, 22.05.

When running ejabberdctl restart, the functions called are:

It is very easy to reproduce:

The problem does not appear when using the container, or when ejabberd is compiled from source with Erlang 26.1 (the version used in the installers). In those cases, ejabberdctl restart successfully restarts ejabberd every time.
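A rough way to exercise this is a restart loop (sketch; it assumes ejabberdctl from one of the affected binary installers is on PATH and that a failed restart leaves the node down):

```sh
# Restart repeatedly and report which iterations leave ejabberd down.
for i in $(seq 1 10); do
    sudo ejabberdctl restart
    sleep 30
    sudo ejabberdctl status > /dev/null 2>&1 || echo "restart $i left ejabberd down"
done
```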

nosnilmot commented 5 months ago

Is this seen with OpenSSL 3, and when no dhfile is configured for listeners in ejabberd.yml? If so, this may be fixed by https://github.com/processone/fast_tls/pull/63
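To answer that kind of question on a given installation, one can check the configuration and attach to the node. A sketch; note that crypto:info_lib/0 reports what Erlang's crypto application is linked against, which is only a rough indicator, since fast_tls links OpenSSL separately:

```sh
grep -n 'dhfile' /opt/ejabberd/conf/ejabberd.yml || echo "no dhfile configured"
sudo ejabberdctl debug     # attach an Erlang shell to the running node
# then, inside the Erlang shell:
#   crypto:info_lib().
# Detach with Ctrl-C twice; do not run q(), which would stop the node.
```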

badlop commented 5 months ago

The problem is reproducible when using the binary installers (23.10, older releases, and also master). Those installers are built by tools/make-binaries, which downloads and compiles OpenSSL 1.1.1w:

https://github.com/processone/ejabberd/blob/66d701e788af7d1f86c253481ab2d8411bc92a3c/tools/make-binaries#L73

As you mentioned, fast_tls got some fixes in git; they are now tagged and used by ejabberd. However, installers built with the latest fast_tls still show the same problem.

I guess it's worth trying to update binary installers to OpenSSL 3, but unfortunately I get compilation problems in that case.
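The change being discussed would look roughly like this; the exact variable/line in tools/make-binaries and the script's invocation are assumptions here (see the line linked above for the real definition):

```sh
# Bump the pinned OpenSSL version used by the installer build, then rebuild.
sed -i 's/1\.1\.1w/3.2.1/' tools/make-binaries   # assumes the version string appears literally in the script
tools/make-binaries                              # actual prerequisites/arguments per the script itself
```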

Neustradamus commented 5 months ago

@badlop: The OpenSSL 1.1.1 branch is now EOL, so please look at moving to OpenSSL 3.2.x.

badlop commented 4 months ago

The newest ejabberd installers use OpenSSL 3.2.1 (since 1962fc88d62ede7676bf7f91f72094ef3f105839) and fast_tls 1.1.18, which includes the fix https://github.com/processone/fast_tls/commit/da16622da621eb4d74f047e8ac376d701782007b. Those installers can be downloaded from https://github.com/processone/ejabberd/actions/runs/7844499079

However, the problem mentioned in this issue is still present.
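For completeness, a quick way to confirm which OpenSSL and fast_tls a given installation actually bundles (same attach approach as in the earlier sketch; the expected values are the ones stated in this comment):

```sh
sudo ejabberdctl debug
# inside the Erlang shell:
#   crypto:info_lib().                    % expected to mention OpenSSL 3.2.1 for the new installers
#   application:get_key(fast_tls, vsn).   % expected {ok,"1.1.18"}
# Detach with Ctrl-C twice.
```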