processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Ejabberd 23.04 randomly crashes without generating error logs and crash dumps #4048

Closed 519790441 closed 1 year ago

519790441 commented 1 year ago

Environment

Configuration (only if needed): `grep -Ev '^$|^\s*#' ejabberd.yml`

```yaml
loglevel: info
hosts: xxxxxx
certfiles: yyyyyy
listen:
  -
    port: 5222
    ip: "::"
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls_required: true
  -
    port: 5223
    ip: "::"
    tls: true
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
    starttls_required: true
  -
    port: 5269
    ip: "::"
    module: ejabberd_s2s_in
    max_stanza_size: 524288
  -
    port: 5443
    ip: "::"
    module: ejabberd_http
    tls: true
    request_handlers:
      /admin: ejabberd_web_admin
      /api: mod_http_api
      /bosh: mod_bosh
      /captcha: ejabberd_captcha
      /upload: mod_http_upload
      /ws: ejabberd_http_ws
  -
    port: 5280
    ip: "::"
    module: ejabberd_http
    request_handlers:
      /admin: ejabberd_web_admin
      /.well-known/acme-challenge: ejabberd_acme
  -
    port: 3478
    ip: "::"
    transport: udp
    module: ejabberd_stun
    use_turn: true
  -
    port: 1883
    ip: "::"
    module: mod_mqtt
    backlog: 1000
s2s_use_starttls: optional
acl:
  local:
    user_regexp: ""
  loopback:
    ip:
      - 127.0.0.0/8
      - ::1/128
access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback
api_permissions:
  "console commands":
    from:
      - ejabberd_ctl
    who: all
    what: "*"
  "admin access":
    who:
      access:
        allow:
          - acl: loopback
          - acl: admin
      oauth:
        scope: "ejabberd:admin"
        access:
          allow:
            - acl: loopback
            - acl: admin
    what:
      - "*"
      - "!stop"
      - "!start"
  "public commands":
    who:
      ip: 127.0.0.1/8
    what:
      - status
      - connected_users_number
shaper:
  normal:
    rate: 3000
    burst_size: 20000
  fast: 100000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: admin
    100: all
  c2s_shaper:
    none: admin
    normal: all
  s2s_shaper: fast
auth_method: external
extauth_program: "/opt/ejabberd/auth.php"
auth_use_cache: false
modules:
  mod_adhoc: {}
  mod_admin_extra: {}
  mod_announce:
    access: announce
  mod_avatar: {}
  mod_blocking: {}
  mod_bosh: {}
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state: {}
  mod_configure: {}
  mod_disco: {}
  mod_http_api: {}
  mod_http_upload:
    put_url: https://@HOST@:5443/upload
    custom_headers:
      "Access-Control-Allow-Origin": "https://@HOST@"
      "Access-Control-Allow-Methods": "GET,HEAD,PUT,OPTIONS"
      "Access-Control-Allow-Headers": "Content-Type"
  mod_last: {}
  mod_mam:
    assume_mam_usage: true
    default: always
  mod_mqtt: {}
  mod_muc:
    access:
      - allow
    access_admin:
      - allow: admin
    access_create: muc_create
    access_persistent: muc_create
    access_mam:
      - allow
    default_room_options:
      mam: true
  mod_muc_admin: {}
  mod_offline:
    access_max_user_messages: max_user_offline_messages
  mod_ping: {}
  mod_privacy: {}
  mod_private: {}
  mod_proxy65:
    access: local
    max_connections: 5
  mod_pubsub:
    access_createnode: pubsub_createnode
    plugins:
      - flat
      - pep
    force_node_config:
      storage:bookmarks:
        access_model: whitelist
  mod_push: {}
  mod_push_keepalive: {}
  mod_register:
    ip_access: trusted_network
  mod_roster:
    versioning: true
  mod_s2s_dialback: {}
  mod_shared_roster: {}
  mod_stream_mgmt:
    resend_on_timeout: if_offline
  mod_stun_disco: {}
  mod_vcard: {}
  mod_vcard_xupdate: {}
  mod_version:
    show_os: false
```

Errors from error.log/crash.log

The crash.log file is not generated. The only entries in error.log are some 'ejabberd_acme:issue_request/7:246 Failed to request certificate for XXXXXX' messages.
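For reference, whether the BEAM writes a crash dump at all is controlled by environment variables that ejabberdctl exports. A sketch of forcing dumps on in ejabberdctl.cfg (the dump path is an assumption for a /opt/ejabberd-style install):

```shell
# ejabberdctl.cfg fragment (sketch; the dump path below is an assumption).
# ERL_CRASH_DUMP sets where the VM writes the dump; ERL_CRASH_DUMP_SECONDS=-1
# lets the VM take as long as it needs to finish writing (0 disables dumps).
ERL_CRASH_DUMP=/opt/ejabberd/logs/erl_crash.dump
ERL_CRASH_DUMP_SECONDS=-1
export ERL_CRASH_DUMP ERL_CRASH_DUMP_SECONDS
```

Note that if the VM is killed from outside (e.g. with SIGKILL), no dump can be written regardless of these settings.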

Bug description


I have been struggling with this issue for the past few months.

A few years ago, I started running the official binary version of ejabberd 19.05 in an official CentOS 7.4.1708 Docker container. Started with `ejabberdctl foreground`, it has been working well, with around 2000 ejabberd users on average.

Recently, business needs made it necessary to change the ejabberd 19.05 source code. So I compiled Erlang/OTP 21 and the ejabberd 19.05 sources and, to establish a stable baseline, first ran tests without modifying the ejabberd 19.05 source code at all.

At this point the random crashes appeared, and for the first time a crash dump was generated, beginning with the following content.

```
=erl_crash_dump:0.5
Tue Mar 28 23:46:19 2023
Slogan: Failed to read from erl_child_setup: 104
System version: Erlang/OTP 21 [erts-10.3.5.19] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]
Compiled: Tue Mar 21 06:50:27 2023
Taints:
Atoms: 6651
Calling Thread: scheduler:1
=scheduler:1
Scheduler Sleep Info Flags:
Scheduler Sleep Info Aux Work:
Current Port: #Port<0.0>
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 1
Run Queue Flags: UNKNOWN(170917904) | OUT_OF_WORK | HALFTIME_OUT_OF_WORK | NONEMPTY | EXEC
Current Process:
=scheduler:2
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Port:
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
Current Process:
=dirty_cpu_scheduler:3
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_cpu_scheduler:4
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_cpu_run_queue
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
=dirty_io_scheduler:5
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:6
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:7
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:8
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:9
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:10
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:11
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:12
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:13
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_scheduler:14
Scheduler Sleep Info Flags: SLEEPING | TSE_SLEEPING | WAITING
Scheduler Sleep Info Aux Work:
Current Process:
=dirty_io_run_queue
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
=memory
total: 409545304
processes: 202068880
processes_used: 202052336
system: 207476424
atom: 194313
atom_used: 171370
binary: 122728
code: 2719364
ets: 328184
=hash_table:atom_tab
size: 4813
used: 3567
objs: 6650
depth: 7
=index_table:atom_tab
```

Several more random crashes occurred later, without generating a crash dump file or any related logs. So I switched to compiling Erlang/OTP 25 and ejabberd 23.04 from source, which showed the same issue (no crash dump file generated).

Today I switched to the official binary installation package of ejabberd 23.04, which shows the same issue (no crash dump file generated), and the crashes are occurring more frequently. `dmesg -T` shows there has been no OOM killer activity.
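An external kill leaves no Erlang-side trace, so the kernel log is the right place to look; besides OOM kills, a segfault of beam.smp would also show up there. A sketch of what to grep for (the sample lines below are fabricated for illustration):

```shell
# Fabricated dmesg-style lines showing what an OOM kill or a beam.smp
# segfault would look like; on the real host, pipe `dmesg -T` (or
# `journalctl -k`) through the same grep instead of the sample.
sample_log='[Tue May  9 10:00:01 2023] Out of memory: Killed process 1234 (beam.smp)
[Tue May  9 10:00:02 2023] beam.smp[1234]: segfault at 0 ip 00007f0000000000 sp 00007f0000000000 error 4
[Tue May  9 10:00:03 2023] eth0: link up'
# Count matching lines (prints 2 for the sample above):
echo "$sample_log" | grep -Eci 'out of memory|oom-killer|segfault'
```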

Can you suggest the next direction for investigation? As soon as I switch back to the official binary version of ejabberd 19.05, everything works fine. But business needs force me to modify the ejabberd source code.

badlop commented 1 year ago

I am not fluent enough with crash dumps to get a clue from this one.

And you have already tried most of the ideas I would have suggested. I'll give some other ideas; let's hope one of them fits your use case and gives positive results.

Compile your custom ejabberd source code with the same Erlang version that was included in the installer (it seems to be Erlang/OTP 21.3).

Then in your stable ejabberd 19.05 server, copy the files that you customized (probably a few *.beam files) to overwrite the installed ones.

That way, the server runs on the Erlang virtual machine that is known to be stable.
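The overwrite step might look like this sketch (all paths and the module name are assumptions, for a typical rebar3 build and a /opt installer layout):

```shell
SRC=_build/default/lib/ejabberd/ebin               # rebar3 build output (assumed layout)
DST=/opt/ejabberd-19.05/lib/ejabberd-19.05/ebin    # installer's ebin dir (assumed path)
for beam in mod_mychange.beam; do                  # mod_mychange is a hypothetical module
  echo cp "$SRC/$beam" "$DST/"                     # drop 'echo' to actually copy
done
```

Restart ejabberd afterwards so the replaced .beam files are loaded.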

I started running the official binary version of ejabberd 19.05 in the official docker container of Centos 7.4.1708.

You can try the ecs container image: https://hub.docker.com/r/ejabberd/ecs/tags?page=1&name=19.05

If that is stable, then you can regenerate the image with your custom ejabberd source code, changing only which ejabberd source code the Dockerfile uses, and that should be stable too.

The docker-ejabberd repository doesn't have a 19.05 tag, but the image was quite probably built from this commit: https://github.com/processone/docker-ejabberd/commit/97dc39d9be16b9ba3617a23d1293d82235ca0af9

Or you can check out the master branch and try to build with:

./build.sh 19.05

That way, if the original container image is stable, you are now generating a custom container image using the exact same method (only the source code changed).

519790441 commented 1 year ago

@badlop Last week I switched to OTP 24.3 and ejabberd 22.10 (both compiled from source) and stopped adjusting the time zone inside the Docker container; so far I have seen no random crashes. Before this change, it crashed at least once every day or two. I'll report the rest later. Notably, the original Docker containers running the official ejabberd 19.05 installation package did not adjust the time zone, while the containers with the random-crash issue reported earlier did.
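If the time-zone adjustment really is the trigger, one alternative worth testing (an assumption: the image honors the TZ variable, as glibc-based CentOS images do) is to set the zone per container through the environment instead of replacing /etc/localtime:

```shell
# e.g. start the container as:  docker run -e TZ=Asia/Shanghai ...
# TZ affects each process independently without touching /etc/localtime,
# as this demonstrates (prints the zone abbreviation for the given TZ):
TZ=UTC date +%Z
```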

mremond commented 1 year ago

Thanks, it seems solved then. Please comment back if you still have issues.