processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/
Other
6k stars 1.5k forks source link

[BUG] MS SQL database backend reason: {port_exit,memory_allocation_failed} with Erlang >25.3 #4178

Closed sando38 closed 1 month ago

sando38 commented 3 months ago

Environment

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml

loglevel: info

Errors from error.log/crash.log

``` 2024-03-16T20:39:12.6496590Z 2024-03-16 20:32:36.414879+00:00 [info] <0.423.0>@gen_mod:start_modules/0:141 Loading modules for 8310292872-12.example.com 2024-03-16T20:39:12.6498391Z 2024-03-16 20:32:36.449292+00:00 [info] <0.374.0>@ejabberd_mnesia:create/2:270 Creating Mnesia ram table 'pubsub_last_item' 2024-03-16T20:39:12.6500208Z 2024-03-16 20:32:36.461769+00:00 [info] <0.374.0>@ejabberd_mnesia:create/2:270 Creating Mnesia ram table 'mod_register_ip' 2024-03-16T20:39:12.6502116Z 2024-03-16 20:32:36.488646+00:00 [info] <0.608.0>@mod_stun_disco:parse_listener/1:626 Going to offer STUN service: 10.244.3.7:3478 (udp) 2024-03-16T20:39:12.6504254Z 2024-03-16 20:32:36.551778+00:00 [info] <0.374.0>@ejabberd_mnesia:create/2:270 Creating Mnesia ram table 'muc_online_room' 2024-03-16T20:39:12.6506198Z 2024-03-16 20:32:36.643316+00:00 [error] <0.649.0> ODBC: received unexpected info: {tcp_closed,#Port<0.91>} 2024-03-16T20:39:12.6507055Z 2024-03-16T20:39:12.6508629Z 2024-03-16 20:32:36.643508+00:00 [error] <0.632.0>@ejabberd_sql:check_error/2:1404 SQL query 'select name, opts from muc_room where host='conference.8310292872-12.example.com'' failed: connection_closed 2024-03-16T20:39:12.6511449Z 2024-03-16 20:32:36.643598+00:00 [error] <0.632.0>@ejabberd_sql:check_error/2:1397 SQL query 'Q80627668' at {mod_muc_sql,{293,23}} failed: connection_closed 2024-03-16T20:39:12.6513413Z 2024-03-16 20:32:36.643986+00:00 [error] <0.649.0>@gen_server:error_info/8:1391 ** Generic server <0.649.0> terminating 2024-03-16T20:39:12.6514552Z ** Last message in was {#Port<0.69>,{exit_status,2}} 2024-03-16T20:39:12.6515218Z ** When Server state == {state,#Port<0.69>, 2024-03-16T20:39:12.6515798Z {<0.632.0>, 2024-03-16T20:39:12.6516387Z #Ref<0.3142729776.3845652482.250765>}, 2024-03-16T20:39:12.6517084Z <0.632.0>,undefined,on,false,false,off, 2024-03-16T20:39:12.6517768Z connected,undefined,0, 2024-03-16T20:39:12.6518400Z [#Port<0.67>,#Port<0.68>], 2024-03-16T20:39:12.6519032Z #Port<0.90>,#Port<0.91>} 2024-03-16T20:39:12.6519621Z ** Reason for termination == 2024-03-16T20:39:12.6520117Z ** {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6520518Z 2024-03-16T20:39:12.6521166Z 2024-03-16 20:32:36.645543+00:00 [error] <0.649.0>@proc_lib:crash_report/4:584 CRASH REPORT: 2024-03-16T20:39:12.6522026Z crasher: 2024-03-16T20:39:12.6522380Z initial call: odbc:init/1 2024-03-16T20:39:12.6522837Z pid: <0.649.0> 2024-03-16T20:39:12.6523221Z registered_name: [] 2024-03-16T20:39:12.6523768Z exception exit: {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6524678Z in function gen_server:handle_common_reply/8 (gen_server.erl, line 1226) 2024-03-16T20:39:12.6525483Z ancestors: [odbc_sup,<0.344.0>] 2024-03-16T20:39:12.6525987Z message_queue_len: 1 2024-03-16T20:39:12.6526524Z messages: [{'EXIT',#Port<0.69>,normal}] 2024-03-16T20:39:12.6527072Z links: [<0.632.0>,<0.345.0>] 2024-03-16T20:39:12.6527535Z dictionary: [] 2024-03-16T20:39:12.6527907Z trap_exit: true 2024-03-16T20:39:12.6528287Z status: running 2024-03-16T20:39:12.6528658Z heap_size: 6772 2024-03-16T20:39:12.6529036Z stack_size: 28 2024-03-16T20:39:12.6529418Z reductions: 19518 2024-03-16T20:39:12.6529805Z neighbours: 2024-03-16T20:39:12.6530031Z 2024-03-16T20:39:12.6530830Z 2024-03-16 20:32:36.649964+00:00 [warning] <0.632.0>@ejabberd_sql:handle_reconnect/2:491 mssql connection failed: 2024-03-16T20:39:12.6531919Z ** Reason: {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6532510Z ** Retry after: 5 seconds 2024-03-16T20:39:12.6533452Z 2024-03-16 20:32:36.649979+00:00 [error] <0.345.0>@supervisor:do_restart/3:759 SUPERVISOR REPORT: 2024-03-16T20:39:12.6534377Z supervisor: {local,odbc_sup} 2024-03-16T20:39:12.6534891Z errorContext: child_terminated 2024-03-16T20:39:12.6535462Z reason: {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6536057Z offender: [{pid,<0.649.0>}, 2024-03-16T20:39:12.6536516Z {id,[]}, 2024-03-16T20:39:12.6537002Z {mfargs,{odbc,start_link_sup,undefined}}, 2024-03-16T20:39:12.6537632Z {restart_type,temporary}, 2024-03-16T20:39:12.6538183Z {significant,false}, 2024-03-16T20:39:12.6538677Z {shutdown,7000}, 2024-03-16T20:39:12.6539153Z {child_type,worker}] 2024-03-16T20:39:12.6539496Z 2024-03-16T20:39:12.6540870Z 2024-03-16 20:32:36.720319+00:00 [warning] <0.675.0>@ejabberd_captcha:get_prog_name/0:444 The option captcha_cmd is not configured, but some module wants to use the CAPTCHA feature. 2024-03-16T20:39:12.6543495Z 2024-03-16 20:32:36.754287+00:00 [info] <0.190.0>@ejabberd_cluster_mnesia:wait_for_sync/1:123 Waiting for Mnesia synchronization to complete 2024-03-16T20:39:12.6546970Z 2024-03-16 20:32:36.772555+00:00 [warning] <0.511.0>@ejabberd_pkix:log_warnings/1:393 Invalid certificate in /opt/ejabberd/certs/custom-cert/cert.pem: at line 1: unused private key 2024-03-16T20:39:12.6549459Z 2024-03-16 20:32:36.793597+00:00 [warning] <0.511.0>@ejabberd_pkix:check_domain_certfiles/1:312 No certificate found matching 8310292872-12.example.com 2024-03-16T20:39:12.6551565Z 2022024-03-16 20:32:36.643316+00:00 [error] <0.649.0> ODBC: received unexpected info: {tcp_closed,#Port<0.91>} 2024-03-16T20:39:12.6552430Z 2024-03-16T20:39:12.6553964Z 2024-03-16 20:32:36.643508+00:00 [error] <0.632.0>@ejabberd_sql:check_error/2:1404 SQL query 'select name, opts from muc_room where host='conference.8310292872-12.example.com'' failed: connection_closed 2024-03-16T20:39:12.6556589Z 2024-03-16 20:32:36.643598+00:00 [error] <0.632.0>@ejabberd_sql:check_error/2:1397 SQL query 'Q80627668' at {mod_muc_sql,{293,23}} failed: connection_closed 2024-03-16T20:39:12.6558551Z 2024-03-16 20:32:36.643986+00:00 [error] <0.649.0>@gen_server:error_info/8:1391 ** Generic server <0.649.0> terminating 2024-03-16T20:39:12.6559688Z ** Last message in was {#Port<0.69>,{exit_status,2}} 2024-03-16T20:39:12.6560349Z ** When Server state == {state,#Port<0.69>, 2024-03-16T20:39:12.6560925Z {<0.632.0>, 2024-03-16T20:39:12.6561508Z #Ref<0.3142729776.3845652482.250765>}, 2024-03-16T20:39:12.6562205Z <0.632.0>,undefined,on,false,false,off, 2024-03-16T20:39:12.6562880Z connected,undefined,0, 2024-03-16T20:39:12.6563515Z [#Port<0.67>,#Port<0.68>], 2024-03-16T20:39:12.6564143Z #Port<0.90>,#Port<0.91>} 2024-03-16T20:39:12.6564731Z ** Reason for termination == 2024-03-16T20:39:12.6565221Z ** {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6565620Z 2024-03-16T20:39:12.6566287Z 2024-03-16 20:32:36.645543+00:00 [error] <0.649.0>@proc_lib:crash_report/4:584 CRASH REPORT: 2024-03-16T20:39:12.6567211Z crasher: 2024-03-16T20:39:12.6567598Z initial call: odbc:init/1 2024-03-16T20:39:12.6568080Z pid: <0.649.0> 2024-03-16T20:39:12.6568491Z registered_name: [] 2024-03-16T20:39:12.6569067Z exception exit: {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6570030Z in function gen_server:handle_common_reply/8 (gen_server.erl, line 1226) 2024-03-16T20:39:12.6570912Z ancestors: [odbc_sup,<0.344.0>] 2024-03-16T20:39:12.6571486Z message_queue_len: 1 2024-03-16T20:39:12.6572147Z messages: [{'EXIT',#Port<0.69>,normal}] 2024-03-16T20:39:12.6572771Z links: [<0.632.0>,<0.345.0>] 2024-03-16T20:39:12.6573276Z dictionary: [] 2024-03-16T20:39:12.6573697Z trap_exit: true 2024-03-16T20:39:12.6574131Z status: running 2024-03-16T20:39:12.6574544Z heap_size: 6772 2024-03-16T20:39:12.6574968Z stack_size: 28 2024-03-16T20:39:12.6575383Z reductions: 19518 2024-03-16T20:39:12.6575809Z neighbours: 2024-03-16T20:39:12.6576051Z 2024-03-16T20:39:12.6576798Z 2024-03-16 20:32:36.649979+00:00 [error] <0.345.0>@supervisor:do_restart/3:759 SUPERVISOR REPORT: 2024-03-16T20:39:12.6577739Z supervisor: {local,odbc_sup} 2024-03-16T20:39:12.6578269Z errorContext: child_terminated 2024-03-16T20:39:12.6578835Z reason: {port_exit,memory_allocation_failed} 2024-03-16T20:39:12.6579427Z offender: [{pid,<0.649.0>}, 2024-03-16T20:39:12.6579877Z {id,[]}, 2024-03-16T20:39:12.6580372Z {mfargs,{odbc,start_link_sup,undefined}}, 2024-03-16T20:39:12.6580992Z {restart_type,temporary}, 2024-03-16T20:39:12.6581555Z {significant,false}, 2024-03-16T20:39:12.6582045Z {shutdown,7000}, 2024-03-16T20:39:12.6582536Z {child_type,worker}] 2024-03-16T20:39:12.6583207Z 2024-03-16T20:39:12.6584421Z 4-03-16 20:32:36.793707+00:00 [warning] <0.511.0>@ejabberd_pkix:check_domain_certfiles/1:312 No certificate found matching pubsub.8310292872-12.example.com 2024-03-16T20:39:12.6587024Z 2024-03-16 20:32:36.793821+00:00 [warning] <0.511.0>@ejabberd_pkix:check_domain_certfiles/1:312 No certificate found matching upload.8310292872-12.example.com 2024-03-16T20:39:12.6589450Z 2024-03-16 20:32:36.793968+00:00 [warning] <0.511.0>@ejabberd_pkix:check_domain_certfiles/1:312 No certificate found matching conference.8310292872-12.example.com 2024-03-16T20:39:12.6592258Z 2024-03-16 20:32:36.794082+00:00 [warning] <0.511.0>@ejabberd_pkix:check_domain_certfiles/1:312 No certificate found matching proxy.8310292872-12.example.com 2024-03-16T20:39:12.6595095Z 2024-03-16 20:32:36.795105+00:00 [info] <0.190.0>@ejabberd_app:start/2:63 ejabberd 24.2.0 is started in the node 'ejabberd@ejabberd-0.ejabberd-headless.default.svc.cluster.local' in 1.31s 2024-03-16T20:39:12.6597822Z 2024-03-16 20:32:36.795411+00:00 [info] <0.672.0>@ejabberd_listener:init/4:163 Start accepting TCP connections at 10.244.3.7:7777 for mod_proxy65_stream 2024-03-16T20:39:12.6600095Z 2024-03-16 20:32:36.795486+00:00 [info] <0.505.0>@ejabberd_listener:init/4:163 Start accepting TCP connections at 127.0.0.1:5281 for ejabberd_http 2024-03-16T20:39:12.6602346Z 2024-03-16 20:32:36.795559+00:00 [info] <0.506.0>@ejabberd_listener:init/4:163 Start accepting TCP connections at [::]:5222 for ejabberd_c2s 2024-03-16T20:39:12.6604556Z 2024-03-16 20:32:36.795579+00:00 [info] <0.508.0>@ejabberd_listener:init/4:163 Start accepting TLS connections at [::]:5443 for ejabberd_http 2024-03-16T20:39:12.6607033Z 2024-03-16 20:32:36.795590+00:00 [info] <0.510.0>@ejabberd_listener:init/4:129 Start accepting UDP connections at [::]:3478 for ejabberd_stun 2024-03-16T20:39:12.6609256Z 2024-03-16 20:32:36.795567+00:00 [info] <0.507.0>@ejabberd_listener:init/4:163 Start accepting TCP connections at [::]:5280 for ejabberd_http 2024-03-16T20:39:12.6611427Z 2024-03-16 20:32:36.795583+00:00 [info] <0.509.0>@ejabberd_listener:init/4:163 Start accepting TCP connections at [::]:5269 for ejabberd_s2s_in 2024-03-16T20:39:12.6613508Z 2024-03-16 20:32:37.706391+00:00 [info] <0.505.0>@ejabberd_listener:accept/7:344 (<0.745.0>) Accepted connection 127.0.0.1:47566 -> 127.0.0.1:5281 ```

Bug description

I started updating the ejabberd helm chart to newest 24.02. While doing that and updating the corresponding container image I found that there is a problem with MS SQL. Therefore, I tested the official ghcr.io/processone/ejabberd:24.02 as well to see if it depends on my custom image, but it does not (see error message above).

reason: {port_exit,memory_allocation_failed}

What solved the issue was reverting the container build from erlang 26.2 / elixir 16.1 to erlang 25.3 / elixir 14.4 . I haven't tested anything in between.

It most likely is unrelated to Alpine packages, because I also tried it with a custom Debian based variant with the same combination erlang 26.2 / elixir 16.1 .

sando38 commented 3 months ago

@badlop as discussed

badlop commented 3 months ago

Erlang versions affected

I've tested compilation of ejabberd master git from source code with different Erlang/OTP versions and the results are:

Same problem with the ghcr.io/processone/ejabberd:latest container image.


Problematic commit

Using git bisect on erlang, I found that the problematic commit is https://github.com/erlang/otp/commit/53ef5df40c733ce3d8215c5c98805f99f378f656

Using erlang OTP-27.0-rc1 and reverting that commit, ejabberd works perfectly with mssql.


Minimal reproduction steps

Right now the only way to reproduce the problem is using ejabberd... I found no way to provoke the problem in odbc without using ejabberd.


Hypothesis

Looking at https://erlang.org/documentation/doc-15.0-rc2/lib/odbc-2.14.3/doc/html/getting_started.html#using-the-erlang-api

Trying to find some reproduction steps that wouldn't involve ejabberd:

Setup:

cat >"/tmp/odbcinst.ini" <<-'EOF'
[ODBC]
Driver = libtdsodbc.so
EOF

This works correctly and returns {updated,undefined}:

erl

os:putenv("ODBCSYSINI", "/tmp/").
odbc:start().
{ok, Ref} = odbc:connect("DRIVER=ODBC;SERVER=mssql;DATABASE=ejabberd_test;UID=ejabberd_test;PWD=ejabberd_Test1;PORT=1433;CLIENT_CHARSET=UTF-8;", []).

erlang:process_info(Ref).

odbc:sql_query(Ref, "CREATE TABLE EMPLOYEE (NR integer, FIRSTNAME  char varying(20), LASTNAME  char varying(20), GENDER char(1), PRIMARY KEY(NR))").

odbc:describe_table(Ref, "EMPLOYEE").

Interestingly, if we provoke some error in erlang, the odbc query fails with {error,connection_closed}:

erl

os:putenv("ODBCSYSINI", "/tmp/").
odbc:start().
{ok, Ref} = odbc:connect("DRIVER=ODBC;SERVER=mssql;DATABASE=ejabberd_test;UID=ejabberd_test;PWD=ejabberd_Test1;PORT=1433;CLIENT_CHARSET=UTF-8;", []).

erlang:process_info(Ref).
odbc:describe_table(Ref, "EMPLOYEE").

aa:bb().

erlang:process_info(Ref).
odbc:describe_table(Ref, "EMPLOYEE").

One wild hypothesis: maybe, when using erlang 26.0-rc2 or higher, ejabberd provokes some error that closes the odbc connection. That error is not provoked with older erlang versions. And the error itself is being silenced, so we are not aware of it.

petertirrell commented 1 month ago

I think I am experiencing this exact same issue, on ejabberd 24.02. We're using the OTP that came with that. Is the current solution to try and load and use a specific working version of OTP?

badlop commented 1 month ago

I think I am experiencing this exact same issue, on ejabberd 24.02. We're using the OTP that came with that.

Aha, the binary installers and containers for ejabberd 24.02 include Erlang/OTP 26.2, and are affected by this problem.

Is the current solution to try and load and use a specific working version of OTP?

Yes, either using Erlang 26.0-rc2 or older...

Or compile Erlang/OTP from source with a small patch that revers the problematic commit.

I've applied that solution in branch 24.02-otp-patch-mssql. This is just ejabberd 24.02 with a small patch for Erlang 26.2 that reverts the problematic commit.

There are available binary installers and containers. I've tried the amd64 container and seems to work: at least it starts, connects to MSSQL and I can register an account.

badlop commented 1 month ago

As a temporary workaround, the ejabberd binary installers now include a patch that reverts the problematic commit in Erlang/OTP.