processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Memory leak #1575

Closed DeerBeer closed 7 years ago

DeerBeer commented 7 years ago

What version of ejabberd are you using?

16.04.99

What operating system (version) are you using?

CentOS Linux release 7.2.1511

How did you install ejabberd (source, package, distribution)?

source

What did not work as expected? Are there error messages in the log? What was the unexpected behavior? What was the expected result?

We are running two clustered servers, deployed in Docker containers. We have a memory issue: memory consumption is growing constantly.

Here are the system parameters:

> erlang:memory().
[{total,16869880},
 {processes,3848056},
 {processes_used,3847032},
 {system,13021824},
 {atom,202481},
 {atom_used,185860},
 {binary,150000},
 {code,4413424},
 {ets,259320}]

It seems the leak is in ejabberd's system allocation: it is not being freed, and 13G of 17G is used by the system.

erlang:system_info(allocated_areas) hangs; it doesn't return anything.

We tried running GC manually, but it didn't help.

Do you have any advice for resolving this?

Thank you, Laslo

zinid commented 7 years ago

erlang:memory() reports memory in bytes, so total memory is only about 16 MB according to your output.
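To make the unit explicit, the byte counts can be converted directly in the shell. This is a minimal sketch that relies only on erlang:memory/0; the ToMiB helper is a name introduced here for illustration and is not part of OTP or ejabberd:

```erlang
%% erlang:memory/0 returns a list of {Type, Bytes} pairs.
%% ToMiB is an illustrative helper, not part of ejabberd or OTP.
ToMiB = fun(Bytes) -> Bytes / (1024 * 1024) end.

%% Every memory category of the node the shell is evaluated on, in MiB.
[{Type, ToMiB(Bytes)} || {Type, Bytes} <- erlang:memory()].

%% The total and system figures from the report above:
%% roughly 16.1 MiB and 12.4 MiB, not gigabytes.
ToMiB(16869880).
ToMiB(13021824).
```

Read this way, the reported figures describe a nearly idle VM.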

DeerBeer commented 7 years ago

Huh, I didn't know that, but that makes it even stranger. The memory situation is:

ps -C beam.smp -o rss
RSS
23456
20566460
2132

beam.smp is using 20G+:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
96601 root 20 0 42.579g 0.019t 2272 S 110.5 82.7 12348:41 beam.smp
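ps reports RSS in KiB, so 20566460 is about 19.6 GiB, which agrees with the 0.019t that top shows. One way to check whether a shell is attached to that large process is to ask the VM for its own OS PID and resident size, then compare them with the ps output. This is a sketch that assumes a Linux ps accepting `-o rss=` and `-p`; the variable names are illustrative:

```erlang
%% RSS from the ps output above, converted from KiB to GiB (about 19.6).
20566460 / (1024 * 1024).

%% OS process ID of the VM this shell is evaluated in, as a string.
Pid = os:getpid().

%% Resident set size of that same OS process, in KiB, as seen by ps.
[RssKiB | _] = string:tokens(os:cmd("ps -o rss= -p " ++ Pid), " \n").
Rss = list_to_integer(RssKiB) * 1024.

%% What the VM itself believes it has allocated, in bytes.
VmTotal = erlang:memory(total).

%% If Pid is not the PID of the 20G beam.smp, or Rss and VmTotal are both
%% small, the statistics come from a different VM.
{Pid, Rss, VmTotal}.
```

Inside a container the PID usually differs from the one the host's top shows, so compare values taken in the same PID namespace.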

zinid commented 7 years ago

Hello, sorry for the late reply. It seems you're connecting to the wrong node to gather the statistics. Make sure you're connected correctly (via ejabberdctl debug), clone the recon utility, compile it, put the produced *.beam files into ejabberd's ebin directory, and type the following commands in the remote Erlang shell:

> recon_alloc:memory(allocated). %% Should match what OS reports in `top`
> recon_alloc:memory(used).
> recon_alloc:memory(usage).
> recon_alloc:memory(allocated_types).
> recon_alloc:fragmentation(max).
> recon_alloc:fragmentation(current).

Put the output here. Also report if allocated doesn't match OS's values.

DeerBeer commented 7 years ago

@zinid Here is the full report:

Eshell V7.0 (abort with ^G)
(490c021d-debug-ejabberd@prpxmpp06)1> recon_alloc:memory(allocated).
24908216
(490c021d-debug-ejabberd@prpxmpp06)2> recon_alloc:memory(used).
17190672
(490c021d-debug-ejabberd@prpxmpp06)3> recon_alloc:memory(usage).
0.6615199597854774
(490c021d-debug-ejabberd@prpxmpp06)4> recon_alloc:memory(allocated_types).
[{binary_alloc,1245712},
 {driver_alloc,197136},
 {eheap_alloc,2884112},
 {ets_alloc,721424},
 {fix_alloc,197136},
 {ll_alloc,19399136},
 {sl_alloc,197136},
 {std_alloc,721424},
 {temp_alloc,393576}]
(490c021d-debug-ejabberd@prpxmpp06)5> recon_alloc:fragmentation(max).
[{{ll_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.8334587278075255}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,11361416}, {mbcs_carriers_size,13631648}]},
 {{eheap_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.4678741684580279}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,1103936}, {mbcs_carriers_size,2359472}]},
 {{binary_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.3401197137026296}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,557312}, {mbcs_carriers_size,1638576}]},
 {{ll_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.011724590364019162}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,12296}, {mbcs_carriers_size,1048736}]},
 {{binary_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.543315659450645}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,890264}, {mbcs_carriers_size,1638576}]},
 {{driver_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.14155932203389832}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,83520}, {mbcs_carriers_size,590000}]},
 {{std_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.23769491525423728}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,140240}, {mbcs_carriers_size,590000}]},
 {{sl_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.31924067796610167}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,188352}, {mbcs_carriers_size,590000}]},
 {{ets_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.3242305084745763}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,191296}, {mbcs_carriers_size,590000}]},
 {{ll_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.9361288747533246}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,4417360}, {mbcs_carriers_size,4718752}]},
 {{eheap_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.13250991155840194}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,34760}, {mbcs_carriers_size,262320}]},
 {{eheap_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.14050015248551387}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,36856}, {mbcs_carriers_size,262320}]},
 {{temp_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.031282395268004144}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,4104}, {mbcs_carriers_size,131192}]},
 {{temp_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.07652905664979572}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,10040}, {mbcs_carriers_size,131192}]},
 {{std_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{sl_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{fix_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{ets_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{driver_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{binary_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,...}]},
 {{temp_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.49960363436794925}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,...}, {...}]},
 {{fix_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.012296079863647431}, {sbcs_block_size,0}, {sbcs_carriers_size,...}, {...}|...]},
 {{ets_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.32286340394448504}, {sbcs_block_size,...}, {...}|...]},
 {{std_alloc,2}, [{sbcs_usage,1.0},{mbcs_usage,...},{...}|...]},
 {{fix_alloc,2},[{sbcs_usage,...},{...}|...]},
 {{driver_alloc,2},[{...}|...]},
 {{sl_alloc,...},[...]}]
(490c021d-debug-ejabberd@prpxmpp06)6> recon_alloc:fragmentation(current).
[{{ll_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.8319586890741311}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,11340968}, {mbcs_carriers_size,13631648}]},
 {{eheap_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.40405989136552584}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,953368}, {mbcs_carriers_size,2359472}]},
 {{binary_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.0184297057852189}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,20536}, {mbcs_carriers_size,1114288}]},
 {{ll_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,7.933359777865926e-4}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,832}, {mbcs_carriers_size,1048736}]},
 {{std_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.2239864406779661}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,132152}, {mbcs_carriers_size,590000}]},
 {{ets_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.3242305084745763}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,191296}, {mbcs_carriers_size,590000}]},
 {{ll_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.9344165575982802}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,4409280}, {mbcs_carriers_size,4718752}]},
 {{eheap_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,262320}]},
 {{eheap_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.12619701128392802}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,33104}, {mbcs_carriers_size,262320}]},
 {{temp_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,131192}]},
 {{temp_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,131192}]},
 {{temp_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,131192}]},
 {{std_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{sl_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{fix_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{ets_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{driver_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{binary_alloc,1}, [{sbcs_usage,1.0}, {mbcs_usage,0.0}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,0}, {mbcs_carriers_size,65712}]},
 {{sl_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.0018261504747991235}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,120}, {mbcs_carriers_size,65712}]},
 {{sl_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.007913318724129535}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,520}, {mbcs_carriers_size,...}]},
 {{fix_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.012296079863647431}, {sbcs_block_size,0}, {sbcs_carriers_size,0}, {mbcs_block_size,...}, {...}]},
 {{driver_alloc,0}, [{sbcs_usage,1.0}, {mbcs_usage,0.023618212807401995}, {sbcs_block_size,0}, {sbcs_carriers_size,...}, {...}|...]},
 {{driver_alloc,2}, [{sbcs_usage,1.0}, {mbcs_usage,0.08607255904553202}, {sbcs_block_size,...}, {...}|...]},
 {{std_alloc,2}, [{sbcs_usage,1.0},{mbcs_usage,...},{...}|...]},
 {{ets_alloc,0},[{sbcs_usage,...},{...}|...]},
 {{fix_alloc,2},[{...}|...]},
 {{binary_alloc,...},[...]}]

zinid commented 7 years ago

What, 25 MB allocated? Even a freshly started ejabberd without any connections consumes more than that on my machine. Docker is probably doing something weird.

DeerBeer commented 7 years ago

@zinid Yes, very strange, and in reality it is consuming 20G right now.

zinid commented 7 years ago

Type the following in the Erlang shell:

> application:which_applications().

DeerBeer commented 7 years ago

[{stdlib,"ERTS CXC 138 10","2.5"}, {kernel,"ERTS CXC 138 10","4.0"}]

zinid commented 7 years ago

This means ejabberd is not running on that node. The output should look like this:

> application:which_applications().
[{ejabberd,"ejabberd","17.03.beta-43"},
 {inets,"INETS  CXC 138 49","6.3.3"},
 {iconv,"Fast encoding conversion library for Erlang / Elixir",
        "1.0.3"},
 {esip,"ProcessOne SIP server component in Erlang","1.0.10"},
 {mnesia,"MNESIA  CXC 138 12","4.14.1"},
 {cache_tab,"In-memory cache Erlang / Elixir library",
            "1.0.6"},
 {xmpp,"Erlang/Elixir XMPP parsing and serialization library",
       "1.1.8"},
 {stringprep,"Fast Stringprep Erlang / Elixir implementation",
             "1.0.7"},
 {p1_utils,"Erlang utility modules from ProcessOne","1.0.7"},
 {fast_xml,"Fast Expat-based Erlang / Elixir XML parsing library",
           "1.1.21"},
 {fast_tls,"TLS / SSL OpenSSL-based native driver for Erlang / Elixir",
           "1.0.10"},
 {fast_yaml,"Fast YAML native library for Erlang / Elixir",
            "1.0.8"},
 {ssl,"Erlang/OTP SSL application","8.0.3"},
 {public_key,"Public key infrastructure","1.2"},
 {asn1,"The Erlang ASN1 compiler version 4.0.4","4.0.4"},
 {sasl,"SASL  CXC 138 11","3.0.1"},
 {crypto,"CRYPTO","3.7.1"},
 {lager,"Erlang logging framework","3.2.1"},
 {goldrush,"Erlang event stream processor","0.1.8"},
 {compiler,"ERTS  CXC 138 10","7.0.2"},
 {syntax_tools,"Syntax tools","2.1"},
 {stdlib,"ERTS  CXC 138 10","3.1"},
 {kernel,"ERTS  CXC 138 10","5.1"}]

DeerBeer commented 7 years ago

ps -ef | grep ejab
root 49 1 88 Mar02 ? 14:56:58 /usr/lib/erlang/erts-7.0/bin/beam.smp -K true -P 250000 -- -root /usr/lib/erlang -progname erl -- -home /root -- -sname ejabberd -noshell -noinput -noshell -noinput -mnesia dir "//var/lib/ejabberd" -kernel inet_dist_listen_min 4200 inet_dist_listen_max 4210 -ejabberd log_rate_limit 100 log_rotate_count 7 log_rotate_date "$D0" -s ejabberd -smp auto start

DeerBeer commented 7 years ago

[root@prpxmpp06 ejabberd]# ejabberdctl list_cluster
ejabberd@prpxmpp07
ejabberd@prpxmpp06
[root@prpxmpp06 ejabberd]# ejabberdctl debug

IMPORTANT: we will attempt to attach an INTERACTIVE shell to an already running ejabberd node. If an ERROR is printed, it means the connection was not successful. You can interact with the ejabberd node if you know how to use it. Please be extremely cautious with your actions, and exit immediately if you are not completely sure.

To detach this shell from ejabberd, press: control+c, control+c

To bypass permanently this warning, add to ejabberdctl.cfg the line: EJABBERD_BYPASS_WARNINGS=true
Press return to continue

Eshell V7.0 (abort with ^G)
(e38524b4-debug-ejabberd@prpxmpp06)1> application:which_applications().
[{stdlib,"ERTS CXC 138 10","2.5"}, {kernel,"ERTS CXC 138 10","4.0"}]

zinid commented 7 years ago

Well, I really don't know what you're connecting to. This is also a separate issue, unrelated to memory consumption. Probably someone else will handle that.
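One way to see which node a shell is actually evaluating on is to ask it directly. This is a sketch using only OTP calls; the node name 'ejabberd@prpxmpp06' is taken from the list_cluster output above, and the rpc call assumes the debug node shares the target's cookie and can reach it:

```erlang
%% Name of the node where these expressions are evaluated.
%% With a working remote shell this is normally the ejabberd node,
%% not the *-debug-* node.
node().

%% Nodes this one is currently connected to.
nodes().

%% true only if the ejabberd application runs on the evaluating node.
lists:keymember(ejabberd, 1, application:which_applications()).

%% Ask the target node explicitly; returns {badrpc, Reason} if unreachable.
rpc:call('ejabberd@prpxmpp06', application, which_applications, []).
```

If the last call lists ejabberd while the local which_applications() does not, the expressions are being evaluated on the debug node rather than on the server.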

DeerBeer commented 7 years ago

An update on this issue: we figured out that our custom module is causing the memory leak. It is a module for push notifications that uses the cache_tab component, so maybe that is causing the leak. When a client connects, it sends its push token and I insert it into the cache; the key is the resource and the value is the token. When the user disconnects, I store that value in a database table holding user-resource-token combinations and delete it from cache_tab. When the user comes online, I delete the DB record. If someone sends a message to the user, I read all of the user's current offline tokens from the DB and send a push to them.

Can I get any advice on where I need to be careful when using cache_tab or working with the DB so as not to cause memory leaks?

cromain commented 7 years ago

When using cache_tab, you should configure the max_size and life_time options. Read, for example, the mod_caps code to understand how it works.
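A minimal sketch of what such a configuration might look like, modeled on how mod_caps creates its table. The table name push_tokens, the limits, and the example key and token are hypothetical, and the cache_tab:new/2, insert/4 and delete/3 calls should be checked against the cache_tab version you build with:

```erlang
%% Create the cache once, e.g. when the module starts.
%% max_size bounds the number of cached entries; life_time is in seconds.
%% Both values below are placeholders to tune for your deployment.
cache_tab:new(push_tokens,
              [{max_size, 100000},
               {life_time, 86400}]).

%% On connect: cache the token under the resource.
%% The fun is the callback that would write to the backing store;
%% here it does nothing.
cache_tab:insert(push_tokens, <<"resource">>, <<"token">>, fun() -> ok end).

%% On disconnect: drop the entry once it has been persisted to the DB.
cache_tab:delete(push_tokens, <<"resource">>, fun() -> ok end).
```

With both options set, entries that are never explicitly deleted (for example after an unclean disconnect) still age out or get evicted instead of accumulating.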
