processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/en/ejabberd/

Segmentation fault on Ejabberd 16.8 #1420

Closed santiagopoli closed 6 years ago

santiagopoli commented 7 years ago

Hi,

I'm having constant crashes (every 3-6 hours on every node) in ejabberd. I'm currently using ejabberd 16.8 but the issue happened with older versions as well. Architecture-wise, I have four ejabberd nodes in a cluster. I'm also using redis as the session manager.

Fortunately, the crashes leave a core dump which I have analysed without luck.

Here is the output, step by step (I've highlighted each command, followed by its output):

gdb xavier/rel/xavier/erts-8.1.1/bin/beam.smp -core crashes/core.5_scheduler.14 -d otp_src_19.1/erts/emulator

Program terminated with signal SIGSEGV, Segmentation fault.
#0  do_minor (p=0x7f88582a04e0, live_hf_end=<optimized out>, mature=<optimized out>, mature_size=672, new_sz=610, objv=<optimized out>, nobj=1) at beam/erl_gc.c:1401
1401            val = *ptr;

(gdb) print ptr

$1 = <optimized out>

(gdb) print gval

$3 = 140221396262226
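
For readers following along, the value printed for gval can be read as a tagged Erlang term. In the 64-bit emulator the two low bits of a term word are the primary tag (00 header, 01 list cell, 10 boxed, 11 immediate), and for list and boxed terms the remaining bits are the pointer. A minimal Python sketch of that decoding follows; the function name `decode_primary_tag` is illustrative and not part of any ERTS tool.

```python
# Primary tag values used by the Erlang emulator for term words.
PRIMARY_TAGS = {
    0b00: "header",
    0b01: "list",
    0b10: "boxed",
    0b11: "immediate",
}


def decode_primary_tag(term):
    """Split a raw term word into (primary tag name, word with tag bits cleared)."""
    tag = PRIMARY_TAGS[term & 0b11]
    pointer = term & ~0b11
    return tag, pointer


tag, pointer = decode_primary_tag(140221396262226)  # gval from the gdb session
print(tag, hex(pointer))  # -> boxed 0x7f87d6839d50
```

Here gval decodes to a boxed pointer, which would be consistent with the collector faulting while following a boxed term at erl_gc.c:1401 (`val = *ptr;`).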

(gdb) source otp_src_19.1/erts/etc/unix/etp-commands.in

%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: 19
ERTS version: 8.1.1
Compile date: Tue Nov 15 00:12:21 2016
Arch: x86_64-unknown-linux-gnu
Endianness: Little
Word size: 64-bit
HiPE support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported and used
Debug compiled: no
Lock checking: no
Lock counting: no
Node name: xavier@jb4.apalabrados.com
Number of schedulers: 16
Number of async-threads: 100
--------------------------------------------------

(gdb) etp-process-info p

  Pid: <0.21391.423>
  State: on-heap-msgq | garbage-collecting | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
  Current function: stringprep:resourceprep/1
  CP: #Cp<ejabberd_hooks:safe_apply/3+0xc0>
  I: #Cp<gen:do_call/4+0x198>
  Heap size: 610
  Old-heap size: 1598
  Mbuf size: 0
  Msgq len: 1 (inner=1, outer=0)
  Parent: <0.1696.0>
  Pointer: (Process *) 0x7f88582a04e0

(gdb) etp-stacktrace p

% Stacktrace (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
#Cp<0x64e0ebf8>.
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
#Cp<0x59edb4f8>.
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
#Cp<p1_fsm:terminate/8+0x130>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Cp<terminate process normally>.

(gdb) etp-stackdump p

% Stackdump (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Catch<2533>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
[].
[].
[].
[].
[].
[].
[].
filter_packet.
{{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>},{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}}.
[].
#Cp<0x64e0ebf8>.
[].
[].
[].
[].
[].
[].
[].
[].
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
[].
[].
[].
[].
[].
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
#Catch<2410>.
#Cp<0x59edb4f8>.
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
[].
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
[].
[].
[].
[].
[].
{xmlel,#HeapBinary<0x8,0x73657270>,[{#HeapBinary<0x4,0x65707974>,#HeapBinary<0xb,0x76616e75,0xcc656c62>}],[]}.
[{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0>}].
#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>.
{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>}.
#Catch<2431>.
#Cp<p1_fsm:terminate/8+0x130>.
[].
[].
[].
[].
[].
[].
[].
[].
[].
inactive.
{{1481,631321,949637},<0.21391.423>}.
#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>.
#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>.
#HeapBinary<0x8,0x33333231>.
0.
{state,{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>},ejabberd_socket,#Ref<0.0.1048580.225209>,false,#HeapBinary<0x13,0x39393237,0x39383731,0x5c313931>,undefined,c2s,c2s_shaper,false,false,false,false,[verify_none,compression_none],true,{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,{{1481,631321,949637},<0.21391.423>},...}.
ejabberd_socket.
{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>}.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Catch<1665>.
[].
{state,{socket_state,gen_tcp,#Port<0.1144832>,<0.21390.423>},ejabberd_socket,#Ref<0.0.1048580.225209>,false,#HeapBinary<0x13,0x39393237,0x39383731,0x5c313931>,undefined,c2s,c2s_shaper,false,false,false,false,[verify_none,compression_none],true,{jid,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>},#HeapBinary<0x8,0x33333231>,#HeapBinary<0x12,0x612e626a,0x64617262,0x382d6d6f>,#HeapBinary<0x27,0x455f3261,0x2d424541,0x3443342d,0x412d3245,0x30323346>,{{1481,631321,949637},<0.21391.423>},...}.
session_established.
ejabberd_c2s.
{'$gen_event',{xmlstreamerror,#HeapBinary<0x15,0x204c4d58,0x6920617a,0x6962206f>}}.
<0.21391.423>.
normal.
#Cp<terminate process normally>.
#Catch<162>.

(gdb) etpf-stackdump p

% Stackdump (77): #Cp<ejabberd_hooks:safe_apply/3+0xc0>.
#Cp<ejabberd_hooks:run_fold1/4+0x1430>.
#Catch<2533>.
#Cp<ejabberd_router:do_route/3+0x4f0>.
[].
[].
[].
[].
[].
[].
[].
filter_packet.
<etpf-boxed 0xd7cb6222>.
[].
#Cp<0x64e0ebf8>.
[].
[].
[].
[].
[].
[].
[].
[].
#Cp<ejabberd_router_multicast:'-do_route_normal/3-lc$^0/1-0-'/3+0x90>.
[].
[].
[].
[].
[].
<etpf-boxed 0x5b6c8aa2>.
<etpf-boxed 0xd7cb6242>.
<etpf-boxed 0xd7188fea>.
#Catch<2410>.
#Cp<0x59edb4f8>.
<etpf-boxed 0x5b6c8aa2>.
<etpf-boxed 0xd7188fea>.
[].
#Cp<ejabberd_c2s:terminate/3+0x10c0>.
[].
[].
[].
[].
[].
<etpf-boxed 0x5b6c8aa2>.
<etpf-cons 0xd7cb6281>.
<etpf-boxed 0xd71891fa>.
<etpf-boxed 0xd7188fea>.
#Catch<2431>.
#Cp<p1_fsm:terminate/8+0x130>.
[].
[].
[].
[].
[].
[].
[].
[].
[].
inactive.
<etpf-boxed 0xd71892d2>.
<etpf-boxed 0xd7189222>.
<etpf-boxed 0xd71891fa>.
<etpf-boxed 0xd71891e2>.
0.
<etpf-boxed 0xd718968a>.
ejabberd_socket.
<etpf-boxed 0xd718925a>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Catch<1665>.
[].
<etpf-boxed 0xd718968a>.
session_established.
ejabberd_c2s.
<etpf-boxed 0xd71899ea>.
<0.21391.423>.
normal.
#Cp<terminate process normally>.
#Catch<162>.

TL;DR: I assume the problem is happening in ejabberd_router_multicast (which may have to do with inter-node communication), but I don't know how to proceed. The stack entry right after ejabberd_c2s, {'$gen_event',{xmlstreamerror,#HeapBinary<0x15,0x204c4d58,0x6920617a,0x6962206f>}}, seems important.

Thanks in advance!
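
The #HeapBinary<...> entries in the stack dump can be partly decoded by hand. The first field is the byte length. The remaining fields appear to be the low 32 bits of each 64-bit little-endian data word, so every second group of four bytes is missing from the dump; that reading is an inference from the output above, not documented etp behaviour. A minimal Python sketch under that assumption, with a hypothetical helper name:

```python
import struct


def decode_heap_binary(size, words, gap="?"):
    """Rebuild the visible bytes of an etp #HeapBinary<size,w0,w1,...> entry.

    Assumes each printed word is the low half of a 64-bit little-endian
    data word; the four bytes that are not shown are rendered as `gap`.
    """
    out = bytearray()
    for word in words:
        out += struct.pack("<I", word)  # the four bytes that were printed
        out += gap.encode("ascii") * 4  # the four bytes that were not
    return out[:size].decode("latin-1")


# Payload of the xmlstreamerror event from the stack dump.
print(decode_heap_binary(0x15, [0x204C4D58, 0x6920617A, 0x6962206F]))
# -> XML ????za i????o bi?
```

The visible fragments fit a 21-byte message such as "XML stanza is too big". The same helper turns the xmlel name #HeapBinary<0x8,0x73657270> into "pres????", which fits a presence stanza.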

zinid commented 7 years ago

Do you remember in which version the crashes started?

zinid commented 7 years ago

What version of expat is ejabberd compiled against?

santiagopoli commented 7 years ago

We are using expat 2.1.0 (2.1.0-6+deb8u3, to be exact):

dpkg -s expat

Package: expat
Status: install ok installed
Priority: optional
Section: text
Installed-Size: 42
Maintainer: Laszlo Boszormenyi (GCS) <gcs@debian.org>
Architecture: amd64
Version: 2.1.0-6+deb8u3
Depends: libc6 (>= 2.14), libexpat1 (>= 2.0.1)
Description: XML parsing C library - example application
 This package contains xmlwf, an example application of expat, the C
 library for parsing XML.  The arguments to xmlwf are one or more
 files which are each to be checked for XML well-formedness.
Homepage: http://expat.sourceforge.net

dpkg -s libexpat1

Package: libexpat1
Status: install ok installed
Priority: optional
Section: libs
Installed-Size: 347
Maintainer: Laszlo Boszormenyi (GCS) <gcs@debian.org>
Architecture: amd64
Multi-Arch: same
Source: expat
Version: 2.1.0-6+deb8u3
Depends: libc6 (>= 2.14)
Pre-Depends: multiarch-support
Conflicts: wink (<= 1.5.1060-4)
Description: XML parsing C library - runtime library
 This package contains the runtime, shared library of expat, the C
 library for parsing XML. Expat is a stream-oriented parser in
 which an application registers handlers for things the parser
 might find in the XML document (like start tags).
Homepage: http://expat.sourceforge.net

santiagopoli commented 7 years ago

I have just found something. If I run the following, dpkg reports that the package is not installed:

dpkg -s lib64expat1

dpkg-query: package 'lib64expat1' is not installed and no information is available
Use dpkg --info (= dpkg-deb --info) to examine archive files,
and dpkg --contents (= dpkg-deb --contents) to list their contents.

Can the problem be related to the fact we are not installing the amd64 version of expat? (We are using a 64 bit system)

prefiks commented 7 years ago

That libexpat1 package is already 64 bit version

santiagopoli commented 7 years ago

My bad, it says Architecture: amd64 right in the middle :(
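
Besides the package metadata, the word size of a shared library can be read from the file itself: byte 4 of an ELF header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. A minimal Python sketch of that check; the function name is illustrative, and the library path mentioned afterwards is an assumption about a typical Debian multiarch layout rather than something taken from this thread.

```python
def elf_class(header):
    """Return the word size declared in the first bytes of an ELF file."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_CLASS lives at offset 4: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


def elf_class_of_file(path):
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))
```

For example, `elf_class_of_file("/lib/x86_64-linux-gnu/libexpat.so.1")` would be expected to return "64-bit" on an amd64 install, assuming the library lives at that path.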

zinid commented 7 years ago

@santiagopoli Do you remember in which version the crashes started?

santiagopoli commented 7 years ago

No, but I got this same crash using ejabberd 16.6. It didn't happen to us with ejabberd 15, but we were having a lot of other crashes at that time (due to our own code), so it may have happened then as well.

BTW, we're using Ejabberd as an Elixir dependency.

santiagopoli commented 7 years ago

Could this be related to https://bugs.erlang.org/browse/ERL-304 ?

zinid commented 7 years ago

Yes, this is probably the same bug.

shanjianping commented 7 years ago

santiagopoli, did you resolve this crash? We recently ran into a very similar crash as well.

zinid commented 7 years ago

Guys, we're reviewing our C code, please be patient (we have received plenty of core dumps like this one).

santiagopoli commented 7 years ago

No, we haven't solved it yet, but we think it only happens within a cluster. We tried to reproduce this bug with a single, larger node and didn't see the error (but that could be pure luck). Having 4 interconnected nodes produces this crash 2+ times a day. Note that we often have 100k users connected at the same time.

shanjianping commented 7 years ago

Have you guys figured out the root cause? We've temporarily resolved this issue by rolling back from fast_xml to p1_xml. Hopefully this info is of some use to others.

zinid commented 7 years ago

@shanjianping yes, that's important info, thanks

tsaqova commented 6 years ago

I'm experiencing the same symptoms with this issue. If you don't mind me asking:

zinid commented 6 years ago

The root cause is now covered by the test here: https://github.com/processone/fast_xml/commit/6cfd311c6fa6ed94de7b9bdb255751c26e66094b Without the patch you would get a segfault at line 406. Simply put, if you send more data after the server generates an xml-too-big error, it segfaults, because the server attempts to reuse freed structures. This also happens in a non-clustered environment, and it's readily reproducible: revert the patch and run the test.
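
The failure mode described above can be modelled in a few lines. The sketch below is not the fast_xml code or the actual patch; it is a toy Python model, with a hypothetical class name and size limit, of the general guard: once the parser has reported xml-too-big and released its state, later input must be rejected instead of being fed into that state again.

```python
class StreamParser:
    """Toy model of a size-limited stream parser (hypothetical, not fast_xml)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = bytearray()  # stands in for the native parser state
        self.failed = False

    def feed(self, data):
        if self.failed:
            # The state was already released; touching it again is the
            # use-after-free described above, so drop the input instead.
            return "ignored"
        if len(self.buffer) + len(data) > self.max_size:
            self.buffer = None  # models freeing the native structures
            self.failed = True
            return "xml-too-big"
        self.buffer += data
        return "ok"


parser = StreamParser(max_size=8)
print(parser.feed(b"<a>"))       # -> ok
print(parser.feed(b"x" * 10))    # -> xml-too-big
print(parser.feed(b"more"))      # -> ignored
```

Without the `failed` check, the third call would try to append to the released buffer, which is the Python analogue of reusing freed structures.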

digz6666 commented 6 years ago

The following trick fixed the problem on my fresh ejabberd install on Ubuntu 16.04.3:

https://askubuntu.com/questions/865578/is-there-a-way-to-override-a-hat-child-profile-in-an-apparmor-local-override-fil

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.