saleyn / erlexec

Execute and control OS processes from Erlang/OTP
https://hexdocs.pm/erlexec/readme.html
Other
527 stars 139 forks source link

Exec-port can hang when receiving SIGCHLD #90

Closed SebCav closed 7 years ago

SebCav commented 7 years ago

Hi, I have an application that uses erlexec to repeatedly run a shell script for long periods of time. This script, under certain conditions, only prints a couple of lines to standard output and exits. The output is received by the application and logged via lager. Under these conditions, my application sometimes fails with a timeout error on a call to exec:run. It seems that exec-port sometimes hangs when receiving a SIGCHLD. For example, in the below backtrace my application received only one line of output, and it seems that exec port hanged while dealing with the second line of output because it has been interrupted by the SIGCHLD.

(gdb) bt
#0  0x00007f006c4471ab in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f006c3d32da in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f006c3d0e15 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f006cc732d8 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x000000000041446c in __gnu_cxx::new_allocator > >::allocate(unsigned long, void const*) () at /usr/include/c++/4.9/ext/new_allocator.h:104
#5  0x00000000004141e6 in __gnu_cxx::__alloc_traits > > >::allocate(std::allocator > >&, unsigned long) () at /usr/include/c++/4.9/ext/alloc_traits.h:182
#6  0x00000000004139ab in std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_get_node() ()
    at /usr/include/c++/4.9/bits/stl_tree.h:385
#7  0x0000000000412455 in std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_create_node(std::pair const&) () at /usr/include/c++/4.9/bits/stl_tree.h:395
#8  0x0000000000411598 in std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_insert_(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::pair const&) () at /usr/include/c++/4.9/bits/stl_tree.h:1143
#9  0x0000000000410081 in std::_Rb_tree, std::_Select1st >, std::less, std::allocator > >::_M_insert_unique(std::pair const&) () at /usr/include/c++/4.9/bits/stl_tree.h:1503
#10 0x000000000040e8bd in std::map, std::allocator > >::insert(std::pair const&) () at /usr/include/c++/4.9/bits/stl_map.h:630
#11 0x00000000004051e1 in check_child(int, int) () at c_src/exec.cpp:478
#12 0x0000000000405273 in gotsigchild(int, siginfo_t*, void*) () at c_src/exec.cpp:495
#13 
#14 0x00007f006c3ceb93 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007f006c3d0e20 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007f006cc732d8 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007f006ccd2e59 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007f006ccd47d5 in char* std::string::_S_construct(char const*, char const*, std::allocator const&, std::forward_iterator_tag) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007f006ccd4c06 in std::basic_string, std::allocator >::basic_string(char const*, std::allocator const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x000000000040c488 in ei::atom_t::atom_t(char const*) () at c_src/ei++.hpp:321
#21 0x0000000000409fc8 in send_ospid_output(int, char const*, char const*, int) () at c_src/exec.cpp:1700
#22 0x0000000000409239 in process_pid_output(CmdInfo&, int) () at c_src/exec.cpp:1525
#23 0x000000000040e14c in CmdInfo::process_stream_data(int, fd_set*, fd_set*) () at c_src/exec.cpp:420
#24 0x0000000000405fd3 in main () at c_src/exec.cpp:695

My suspect is that the gotsigchild signal handler is calling functions that are not allowed inside a signal handler (I found this with a quick lookup in Google link). If you want I can provide some test code to reproduce the issue.

saleyn commented 7 years ago

Yes, a test case that can reproduce it would be helpful.

SebCav commented 7 years ago

Ok, Below you can find test_script.sh, which is a script to be put in /tmp, and test_exec.erl (it's the first time I file a GitHub issue and I can't find a way to attach source files, so I copy and paste them here...). Open an Erlang shell, start erlexec, call test_exec:start_link() (it's a gen_server, a stripped/adapted version of the code in my application). Then you can call test_exec:run_scripts(1000) to repeatedly call the script 1000 times. The scenario in the real environment is a little different and it takes a day or so to incur the problem, but in my dev system erlexec usually hangs with two or three calls to run_scripts(1000).

test_script.sh:

#!/bin/bash

sleep  .0$[ ( $RANDOM % 10 ) + 1 ]s
echo "This is a test script"
sleep  .0$[ ( $RANDOM % 10 ) + 1 ]s
echo Exit status 12
exit 12

test_exec.erl:

`%%%-------------------------------------------------------------------
%%% @author SebCav
%%% @doc
%%%
%%% @end
%%% Created : 14 Nov 2016
%%%-------------------------------------------------------------------
-module(test_exec).

-behaviour(gen_server).

-export([start_link/0]).
-export([init/1,
         handle_call/3,
         handle_cast/2,
         handle_info/2,
         terminate/2,
         code_change/3,
         run_scripts/1]).

-define(SERVER, ?MODULE).

-record(state, {
          managed_pids::list() %% List of all the processes we manage
         }).

%%--------------------------------------------------------------------
%% @doc
%% Starts the server
%%
%% @spec start_link() -> {ok, Pid} | ignore | {error, Error}
%% @end
%%--------------------------------------------------------------------
start_link() ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).

%%%===================================================================
%%% gen_server callbacks
%%%===================================================================

%%--------------------------------------------------------------------
%% @private
%% @doc
%% Initializes the server
%%
%% @spec init(Args) -> {ok, State} |
%%                     {ok, State, Timeout} |
%%                     ignore |
%%                     {stop, Reason}
%% @end
%%--------------------------------------------------------------------
init([]) ->
    %exec:start([]),
    {ok, #state{managed_pids = []}}.

run_scripts(0) -> 
    lager:info("[~p:run_scripts] Completed", [?MODULE]),
    ok;

run_scripts(Count) ->
    run_script(),
    timer:sleep(10),
    run_scripts(Count-1).

run_script() ->
    gen_server:cast(?SERVER, start).

%%--------------------------------------------------------------------
%% @private
%% @doc
%% Handling call messages
%%
%% @spec handle_call(Request, From, State) ->
%%                                   {reply, Reply, State} |
%%                                   {reply, Reply, State, Timeout} |
%%                                   {noreply, State} |
%%                                   {noreply, State, Timeout} |
%%                                   {stop, Reason, Reply, State} |
%%                                   {stop, Reason, State}
%% @end
%%--------------------------------------------------------------------
handle_call(_, _From, State) ->
    {reply, ok, State}.

%%--------------------------------------------------------------------
%% @private
%% @doc
%% Handling cast messages
%%
%% @spec handle_cast(Msg, State) -> {noreply, State} |
%%                                  {noreply, State, Timeout} |
%%                                  {stop, Reason, State}
%% @end
%%--------------------------------------------------------------------
handle_cast(start, State) ->
    lager:info("[~p:handle_cast] Start", [?MODULE]),
    Result = exec:run("/tmp/test_script.sh", [stdout, stderr, monitor]),
    lager:debug("[~p:handle_cast] Result=~p", [?MODULE, Result]),
    {noreply, State};

handle_cast(_Msg, State) ->
    lager:error("[~p:handle_cast] unknown!", [?MODULE]),
    {noreply, State}.

%%--------------------------------------------------------------------
%% @private
%% @doc
%% Handling all non call/cast messages
%%
%% @spec handle_info(Info, State) -> {noreply, State} |
%%                                   {noreply, State, Timeout} |
%%                                   {stop, Reason, State}
%% @end
%%--------------------------------------------------------------------

handle_info({Channel, OsPid, Data}, State) when Channel == stdout orelse
                                                Channel == stderr ->
    lager:debug("[~p:handle_info] [~p:~p] ~p", [?MODULE, OsPid, Channel, binary_to_list(Data)]),
    {noreply, State};

handle_info({'DOWN', _Ref, process, Pid, _ExitStatus}, State) ->
    lager:debug("[~p:handle_info] DOWN: ~p", [?MODULE, Pid]),
    {noreply, State}.

%%--------------------------------------------------------------------
%% @private
%% @doc
%% This function is called by a gen_server when it is about to
%% terminate. It should be the opposite of Module:init/1 and do any
%% necessary cleaning up. When it returns, the gen_server terminates
%% with Reason. The return value is ignored.
%%
%% @spec terminate(Reason, State) -> void()
%% @end
%%--------------------------------------------------------------------
terminate(_Reason, _State) ->
    ok.

%%--------------------------------------------------------------------
%% @private
%% @doc
%% Convert process state when code is changed
%%
%% @spec code_change(OldVsn, State, Extra) -> {ok, NewState}
%% @end
%%--------------------------------------------------------------------
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
saleyn commented 7 years ago

What is your platform?

SebCav commented 7 years ago

Both production and development environment I used for testing are Debian 7 (uname output: Linux gsd-00012 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux).

saleyn commented 7 years ago

Check out the HEAD. I believe it fixes this issue.

SebCav commented 7 years ago

Yes I confirm that it is fixed with the test code. Now I'm going to put it in one of our production systems to confirm that the "real" problem goes away too. Since it used to happen in one/two days, I'll let you know in the weekend if everything works. Thank you, great work ;)

SebCav commented 7 years ago

Working in production system too, just checked after several days of consecutive uptime