iamaleksey / oserl

A fork of http://sourceforge.net/projects/oserl/ with a few patches.
http://github.com/iamaleksey/oserl

gen_esme:bind_transceiver/3 issue #1

Closed: elincognito closed this issue 13 years ago

elincognito commented 13 years ago

I'm facing an issue with a trx connection: if the incoming and outgoing (submit_sm/deliver_sm) throughput is high enough, the gen_esme server crashes with a gen_server call timeout, and it takes about 2 minutes to crash. I need this for a production system next month :s, and my SMSC provider only supports trx connections. I was using tx and rx connections with no problem with other SMSCs. I also noticed that this only happens when receiving and sending messages simultaneously; if I'm only sending, or only receiving, the issue does not occur. I'm using SMPPSim from Logica as a simulator. Also, does oserl support async SMPP sessions? The doc is very vague on this, and it also doesn't mention the window width or whether it's configurable; it only mentions rps.

Looking at the process_info of the gen_esme connection process, I noticed that the server is "locked" in some operation. Here follow the SASL error report and process_info:

ERROR REPORT <0.160.0> 2011-04-29 14:31:25

** Generic server <0.160.0> terminating
** Last message in was
   {'EXIT',<0.158.0>,
    {timeout,
     {gen_server,call,
      [<0.158.0>,
       {{submit_sm,[{short_message,[0,72,0,105,0,32,0,116,0,104,0,101,0,114,
                                    0,101,0,32,0,84,0,82,0,88]},
                    {destination_addr,"351913768304"}, {source_addr_ton,5},
                    {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                    {dest_addr_npi,1}, {source_addr_npi,1},
                    {registered_delivery,0}, {priority_flag,0},
                    {schedule_delivery_time,[]}, {validity_period,[]}]},
        [{uid,"8723648C5319FF0AB78864A5A6BD94A8F66DE233"}]},
       120000]}}}
** When Server state == {st,<0.158.0>,undefined,12751,1083,{1304,83762,900153}}
** Reason for termination ==
** {timeout,
    {gen_server,call,
     [<0.158.0>,
      {{submit_sm,[{short_message,[0,72,0,105,0,32,0,116,0,104,0,101,0,114,
                                   0,101,0,32,0,84,0,82,0,88]},
                   {destination_addr,"351913768304"}, {source_addr_ton,5},
                   {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                   {dest_addr_npi,1}, {source_addr_npi,1},
                   {registered_delivery,0}, {priority_flag,0},
                   {schedule_delivery_time,[]}, {validity_period,[]}]},
       [{uid,"8723648C5319FF0AB78864A5A6BD94A8F66DE233"}]},
      120000]}}

ok
(smpp_con@arsenio-desktop)135> rb:show(2).

CRASH REPORT <0.160.0> 2011-04-29 14:31:25

Crashing process
initial_call     {cl_queue_srv,init,['Argument__1']}
pid              <0.160.0>
registered_name  []
error_info       {exit,
                  {timeout,
                   {gen_server,call,
                    [<0.158.0>,
                     {{submit_sm,
                       [{short_message,[0,72,0,105,0,32,0,116,0,104,0,101,
                                        0,114,0,101,0,32,0,84,0,82,0,88]},
                        {destination_addr,"351913768304"},
                        {source_addr_ton,5},
                        {source_addr,"7337"},
                        {data_coding,8},
                        {dest_addr_ton,1},
                        {dest_addr_npi,1},
                        {source_addr_npi,1},
                        {registered_delivery,0},
                        {priority_flag,0},
                        {schedule_delivery_time,[]},
                        {validity_period,[]}]},
                      [{uid,"8723648C5319FF0AB78864A5A6BD94A8F66DE233"}]},
                     120000]}},
                  [{gen_server,terminate,6},{proc_lib,init_p_do_apply,3}]}
ancestors        [smpp_connection,smpp_connector_sup,<0.51.0>]
messages         []
links            []
dictionary       []
trap_exit        true
status           running
heap_size        17711
stack_size       24
reductions       59200658

ok
(smpp_con@arsenio-desktop)136> erlang:process_info(erlang:whereis(smpp_connection)).
[{registered_name,smpp_connection},
 {current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,40},
 {messages,
  [{'$gen_call',{<0.10847.4>,#Ref<0.0.43.198005>},
    {{submit_sm,[{short_message,[0,72,0,105,0,32,0,116,0,104,0|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,1}, {source_addr_npi,1},
                 {registered_delivery,0}, {priority_flag,0},
                 {schedule_delivery_time,[]}, {validity_period,[]}]},
     [{uid,"F68E1631C5756718C4D2A0F7961069BD992D208"}]}},
   {'$gen_call',{<0.10848.4>,#Ref<0.0.43.198013>},
    {{submit_sm,[{short_message,[0,72,0,105,0,32,0,116,0,104|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,1}, {source_addr_npi,1},
                 {registered_delivery,0}, {priority_flag,0},
                 {schedule_delivery_time,[]}, {validity_period,...}]},
     [{uid,"E87007766E3D5E5AD6BB68835CBBEEAD5FB19237"}]}},
   {'$gen_call',{<0.10841.4>,#Ref<0.0.43.198016>},
    {handle_deliver_sm,{5,0,22410,
                        [{short_message,"RE 2856249 19930922"},
                         {sm_default_msg_id,0}, {data_coding,0},
                         {replace_if_present_flag,0}, {registered_delivery,0},
                         {validity_period,[]}, {schedule_delivery_time,[]},
                         {priority_flag,...}, {...}|...]}}},
   {'$gen_call',{<0.10849.4>,#Ref<0.0.43.198019>},
    {{submit_sm,[{short_message,[0,72,0,105,0,32,0,116|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,1}, {source_addr_npi,1},
                 {registered_delivery,0}, {priority_flag,...}, {...}|...]},
     [{uid,"3E0CBFEA73A8B709ACF1EE7D8FA8898E9859DD71"}]}},
   {'$gen_call',{<0.10845.4>,#Ref<0.0.43.198021>},
    {{submit_sm,[{short_message,[0,72,0,105,0,32,0|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,1}, {source_addr_npi,1},
                 {registered_delivery,...}, {...}|...]},
     [{uid,"C9352FD329A7C5FED4ECC9725BA52CE433E0DDB3"}]}},
   {'$gen_call',{<0.10850.4>,#Ref<0.0.43.198036>},
    {{submit_sm,[{short_message,[0,72,0,105,0,32|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,1}, {source_addr_npi,...}, {...}|...]},
     [{uid,"5B07AFCB6EA1BE25CA1CCEC8925C6A1DA2AAB0F4"}]}},
   {'$gen_call',{<0.10851.4>,#Ref<0.0.43.198041>},
    {{submit_sm,[{short_message,[0,72,0,105,0|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,1},
                 {dest_addr_npi,...}, {...}|...]},
     [{uid,"962DA498590F67AF868DDAC9F95A72E7D6640C17"}]}},
   {'$gen_call',{<0.10852.4>,#Ref<0.0.43.198046>},
    {{submit_sm,[{short_message,[0,72,0,105|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,"7337"}, {data_coding,8}, {dest_addr_ton,...},
                 {...}|...]},
     [{uid,"8E1E1D7F548C82D443C98FDB6E4FFD50205DBB9F"}]}},
   {'$gen_call',{<0.10853.4>,#Ref<0.0.43.198051>},
    {{submit_sm,[{short_message,[0,72,0|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,[...]}, {data_coding,...}, {...}|...]},
     [{uid,"39974B8D22FEC1158E9B34D669BF781CC5794205"}]}},
   {'$gen_call',{<0.10854.4>,#Ref<0.0.43.198056>},
    {{submit_sm,[{short_message,[0,72|...]},
                 {destination_addr,"351913768304"}, {source_addr_ton,5},
                 {source_addr,...}, {...}|...]},
     [{uid,"8E8F476574C69BDBB2488F71E1876892BD320FB6"}]}},
   {'$gen_call',{<0.10855.4>,#Ref<0.0.43.198061>},
    {{submit_sm,[{short_message,[0|...]}, {destination_addr,[...]},
                 {source_addr_ton,...}, {...}|...]},
     [{uid,"2A7DF716CED714200404E10324A90DCA183A49FD"}]}},
   {'$gen_call',{<0.10856.4>,#Ref<0.0.43.198066>},
    {{submit_sm,[{short_message,[...]}, {destination_addr,...}, {...}|...]},
     [{uid,"4969C8F1E1D21BA7604D9393B3624A6B1BA7B649"}]}},
   {'$gen_call',{<0.10857.4>,#Ref<0.0.43.198071>},
    {{submit_sm,[{short_message,...},{...}|...]},[{uid,[...]}]}},
   {'$gen_call',{<0.10858.4>,#Ref<0.0.43.198076>},
    {{submit_sm,[{...}|...]},[{uid,...}]}},
   {'$gen_call',{<0.10859.4>,#Ref<0.0.43.198081>},{{submit_sm,[...]},[{...}]}},
   {'$gen_call',{<0.10860.4>,#Ref<0.0.43.198086>},{{submit_sm,...},[...]}},
   {'$gen_call',{<0.10861.4>,#Ref<0.0.43.198091>},{{...},...}},
   {'$gen_call',{<0.10862.4>,...},{...}},
   {'$gen_call',{...},...},
   {'$gen_call',...},
   {...}|...]},

iamaleksey commented 13 years ago

Yes, I know. I talked to the original author (Enrique Marcote Peña) about this issue last year.

"We implemented all SMPP for completion, but our requirement was to build a transmitter ESME (and that is mainly in what we focused on so far)."

This is why rx/trx support is not as good, and trx connections are the worst. A trx session can easily be locked if the session calls CallbackMod:handle_operation/2 while another process calls, say, submit_sm/2. Both are synchronous, so each process sits in a gen_server call waiting for the other, and neither can answer until one of the calls times out.

To fix the rx/trx issue properly, a big rewrite of the sessions is needed. But I had this problem last year and somewhat fixed it by adding async support (SHA: 8c5bf05338f7bc9f6418fa523ea160632c1568a7). So, return 'noreply' from handle_operation/2 and then call gen_esme_session:reply/2 with the reply.
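In sketch form (illustrative only, not verbatim oserl code: the Session argument is assumed to be the session pid, and the sequence-number lookup mirrors the smpp_operation:get_value/2 calls later in this thread):

%% Sketch: defer the reply instead of blocking the session.
handle_operation(Session, {deliver_sm, Pdu}) ->
    SeqNum = smpp_operation:get_value(sequence_number, Pdu),
    %% Hand the work off so the session keeps reading the socket,
    %% then acknowledge the PDU from the worker process.
    spawn(fun() -> gen_esme_session:reply(Session, {SeqNum, {ok, []}}) end),
    noreply;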

If this isn't enough, have a look at https://github.com/essiene/smpp34/

elincognito commented 13 years ago

Thanks for the reply, iamaleksey. I will try it next Monday when I get back to work.

So in my gen_esme behaviour implementation callbacks, I should return noreply from handle_operation, and later reply with gen_esme_session:reply/2. Is that correct? In this way the session will work in an async fashion? And what about window width support: is that "concept" implemented in oserl?

iamaleksey commented 13 years ago

My use cases are mostly transmitters as well, so I implement window width enforcement in my callbacks. But it definitely is something that should be supported natively by esme/mc sessions.

elincognito commented 13 years ago

Hi there again, and sorry to keep bothering you. I seem to be doing something wrong; here is a snippet of what I've done:

[code]

handle_deliver_sm(Pdu, From = {Pid, Ref}, St) ->
    spawn_link(
        fun() ->
            format(St#st.silent, "Deliver SM: ~p~n", [Pdu]),
            Reply = {ok, []},
            SeqNum = smpp_operation:get_value(sequence_number, Pdu),
            io:format("Pid ~p info: ~p~n", [Pid, process_info(Pid)]),
            gen_esme_session:reply(Pid, {SeqNum, Reply})
        end),
    {noreply, St#st{msg_id = St#st.msg_id + 1, recv = St#st.recv + 1}}.

[/code]

However, this doesn't seem to be working, as gen_esme_session:handle_sync_event is not being called? >.<

elincognito commented 13 years ago

OK, my misunderstanding. Now it's: {reply, noreply, St#st{msg_id = St#st.msg_id + 1, recv = St#st.recv + 1}}.

However, I still get the timeout (lock) issue :s

iamaleksey commented 13 years ago

Based on your last comment, you are still making a gen_server call from handle_operation, thus not fixing anything. You should just return noreply and make a non-blocking cast instead of a call. Here is a snippet of some working code:

%% Handle SMSC-issued deliver_sm operation.
handle_operation(Client, {deliver_sm, {_CmdId, _Status, SeqNum, Body}}) ->
    gen_server:cast(Client, {handle_deliver_sm, SeqNum, Body}),
    noreply;
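
For completeness, the receiving side of that cast might look roughly like this (a sketch only; the session field in the st record, holding the gen_esme_session pid, is a hypothetical name):

%% Sketch: process the deliver_sm outside the session, then reply to it.
handle_cast({handle_deliver_sm, SeqNum, _Body}, #st{session = Session} = St) ->
    %% ... handle the message here, without ever blocking the session ...
    gen_esme_session:reply(Session, {SeqNum, {ok, []}}),
    {noreply, St}.
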
elincognito commented 13 years ago

Before anything else, thanks a lot for the replies!

OK, I was using the gen_esme behaviour, so I changed it to do the cast and return noreply:

[code]
%% On oserl gen_esme:
handle_operation(SrvRef, {deliver_sm, Pdu}) ->
    gen_server:cast(SrvRef, {handle_deliver_sm, Pdu}),
    noreply.

handle_cast({handle_deliver_sm, Pdu}, St) ->
    io:format("ST MOD : ~p~n", [St#st.mod]),
    R = pack((St#st.mod):handle_deliver_sm(Pdu, self(), St#st.mod_st), St),
    io:format("Pack: ~p~n", [R]),
    R;

[/code]

And on my callback module (called smpp_connection):

[code]
handle_deliver_sm(Pdu, From, St) ->
    io:format("Received PPP~n"),
    pdu_submitter:buffer_pdu(Pdu),
    ?DEBUG_LOG("received pdu: ~p", [Pdu]),
    %% Reply = {ok, [{message_id, integer_to_list(St#st.msg_id)}]},
    %% Reply = {ok, [{congestion_state, 1}]},
    Reply = {ok, []},
    SeqNum = smpp_operation:get_value(sequence_number, Pdu),
    %% io:format("Pid ~p info: ~p~n", [Pid, process_info(Pid)]),
    gen_esme_session:reply(From, {SeqNum, Reply}),
    {reply, Reply, St#st{msg_id = St#st.msg_id + 1, recv = St#st.recv + 1}}.

[/code]

Doing erlang:process_info(erlang:whereis(smpp_connection)), I see the message queue of my smpp_connection:

[code]
{messages,[{'$gen_sync_all_state_event',{<0.1536.0>,#Ref<0.0.0.10076>},
                                        {reply,{20328,{ok,[]}}}},
        {'$gen_call',{<0.1554.0>,#Ref<0.0.0.10097>},
                     {{submit_sm,[{short_message,[5,0,3,1,3,1,0,78,0,79|...]},
                                  {esm_class,64},
                                  {destination_addr,"913768304"},
                                  {source_addr_ton,5},
                                  {source_addr,"3838"},
                                  {data_coding,8},
                                  {dest_addr_ton,1},
                                  {dest_addr_npi,1},
                                  {source_addr_npi,1},
                                  {registered_delivery,0},
                                  {priority_flag,0},
                                  {schedule_delivery_time,...},
                                  {...}]},
                      [{uid,"35137F7F55AE642A8A70EC89A85AE35025EBC203"}]}},
        {'$gen_cast',{handle_deliver_sm,{5,0,20330,
                                         [{short_message,"RE 2801491 19741022"},
                                          {sm_default_msg_id,0},
                                          {data_coding,0},
                                          {replace_if_present_flag,0},
                                          {registered_delivery,0},
                                          {validity_period,[]},
                                          {schedule_delivery_time,[]},
                                          {priority_flag,0},
                                          {protocol_id,...},
                                          {...}|...]}}},
        {'$gen_cast',{handle_deliver_sm,{5,0,20332,
                                         [{short_message,"RE 1314745 19430416"},
                                          {sm_default_msg_id,0},
                                          {data_coding,0},
                                          {replace_if_present_flag,0},
                                          {registered_delivery,0},
                                          {validity_period,[]},
                                          {schedule_delivery_time,[]},
                                          {priority_flag,...},
                                          {...}|...]}}}]}

[/code]

Am I doing the gen_esme_session:reply wrong?

iamaleksey commented 13 years ago

You seem to be using gen_esme, not gen_esme_session directly. If so, then I can't help you with your issue. If not, then publish the whole module in a gist and I'll have a look.

elincognito commented 13 years ago

I've pasted a simplified version here: http://pastebin.com/Rnr7NTKF. I have used it, and the error occurs in this version.

Besides this, the only change I've attempted was in the gen_esme.erl code (all cloned from your repo, for the nice patches you have :)).

I've also tried to change gen_esme.erl as you said, and changed handle_operation to:

[code]

handle_operation(SrvRef, {deliver_sm, Pdu}) ->
    gen_server:cast(SrvRef, {handle_deliver_sm, Pdu}),
    noreply.

[/code]

and added a matching clause for it in gen_esme.erl:

[code]

handle_cast({handle_deliver_sm, Pdu}, St) ->
    io:format("ST MOD : ~p~n", [St#st.mod]),
    R = pack((St#st.mod):handle_deliver_sm(Pdu, self(), St#st.mod_st), St),
    {noreply, St};

[/code]

The reason for using gen_esme instead of gen_esme_session was pretty much the persistent queues and throttling control given by gen_esme.

Once more, thanks for your precious help!

iamaleksey commented 13 years ago

I see. You didn't really fix the problem, you just moved it. I'm afraid that adding async support to gen_esme would take more than what you've done. Can't help you here, sorry. I still suggest using gen_esme_session directly: that'll give you a lot more control, and you'll still be able to implement persistent queues and throttling yourself.

elincognito commented 13 years ago

iamaleksey, thanks a lot for your help. I decided to go with gen_esme_session directly and implemented the most basic trx session possible (with a gen_server, since I suspected the problem could lie in the gen_fsm usage in gen_esme together with the gen_fsm in gen_esme_session). Strangely, for now no lock has occurred under a stress test, so I'm going to take a look at gen_esme to implement persistent queuing and throttling. There still remains the window width issue and how to combine that with "gen_esme" queueing. Any tip on that? I'm thinking of having the queue release messages at the desired throughput per second, and of keeping a window_width_max and a current_win_width on the gen_server, incrementing the counter when a submit_sm is done and decrementing it when I get a reply from the SMSC. When the window is full, I would just enqueue the message back on the queue, and maybe pause the queue for a second or so. Do you see any possible issues with that implementation (see the sketch below)? And once more, thanks a lot for your help. If by any chance the lock issue appears again, I will come back to you :)
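
For illustration, the counter-based window I have in mind might look roughly like this (a sketch with hypothetical names: send_submit_sm/2 and requeue/2 stand in for the real send and queue operations):

[code]
%% Sketch of the proposed submit_sm window (hypothetical helpers).
handle_call({submit_sm, Params}, _From,
            #st{current_win_width = N, window_width_max = Max} = St)
        when N < Max ->
    ok = send_submit_sm(Params, St),  %% assumed helper that performs the send
    {reply, ok, St#st{current_win_width = N + 1}};
handle_call({submit_sm, Params}, _From, St) ->
    requeue(Params, St),              %% assumed helper: back onto the queue
    {reply, {error, window_full}, St}.

%% A submit_sm_resp from the SMSC frees one slot in the window.
handle_cast({submit_sm_resp, _SeqNum}, #st{current_win_width = N} = St) ->
    {noreply, St#st{current_win_width = max(N - 1, 0)}}.
[/code]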

iamaleksey commented 13 years ago

Well, if you only need to enforce a window size for submit_sm, then it's easy. Just go with your solution. Maybe block the calling client if the window is full (if submit is synchronous at that level). I used something like this as well.

But if you need to enforce a window for deliver_sm then there is little you can do with the current session implementation.

Good luck (:

elincognito commented 13 years ago

I only need to enforce a window for submit_sm. Btw, I have currently implemented the trx solution with gen_esme_session directly, and so far I have found no deadlock. Let's hope it stays like this.

One more topic about gen_esme_session: on SourceForge, mpique says that a spawn_link was removed from handle_peer_operation in gen_esme_session and gen_mc_session. Could that possibly be related to the deadlock issue I found? https://sourceforge.net/projects/oserl/forums/forum/350016/topic/3910746