Deadlock in getSessionState() - breaks usage as client tranceiver mode and as server

GoogleCodeExporter commented 8 years ago

The current jsmpp trunk has a major deadlock issue that is apparent when using 
the library as client in transceiver mode or when using to implement a smpp 
server.

Pseudo-code flow that demonstrates this:
1. Server opens listener and accept bind for transceiver (or receiver).
2. During accept, server launches a "runnable" whose job is to send deliver_sm 
to client (using the SMPPServerSession instance passed to the runnable ctor).
3. Everything fine and dandy, client sending submit_sm (and others), server 
sending deliver_sm to client.
4. Client decides to unbind, server receives unbind and eventually calls 
AbstractSessionContext.unbound() which is synchronized, entering lock on the 
SMPPServerSession object (which inherits from it).
5. unbound() calls changeState() which eventually calls fireStateChanged() and 
enters into the server code (listener).
6. In the callback, server waits for DLR runnable to stop, how does DLR 
runnable know session is closing ? Runnable polls getSessionState() in loop 
before sending each deliver_sm (by calling 
SMPPServerSession.deliverShortMessage).
7. getSessionState is also marked with "synchronized", so now runnable needs to 
enter lock on same instance, which is already locked by unbound()
8. unbound waiting for callback and callback waiting for DLR runnable to exit, 
DLR runnable needs to lock the same instance to getSessionState in order to 
exit.
9. Result: Deadlock, server freeze on client unbind !

Solution:
Remove synchronized from all method that reach changeState, there is no need to 
serialize access for the stateProcessor field. All that is needed is memory 
barrier to make sure threads see the changed state processor immediately. The 
volatile keyword on this field in SMPPSessionContext and 
SMPPServerSessionContext accomplishes this.

This patch fixes the deadlock in my code and I guess in many people's as well. 
Please review and include this, and thanks for reading this long report :-).

Original issue reported on code.google.com by cpmdxe...@gmail.com on 27 Jun 2011 at 12:40

Attachments:

deadlock_fix.patch

GoogleCodeExporter commented 8 years ago

Hi,

I noticed the same deadlock in a unit test and it was due to a synchronous 
method call from a SessionStateListener registered to my sms receiver. 
I didn't apply the patch yet because I didn't really need a 
sessionStateListener.

The problem I have now is that in my JBoss-Spring environment (where I have 2 
message driven beans using respectively an sms sender-only and an sms 
receiver-only, both built on top of jsmpp). I am not sure whether the problem 
is conductible to this deadlock or if it's some kind of infinite loop in the 
SMPPSession. 
The result is that after more or less 15h of activity (not systematically, but 
it happens at least once per day) the server CPU goes to 100% and it stops 
logging. The last visible lines of log are always from the SMPPSession class:

2011-10-13 08:55:39,435 INFO  [org.jsmpp.session.SMPPSession] Thread-85954 
PDUReaderWorker stop
2011-10-13 08:55:39,468 INFO  [org.jsmpp.session.SMPPSession] WorkManager(2)-3 
Connected
2011-10-13 08:55:39,471 INFO  [org.jsmpp.session.SMPPSession] Thread-87055 
Starting PDUReaderWorker with processor degree:3 ...

Basically it happens during an operation of rebind (session.unbindAndClose(), 
session.connectAndBind()), which I do after every 5 minutes of idle time (idle 
time is computed from the last received message).

The part of the code that looks a bit "smelly" is the run() method in the 
PDUReaderWorker:

@Override
public void run() {
logger.info("Starting PDUReaderWorker with processor degree:{} ...", 
getPduProcessorDegree());
  while (isReadPdu()) {
    readPDU();
  }
  close();
  executorService.shutdown();
  logger.info("PDUReaderWorker stop");
}

In the readPDU() method, SocketTimeoutExceptions are caught without being 
logged, but the notifyNoActivity() method is called. In case the session is in 
a state != bound, nothing really happens and the readPDU() method is called 
again.
I don't know if that scenario is possible, but in case the session - for any 
reason - caused all the time a SocketTimeoutException and it's state was != 
bound we would have an infinite loop here. What do you think? Would that be 
possible?

Is actually anybody maintaining the code of this project anymore?

Cheers and thanks for any help!
Andrea

Original comment by rapow...@gmail.com on 14 Oct 2011 at 10:14

GoogleCodeExporter commented 8 years ago

The problem really boils down to that due to that every method in the 
AbstractSessionContext and its children are marked synchronized, it is not safe 
to call anything from the session object from another thread (including 
unbinding or getSessionState itself !).

This method of global locking is unnecessary. Try the patch, it is applicable 
to the library in general, not to my own specific use of it.

Original comment by cpmdxe...@gmail.com on 14 Oct 2011 at 11:12

GoogleCodeExporter commented 8 years ago

Ok I will try your patch and let you know if that worked! Thanks for the quick 
answer :)

Original comment by rapow...@gmail.com on 14 Oct 2011 at 11:14

GoogleCodeExporter commented 8 years ago

Hi, sorry was trying to apply the patch and I noticed that it doesn't refer to 
the latest release.. did you apply it to trunk? Is it a stable version?

Original comment by rapow...@gmail.com on 14 Oct 2011 at 11:55

GoogleCodeExporter commented 8 years ago

Apply to trunk (r267)

Original comment by cpmdxe...@gmail.com on 14 Oct 2011 at 3:41

GoogleCodeExporter commented 8 years ago

Hi,

I applied patch to trunk and deployed, but the problem does not seem to be 
gone..
As explained in my first message, we are restarting the 2 receivers (that 
connect to separate SMSCs) every 5 minutes (by calling 
session.unbindAndClose(), session.connectAndBind() on both). The 2 senders are 
up the whole time (no connection rebind there).
So, technically, on the log we are supposed to see, every 5 minutes, both 
receivers starting and stopping their EnquireLinkSender and PDUReaderWorker. 
So, something like:

Thread-210674   PDUReaderWorker stop
Thread-211781   Starting PDUReaderWorker with processor degree:3 ...
Thread-211783   Starting EnquireLinkSender
Thread-210676   EnquireLinkSender stop
Thread-210840   PDUReaderWorker stop
Thread-211962   Starting PDUReaderWorker with processor degree:3 ...
Thread-211963   Starting EnquireLinkSender
Thread-210843   EnquireLinkSender stop

I made a lookup within the log to check that every thread-id is visible twice 
(start and stop). What I noticed, in the last lines of log (before the server 
went to 100% CPU and stopped logging) is that only one PDUReaderWorker is 
stopped, while the other PDUReaderWorker and the 2 EnquireLinkSender are still 
running (or just didn't log the stop message, which I doubt).

Do you still think this is related to the deadlock you patched?

Cheers and thanks for the help
Andrea

Original comment by rapow...@gmail.com on 18 Oct 2011 at 9:10

GoogleCodeExporter commented 8 years ago

It might be another deadlock, or perhaps my patch was not enough. 

Are you unbinding absolutely every 5 minutes (ie: even when there is incoming 
traffic ?), if so, unbindAndClose may be deadlocking with something that is 
calling a method on the session object from within your attached 
ServerMessageReceiverListener.

Try to duplicate the issue by using SMPPSim and a high volume of deliver_sm to 
your application, and then keep stopping/starting the client, when it deadlocks 
look at the stack on all threads.

In my case, one thread was waiting on getSessionState (which is called from 
within my message receiver listener) and another was waiting on unbindAndClose 
(called from the main program thread trying to shutdown).

It has to happen at the right moment, it is reproducible 60% of the time.

Original comment by cpmdxe...@gmail.com on 18 Oct 2011 at 9:20

GoogleCodeExporter commented 8 years ago

I see what you mean.. The problem is that the NORMAL BEHAVIOR it would be that 
the receivers are stopped only if nothing is received in the last 5 minutes. So 
ideally the would almost never be stopped. 
However currently in the test environment there is no incoming activity yet, so 
we restart systematically every 5 minutes.

On the other side there is some sending activity going on, but the sender has 
its own connection and is bound in TX mode. Basically there is no transceiver 
in the system.

Original comment by rapow...@gmail.com on 18 Oct 2011 at 9:29

GoogleCodeExporter commented 8 years ago

I encounter the same problem too when using the library to implement a smpp 
server. The client cannot bind as receivers again after suddently stopped with 
the following logging
2012-07-12 09:44:15,058 [Thread-169] INFO  org.jsmpp.session.SMPPServerSession 
- EnquireLinkSender stop
2012-07-12 09:44:16,056 [main] INFO  jifu.client.Smmc - Wait for SMMC to 
connect and listening on port=5017 listenTimeout=10000
2012-07-12 09:44:16,056 [main] INFO  jifu.client.Smmc - Accepting SMMC 
connection for session 4e5e50ef
2012-07-12 09:44:16,056 [Thread-172] INFO  org.jsmpp.session.SMPPServerSession 
- Starting PDUReaderWorker with processor degree:10 ...
2012-07-12 09:44:16,057 [Thread-172] INFO  org.jsmpp.session.SMPPServerSession 
- PDUReaderWorker stop
2012-07-12 09:44:16,058 [main] ERROR jifu.client.Smmc - 
java.net.SocketException: Socket closed
2012-07-12 09:44:16,058 [Thread-171] INFO  org.jsmpp.session.SMPPServerSession 
- Starting EnquireLinkSender

Original comment by qiuji...@gmail.com on 13 Jul 2012 at 3:18

sungsilnom / jsmpp

Deadlock in getSessionState() - breaks usage as client tranceiver mode and as server #107