google-code-export / red5

Automatically exported from code.google.com/p/red5

Mina architecture with RTMP can lead to deadlock. #164

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
In certain situations, the locking architecture used by Red5 can lead to 
deadlock (a minimal runnable sketch of the cycle follows the steps):

1. Client A connects and BaseConnection.connect acquires A's connection lock.
2. Client B connects and BaseConnection.connect acquires B's connection lock.
3. The connect handler for A wants to tell B that A is coming online, so it 
writes to B's connection.
4. The connect handler for B wants to tell A that B is coming online, so it 
writes to A's connection.
5. Through the Mina filter chain, A's handler calls 
ProtocolCodecFilter.filterWrite, which calls into RTMPMinaProtocolEncoder.encode 
and requests the lock on B's connection.
6. A's handler waits for the lock held by B.
7. Through the Mina filter chain, B's handler calls 
ProtocolCodecFilter.filterWrite, which calls into RTMPMinaProtocolEncoder.encode 
and requests the lock on A's connection.
8. B's handler waits for the lock held by A.

--- DEADLOCK ---
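
To make the cycle concrete, here is a minimal, self-contained sketch using 
plain ReentrantReadWriteLocks as stand-ins for the two connection locks (this 
is illustrative only, not Red5 code). Run it and both threads park forever, 
exactly as in the trace below:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class ConnectionLockDeadlock {
        // Stand-ins for the per-connection locks of clients A and B.
        static final ReentrantReadWriteLock connA = new ReentrantReadWriteLock();
        static final ReentrantReadWriteLock connB = new ReentrantReadWriteLock();

        // Models a connect handler: it holds its own connection lock
        // (steps 1-2), then tries to take the peer's lock in order to
        // write to that connection (steps 5 and 7).
        static Thread handler(final ReentrantReadWriteLock own,
                              final ReentrantReadWriteLock peer) {
            return new Thread(new Runnable() {
                public void run() {
                    own.writeLock().lock();
                    try {
                        try { Thread.sleep(100); } catch (InterruptedException e) { }
                        peer.writeLock().lock();   // parks forever (steps 6 and 8)
                        peer.writeLock().unlock();
                    } finally {
                        own.writeLock().unlock();
                    }
                }
            });
        }

        public static void main(String[] args) {
            handler(connA, connB).start();  // A's connect handler
            handler(connB, connA).start();  // B's connect handler
        }
    }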

Example stack trace of deadlocked threads:

"pool-8-thread-197" - Thread t@474
   java.lang.Thread.State: WAITING
    at sun.misc.Unsafe.park(Native Method)
    - waiting to lock <7e521182> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by "pool-8-thread-118" t@394
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
    at org.red5.server.net.rtmp.codec.RTMPMinaProtocolEncoder.encode(RTMPMinaProtocolEncoder.java:50)
    at org.apache.mina.filter.codec.ProtocolCodecFilter.filterWrite(ProtocolCodecFilter.java:322)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callPreviousFilterWrite(DefaultIoFilterChain.java:509)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1400(DefaultIoFilterChain.java:46)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.filterWrite(DefaultIoFilterChain.java:808)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.filterWrite(DefaultIoFilterChain.java:734)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callPreviousFilterWrite(DefaultIoFilterChain.java:509)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireFilterWrite(DefaultIoFilterChain.java:501)
    at org.apache.mina.core.session.AbstractIoSession.write(AbstractIoSession.java:490)
    at org.apache.mina.core.session.AbstractIoSession.write(AbstractIoSession.java:435)
    at org.red5.server.net.rtmp.RTMPMinaConnection.write(RTMPMinaConnection.java:273)
    at org.red5.server.net.rtmp.Channel.write(Channel.java:121)
    at org.red5.server.net.rtmp.Channel.write(Channel.java:104)
    at org.red5.server.net.rtmp.RTMPConnection.invoke(RTMPConnection.java:809)
    at org.red5.server.net.rtmp.RTMPConnection.invoke(RTMPConnection.java:778)
    at org.red5.server.net.rtmp.RTMPConnection.invoke(RTMPConnection.java:833)
    at org.wyki.networking.starburst.UserContext.flushMessages(UserContext.java:478)
    - locked <3dfbae98> (a org.wyki.networking.starburst.UserContext)
    at org.wyki.networking.starburst.UserContext.send(UserContext.java:369)
    - locked <3dfbae98> (a org.wyki.networking.starburst.UserContext)
    at org.wyki.networking.starburst.EventRoutingManager.notifyLocalListenersOnPath(EventRoutingManager.java:608)
    at org.wyki.networking.starburst.EventRoutingManager.notifyLocalListeners(EventRoutingManager.java:532)
    at org.wyki.networking.starburst.EventRoutingManager.notifyEventListeners(EventRoutingManager.java:253)
    at org.wyki.networking.starburst.Starburst.notifyEvent(Starburst.java:197)
    at com.fightmymonster.game.GameConnectionWatcher.onConnect(GameConnectionWatcher.java:16)
    at org.wyki.networking.starburst.Starburst.onRed5Connect(Starburst.java:446)
    at org.wyki.networking.starburst.StarburstRed5App.connect(StarburstRed5App.java:135)
    at com.fightmymonster.Application.connect(Application.java:355)
    at org.red5.server.Scope.connect(Scope.java:335)
    at org.red5.server.BaseConnection.connect(BaseConnection.java:342)
    at org.red5.server.net.rtmp.RTMPConnection.connect(RTMPConnection.java:261)
    at org.red5.server.net.rtmp.RTMPMinaConnection.connect(RTMPMinaConnection.java:121)
    at org.red5.server.BaseConnection.connect(BaseConnection.java:315)
    at org.wyki.networking.starburst.ServiceRmiHandler.doChangeService(ServiceRmiHandler.java:377)
    at org.wyki.networking.starburst.AuthenticateOperation.doWork(AuthenticateOperation.java:134)
    at org.wyki.networking.starburst.SyncRmiOperation.run(SyncRmiOperation.java:50)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - locked <2d1a06c> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

    - locked <624f75c6> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

QUICK HEADS UP ON ANOTHER ISSUE REVEALED BY THIS: in my application, once a 
couple of threads have deadlocked, the problem ripples out, with more and more 
threads deadlocking and progressive degradation of service. The deadlocked 
threads also reveal another serious issue: the system begins leaking socket 
file descriptors at a tremendous rate. Twenty minutes into a deadlock 
situation, Red5 has leaked a few thousand file descriptors. Presumably the 
deadlocked connections are being timed out in a way that leaks the fd while 
RTMP clients keep trying to reconnect, but that is for another issue (hints 
appreciated if anyone has ideas). The end result is total meltdown: a race to 
the bottom between file-descriptor-related I/O failures and thread pool 
exhaustion.

I need a quick fix for this, though I can see the complexities discussed in 
https://issues.apache.org/jira/browse/DIRMINA-845. Notwithstanding those:

I am thinking that in RTMPMinaProtocolEncoder we could simply lock on the 
ProtocolEncoderOutput out object, and that with only minor modifications to 
RTMPConnection this might suffice to serialize encoding, based on an 
examination of what goes on inside encodePacket and dropMessage. I can also 
see that a lot of the logic in there deals specifically with video or audio, 
and I suspect the write method call into the connection would not need the 
same level of synchronization.
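
Roughly, what I have in mind is something like this sketch (assuming the 
Mina 2.x codec API; the class and its body are illustrative, not the actual 
Red5 encoder):

    import org.apache.mina.core.buffer.IoBuffer;
    import org.apache.mina.core.session.IoSession;
    import org.apache.mina.filter.codec.ProtocolEncoderAdapter;
    import org.apache.mina.filter.codec.ProtocolEncoderOutput;

    // Sketch only, not the real Red5 encoder: serialize on the per-session
    // ProtocolEncoderOutput instead of the connection's read/write lock, so
    // encoding for A never blocks waiting on B's connection lock.
    public class SerializedEncoderSketch extends ProtocolEncoderAdapter {
        public void encode(IoSession session, Object message,
                ProtocolEncoderOutput out) throws Exception {
            synchronized (out) {
                // The real encoder would run encodePacket here; a raw byte
                // payload stands in for the encoded RTMP packet.
                out.write(IoBuffer.wrap(message.toString().getBytes()));
            }
        }
    }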

Ideas appreciated.

Original issue reported on code.google.com by dwilli...@system7.co.uk on 15 Dec 2011 at 12:40

GoogleCodeExporter commented 9 years ago
OK, the problem is wider than first anticipated. Looking into this, it seems 
Red5 application and system deadlocks can happen in numerous ways given the 
current architecture... if someone can confirm this, I'd be grateful.

For example, suppose an application publishes a server method, say 
ChatService.sendMessage, whose implementation makes a call on another user's 
connection, as sketched below. Should both users send each other a message at 
exactly the same moment, you deadlock.
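
Here is a hypothetical sketch of such a service (the registry and method 
names are invented for the example; only IServiceCapableConnection.invoke is 
the real Red5 API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.red5.server.api.IConnection;
    import org.red5.server.api.service.IServiceCapableConnection;

    // Hypothetical sketch of the hazardous pattern. Red5 dispatches
    // sendMessage on a thread that already holds the *caller's*
    // connection lock.
    public class ChatService {
        // Invented registry mapping user ids to live connections.
        private final Map<String, IConnection> online =
                new ConcurrentHashMap<String, IConnection>();

        public void register(String userId, IConnection conn) {
            online.put(userId, conn);
        }

        public void sendMessage(String toUserId, String text) {
            IConnection recipient = online.get(toUserId);
            if (recipient instanceof IServiceCapableConnection) {
                // Encoding this invoke takes the recipient's connection lock.
                // If the recipient is simultaneously inside the mirror-image
                // call, each thread holds its own lock and waits on the
                // other's: the cycle from the original report.
                ((IServiceCapableConnection) recipient)
                        .invoke("receiveMessage", new Object[] { text });
            }
        }
    }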

In fact, it seems that any system in which input from user A can drive output 
to user B, and input from user B can drive output to user A, can deadlock (!)

Here are the calls where Red5 acquires a lock on the connection object and 
can therefore participate in a deadlock:
RTMPMinaProtocolDecoder.decode
RTMPMinaProtocolEncoder.encode
RTMPMinaIoHandler.rawBufferReceived
RTMPConnection.open, RTMPConnection.close
BaseConnection.connect, BaseConnection.close

And to be clear, this is not a hypothetical problem: we now understand it to 
be behind the unresolved system issues that have plagued us in production 
this year.

If the analysis is correct, the entire Red5 locking and concurrency model needs 
to be urgently overhauled. 

I believe this is likely the root cause of the majority of the mysterious 
lockups and leaks reported (leaks, because Red5 often starts leaking sockets 
at an incredible rate once connection threads deadlock with each other).

This is currently an urgent threat facing all Red5 installations. Fixing it 
will take Red5 forward a huge leap, but for now it is a critical flaw.

Original comment by dwilli...@system7.co.uk on 15 Dec 2011 at 2:06

GoogleCodeExporter commented 9 years ago

Original comment by mondain on 15 Dec 2011 at 5:21

GoogleCodeExporter commented 9 years ago

Original comment by mondain on 15 Dec 2011 at 9:36

GoogleCodeExporter commented 9 years ago
Probably worth mentioning: a way around this is to process RMI calls that might 
cause output to another connected client asynchronously. That is, you receive 
the parameter input, and pass that to a runnable operation object passed to an 
execution pool. In our project we have an asynchronous RMI system that sits on 
top of the standard Red5 for other reasons (for example, so call results can be 
returned as they are calculated, rather than invocation order) and for this 
reason we did not hit this deadlock bug too often. However, certain calls did 
not run through the asynchronous system for speed purposes, in particular 
instant chat messages, as per the example above, which we have now pushed to 
our asynchronous mechanism. It is imperative that if your RMI calls can, 
through whatever call chain, cause output to another connection, this is moved 
to such a mechanism or you risk deadlock.
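
As a rough illustration of the shape of such a mechanism (hypothetical names 
throughout; this is not our actual asynchronous RMI layer):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.red5.server.api.IConnection;
    import org.red5.server.api.service.IServiceCapableConnection;

    // Hypothetical shape of the workaround: the RMI thread only captures
    // parameters; the outbound write happens on a pool thread that holds
    // no connection lock of its own.
    public class AsyncChatService {
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        public void sendMessage(final IConnection recipient, final String text) {
            pool.execute(new Runnable() {
                public void run() {
                    // Taking the recipient's connection lock here cannot form
                    // a cycle, because this thread holds no other connection
                    // lock while it waits.
                    if (recipient instanceof IServiceCapableConnection) {
                        ((IServiceCapableConnection) recipient)
                                .invoke("receiveMessage", new Object[] { text });
                    }
                }
            });
        }
    }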

Original comment by dwilli...@system7.co.uk on 18 Feb 2012 at 11:28

GoogleCodeExporter commented 9 years ago
I have reworked the thread and Mina protocol handling in revision 4325. I 
believe the issues noted here should now be resolved.

Original comment by mondain on 26 Apr 2012 at 8:21

GoogleCodeExporter commented 9 years ago
Thank you very much for working on this.

Original comment by m...@paradisesoftware.net on 26 Apr 2012 at 10:13

GoogleCodeExporter commented 9 years ago
Sounds good. Really looking forward to checking it out!

Original comment by dwilli...@system7.co.uk on 27 Apr 2012 at 10:43