mrniko / netty-socketio

Socket.IO server implemented in Java. Realtime Java framework
Apache License 2.0

Polling clients sporadically get prematurely disconnected #264

Closed lpage30 closed 8 years ago

lpage30 commented 9 years ago

When running under load with ~500+ polling clients, we are seeing their connections being disconnected from the server side, causing a bloat of TIME_WAIT connections on the server. Under certain circumstances this bloat can exceed the configured limit on the back-end system and cause a failure.

After some investigation and 'stack dumps' we found a common stack trace (see below). Basically the EncoderHandler, in sendMessage, issues a writeAndFlush().addListener(ChannelFutureListener.CLOSE), which fires and closes the channel.

We have subclassed EncoderHandler, copied all of its behavior EXCEPT this addListener() call, and have run without the close issue; the channel is getting reused. Further investigation and testing are needed, but this appears to be a bug: the client should initiate the close, which would prevent the TIME_WAIT bloat and allow the channel to be reused.
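For illustration, a minimal sketch of the change described above (the class and method names below are hypothetical, not the library's actual code; the Netty calls mirror the pattern visible in the stack trace):

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import io.netty.handler.codec.http.HttpResponse;
import io.netty.handler.codec.http.LastHttpContent;

// Hypothetical sketch of the workaround described above.
final class KeepAliveWriter {

    void sendMessage(HttpResponse res, Channel channel, ByteBuf out) {
        channel.write(res);
        channel.write(out);

        // Original behavior: close the channel once the flush completes,
        // leaving a server-side TIME_WAIT socket for every polling response.
        // channel.writeAndFlush(LastHttpContent.EMPTY_LAST_CONTENT)
        //        .addListener(ChannelFutureListener.CLOSE);

        // Workaround: flush without the CLOSE listener so the client initiates
        // the close (or the connection is reused for the next poll).
        channel.writeAndFlush(LastHttpContent.EMPTY_LAST_CONTENT);
    }
}
```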

Please investigate on your end to verify this was your specific intent at line 162 in EncoderHandler.java.

Stack trace captured at NioSocketChannel in netty, but stems from EncoderHandler.handleHTTP

Thu Aug 06 15:20:10 EDT 2015 javaChannel().close() java.nio.channels.SocketChannel[connected local=/127.0.0.1:12013 remote=/127.0.0.1:53769]
java.lang.Throwable
    at io.netty.channel.socket.nio.NioSocketChannel.dumpStack(NioSocketChannel.java:60)
    at io.netty.channel.socket.nio.NioSocketChannel.doClose(NioSocketChannel.java:237)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:558)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1098)
    at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:543)
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:528)
    at io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71)
    at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:543)
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:528)
    at io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71)
    at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:543)
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:528)
    at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:394)
    at io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:887)
    at io.netty.channel.AbstractChannel.close(AbstractChannel.java:181)
    at io.netty.channel.ChannelFutureListener$1.operationComplete(ChannelFutureListener.java:44)
    at io.netty.channel.ChannelFutureListener$1.operationComplete(ChannelFutureListener.java:41)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
    at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
    at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
    at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
    at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
    at com.corundumstudio.socketio.handler.EncoderHandler.sendMessage(EncoderHandler.java:162)
    at com.corundumstudio.socketio.handler.EncoderHandler.sendMessage(EncoderHandler.java:146)
    at com.corundumstudio.socketio.handler.EncoderHandler.handleHTTP(EncoderHandler.java:258)
    at com.corundumstudio.socketio.handler.EncoderHandler.write(EncoderHandler.java:197)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

lpage30 commented 9 years ago

We have found through wire analysis that the socket.io server does not support piggy-backed message parts for polling clients.

We see socket.io JS v1.3.5 polling clients sending messages where the underlying socket.io JS code piggy-backs a ping onto the same request, resulting in a basic payload of the form ":[app-message]1:2". The socket.io server's PollingTransport will only read and process the ":[app-message]" message and drop the piggy-backed "1:2" message (PollingTransport.java line 158, netty-socketio v1.7.7).

We are attempting to work around this by subclassing PollingTransport and pulling the code out of decoder.preprocessJson(jsonIndex, content) into a new method that returns a LIST of ByteBuf, parsing the content in a loop and stripping each message part into its own ByteBuf, and then issuing multiple ctx.pipeline().fireChannelRead(new PacketsMessage(client, content, Transport.POLLING)) calls, one per ByteBuf.

The side effect of not doing this is that the socket.io server never responds to PING messages; the client gives up, stops sending any further pings, eventually times out, and disconnects.
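For illustration, a sketch of that splitting approach (not our actual code; it assumes the engine.io "<length>:<packet>" framing for piggy-backed polling payloads, and the PacketsMessage constructor as quoted above; exact package and type names in v1.7.7 may differ):

```java
import java.util.ArrayList;
import java.util.List;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.util.CharsetUtil;

import com.corundumstudio.socketio.Transport;
import com.corundumstudio.socketio.handler.ClientHead;
import com.corundumstudio.socketio.messages.PacketsMessage;

// Hypothetical helper illustrating the splitting workaround described above.
final class PiggybackSplitter {

    // Split a "<length>:<packet><length>:<packet>..." payload into one ByteBuf per part.
    List<ByteBuf> split(ByteBuf content) {
        String payload = content.toString(CharsetUtil.UTF_8);
        List<ByteBuf> parts = new ArrayList<ByteBuf>();
        int i = 0;
        while (i < payload.length()) {
            int colon = payload.indexOf(':', i);
            if (colon < 0) {
                break; // malformed or trailing data: stop splitting
            }
            int len = Integer.parseInt(payload.substring(i, colon));
            String part = payload.substring(colon + 1, colon + 1 + len);
            parts.add(Unpooled.copiedBuffer(part, CharsetUtil.UTF_8));
            i = colon + 1 + len;
        }
        return parts;
    }

    // Fire one channelRead per piggy-backed part instead of one for the whole
    // payload, so piggy-backed PING parts are not silently dropped.
    void fireEachPart(ChannelHandlerContext ctx, ClientHead client, ByteBuf content) {
        for (ByteBuf part : split(content)) {
            ctx.pipeline().fireChannelRead(new PacketsMessage(client, part, Transport.POLLING));
        }
    }
}
```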

mrniko commented 9 years ago

Could you provide client code to test it?

lpage30 commented 9 years ago

I don't have a generic client for a generic server, but reproducing it with one JavaScript client is fairly easy.

My piggyback handling processes as many piggy-backed parts as are present until the end of the content, but I'm only seeing ~2 messages per payload at most.

ryandietrich commented 9 years ago

I'm having a similar issue. I wrote a quick client and pointed it at the netty-socketio-demo server (https://github.com/mrniko/netty-socketio-demo).

I see this on the server side (started with mvn exec:java as the docs stated).

[nioEventLoopGroup-3-11] ERROR com.corundumstudio.socketio.transport.PollingTransport - 97a8f14f-1168-4c1f-ba01-f8230683d024 is not registered. Closing connection

And on the client:

Error! [com.github.nkzawa.engineio.client.EngineIOException: xhr poll error]

The client code is available at: https://github.com/ryandietrich/socketIOStress

Fire up the demo server, then compile the code and run "run_polling"; notice the errors that spew out of the demo server. Then restart the demo server (it doesn't reuse the port, so your OS may take a second or two to release it before you can restart). Then run "run_websocket" and notice the lack of errors spewing out of the demo server (though depending on your ulimit you may run into thread limitations, as this particular client spawns a thread for every connection, ugh).

ryandietrich commented 9 years ago

I think this issue is exacerbated by the policy change from socket.io 0.9 to 1.0. Before, it had a downgrade policy; now it has an upgrade policy, meaning the JavaScript client starts with polling and upgrades if the server says it is capable of supporting websockets. To address this, my team added some client-side detection and specified the transport explicitly instead of letting the client start with polling.

lpage30 commented 9 years ago

We worked around the bug, and it is a bug. Very easy to fix, painful to work around.

ROOT CAUSE: PacketDecoder.preprocessJson()
1) strips out the 'd=' (good)
2) cleans up the '\n' (good)
3) but then, the bad part: parses the content so that only the 1st message in the payload is kept, and returns that to the calling class (PollingTransport.onPost())

FIX: do 1 and 2, but do NOT do 3. Instead of parsing the content and returning the parsed result, simply return the 'normalized' payload (normalized by steps 1 and 2).

The remainder of the code path handles piggy-backed messages nicely. Our workaround was painful because the bug is fairly deep in the framework, so for a simple fix like this we had to subclass everything from the SocketIOChannelInitializer down.
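A minimal sketch of the fix as described (not the library's actual code; the method shape follows the decoder.preprocessJson(jsonIndex, content) call quoted earlier, and the URL-decoding step is an assumption about how the JSONP payload arrives):

```java
import java.io.IOException;
import java.net.URLDecoder;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

// Hypothetical sketch of the fix described above.
final class PreprocessSketch {

    ByteBuf preprocessJson(Integer jsonIndex, ByteBuf content) throws IOException {
        String packet = URLDecoder.decode(content.toString(CharsetUtil.UTF_8), CharsetUtil.UTF_8.name());

        // 1) strip the JSONP 'd=' wrapper
        if (packet.startsWith("d=")) {
            packet = packet.substring(2);
        }

        // 2) clean up escaped newlines
        packet = packet.replace("\\n", "\n");

        // 3) intentionally NOT done: do not parse out just the first message part.
        //    Return the whole normalized payload so the rest of the pipeline can
        //    process every piggy-backed part.
        return Unpooled.copiedBuffer(packet, CharsetUtil.UTF_8);
    }
}
```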

lpage30 commented 9 years ago

BTW: Nkzawa does NOT support JSONP polling, only XHR, so this particular piggy-back issue we found is never hit by Nkzawa. We are using both Nkzawa and regular JavaScript clients and hit this issue only with the JavaScript clients. I would advise forcing websockets with Nkzawa clients; this can easily be done.
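For example, with the Nkzawa Java client the transport can be pinned to websocket via IO.Options. A sketch, with class and field names as I recall them from the com.github.nkzawa client and an illustrative server URL, so verify against your client version:

```java
import java.net.URISyntaxException;

import com.github.nkzawa.engineio.client.transports.WebSocket;
import com.github.nkzawa.socketio.client.IO;
import com.github.nkzawa.socketio.client.Socket;

// Sketch: force the Nkzawa client onto the websocket transport so it never
// starts on polling. Names are from memory and may differ per client version.
public final class ForceWebsocketClient {

    public static void main(String[] args) throws URISyntaxException {
        IO.Options opts = new IO.Options();
        opts.transports = new String[] { WebSocket.NAME }; // i.e. "websocket"

        // URL is illustrative; point it at your own server.
        Socket socket = IO.socket("http://localhost:9092", opts);
        socket.connect();
    }
}
```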

ryandietrich commented 9 years ago

So, is the test program I wrote hitting a different problem entirely? socketIOStress merely connects and disconnects very quickly. When you specify polling transports, problems pop out everywhere; with websockets, no issues are seen. Are you saying websockets also have problems?

lpage30 commented 9 years ago

We have been using two varieties of load-testing clients:
- Node.js using the stock latest socket.io JS client
- Java/Grinder using the Nkzawa client

In both cases we have used polling clients alongside websocket clients. We're talking 3,000 polling and 12,000 total (9,000 websocket). We took to using Node.js more for the polling clients because Nkzawa did not perform as well. BUT Nkzawa was working at around 300-500 polling clients, with some sporadic connection drops which we have since attributed to the client (Nkzawa) code, because we don't see the same issue in our JS client.

Taking the long way to answer your question: yes, it's a different problem entirely.

lpage30 commented 9 years ago

Oh, and to add to the rest of the setup:

mrniko commented 9 years ago

fixing the piggy-backed JSONP issue

It would be nice to see a pull request for this; could you open one, please?

pramodbms commented 8 years ago

I am also facing the same issue as @ryandietrich. When the client is using a browser that does not support websockets, it falls back to polling. It works for a few seconds and then I see the same messages pointed out by @ryandietrich:

Client Side: Error: xhr post error at XHR.Transport.onError

Server Side:
2015-12-02 19:54:08,241 ERROR nioEventLoopGroup-3-40 PollingTransport 1045fa15-8c1e-40d8-b1a1-a030eed32fb4 is not registered. Closing connection
2015-12-02 19:54:08,248 ERROR nioEventLoopGroup-3-41 PollingTransport 1045fa15-8c1e-40d8-b1a1-a030eed32fb4 is not registered. Closing connection

Is this issue fixed? I am on version "1.7.5".

mrniko commented 8 years ago

@pramodbms Could you try commenting out .addListener(ChannelFutureListener.CLOSE) at EncoderHandler:162 and report back whether it helps resolve the problem?