This bug can be reproduced with a massive burst of CONNECT messages from over 10K clients.
This is the processing flow of a CONNECT message.
Usually this process is not a problem, but if massive numbers of CONNECT messages are received, it becomes an issue.
Let's suppose that 1000 CONNECT messages are received. The received messages are put into SimpleMessaging.m_ringBuffer, and then SimpleMessaging.m_executor processes the messages in SimpleMessaging.m_ringBuffer one by one. If it takes 100 ms to process a message, 1000 CONNECT messages take 100 seconds to process, so in the worst case the last CONNECT message is processed after 100 seconds.
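To illustrate the arithmetic, here is a minimal sketch (a plain `BlockingQueue` and one consumer thread standing in for the Disruptor ring buffer and `m_executor`; the 100 ms processing cost is simulated with a sleep):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleConsumerLatencyDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for SimpleMessaging.m_ringBuffer: one queue, one consumer.
        BlockingQueue<Integer> ring = new LinkedBlockingQueue<>();
        final int messages = 1000;

        Thread executor = new Thread(() -> {   // stand-in for m_executor
            try {
                for (int done = 0; done < messages; done++) {
                    ring.take();
                    Thread.sleep(100);         // simulated 100 ms per CONNECT
                }
            } catch (InterruptedException ignored) { }
        });
        executor.start();

        long start = System.nanoTime();
        for (int i = 0; i < messages; i++) {
            ring.put(i);                       // burst of 1000 CONNECTs
        }
        executor.join();
        // Prints roughly 100 s: the last CONNECT waited behind 999 others.
        System.out.printf("drained %d messages in %.1f s%n",
                messages, (System.nanoTime() - start) / 1e9);
    }
}
```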
But if a ServerChannel sees no activity for 10 seconds, it is closed by MoquetteIdleTimoutHandler. (https://github.com/runafter/moquette/blob/master/broker/src/main/java/org/eclipse/moquette/server/netty/MoquetteIdleTimoutHandler.java#L31)
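For context, this is roughly how such a timeout is wired into a Netty pipeline (a sketch, not Moquette's actual acceptor code; the `IdleStateHandler` parameters and the initializer class name are assumptions):

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleStateHandler;
import org.eclipse.moquette.server.netty.MoquetteIdleTimoutHandler;

class BrokerChannelInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Emit an IdleStateEvent after 10 s without reads; the next handler
        // reacts to that event by closing the channel.
        ch.pipeline().addLast("idleStateHandler", new IdleStateHandler(10, 0, 0));
        ch.pipeline().addLast("idleEventHandler", new MoquetteIdleTimoutHandler());
    }
}
```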
Whether or not the ServerChannel has been closed, the CONNECT message is still processed by ProtocolProcessor.processConnect().
In my opinion, a CONNECT message whose ServerChannel is already closed does not need to be processed, but it is processed anyway.
This situation degrades performance.
So if CONNECT messages arrive at more than 100 per second and each takes 100 ms to process, the single thread can drain only about 10 messages per second, and only the connections whose CONNECT is reached within the 10-second idle timeout succeed: about 112 messages are processed successfully and the others fail.
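A minimal sketch of the kind of guard this suggests: drop a queued CONNECT whose channel is already gone before paying the 100 ms processing cost. `ConnectGuard` and `shouldSkip` are hypothetical names, and it assumes the processor can reach the underlying Netty `Channel`:

```java
import io.netty.channel.Channel;

final class ConnectGuard {
    /**
     * Returns true when a queued CONNECT should be dropped because its
     * channel was already closed (e.g. by MoquetteIdleTimoutHandler)
     * while the message waited in the ring buffer.
     */
    static boolean shouldSkip(Channel channel) {
        return channel == null || !channel.isActive();
    }
}
```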
Hi @runafter your analysis is correct, but we should discuss the fix a little. This issue is also related to #102 and is due to an architectural flaw (in my opinion): the ring buffer does a great job of keeping the ProtocolProcessor single threaded and simple to hack on, without concerns about concurrent access to common data structures, but it produces these latency-related bugs. I think I have to rework this aspect of the broker.
Hi @runafter I changed the way Moquette works, so there is no longer a ring buffer and a single-threaded processor. This means that latency errors shouldn't happen anymore and this bug is fixed. Andrea
As far as I know, LMAX Disruptor was built to address the latency of inter-thread communication, but now you remove it because it seems to produce latency-related bugs. So I'm a bit confused... Anyway you are the master, therefore: 1) I will redo all our stress tests to evaluate the new implementation 2) maybe you need to change the main-page description of Moquette ;-)
Hi Diego, you are right, Disruptor is for low-latency inter-thread communication, but the structure (a single-threaded ProtocolProcessor) was limiting the threading. In front we have Netty, which uses 1 thread per core and never blocks, so it's time to move Moquette to use the full power of Netty. Now the protocol processor is executed in Netty's threads and uses persistent data structures (the subscription tree is always copied upon modification). In this picture the ring buffer was just a removable step (I like that LMAX stuff). That's the motivation. Let me know of new bugs :-) Andrea
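As an illustration of the "copied upon modification" idea (a minimal sketch, not Moquette's actual subscription tree; the class and method names here are made up):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

final class CopyOnWriteSubscriptions {
    // Readers on Netty threads see an immutable snapshot; no locks needed.
    private final AtomicReference<Set<String>> snapshot =
            new AtomicReference<>(new TreeSet<>());

    void subscribe(String topicFilter) {
        // Each modification builds a fresh copy and swaps it in atomically.
        snapshot.updateAndGet(current -> {
            Set<String> copy = new TreeSet<>(current);
            copy.add(topicFilter);
            return copy;
        });
    }

    Set<String> matchSnapshot() {
        return snapshot.get();   // lock-free read, never blocks a Netty thread
    }
}
```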
Hi @andsel Thanks for your work. These changes are expected to improve performance for processing MQTT connections.
When the server is too busy, some ConnectMessages cannot be processed until Constants.DEFAULT_CONNECT_TIMEOUT (10 seconds) elapses. In this case, IdleStateHandler and MoquetteIdleTimoutHandler close the channel. But ProtocolProcessor.processConnect() still processes all CONNECT messages, whether the channel is still connected or not.
Here are my logs and test code.
The whole test code is here: https://github.com/runafter/moquette/commit/2f510ed61167b0e4573147345360afea37403d91