moscajs / mosca

MQTT broker as a module
mosca.io

Why there is a maxConnections set to 100k in server.js? #303

Closed cfan8 closed 9 years ago

cfan8 commented 9 years ago

Code here.

I'm trying to build a push system using mosca.js. While doing some concurrency tests, I found this constraint that limits the total number of connections a mosca server can hold to 100k. I changed it to a larger number and saw no problems when the concurrent connections reached 250k on the server side. May I ask what the purpose of this number is, and will there be any side effects if I change it?

behrad commented 9 years ago

I hadn't seen that, @mcollina. However, I was able to reach 240K client connections on a scaled-up mosca using cluster; that means each of my node.js processes was below 100K. Have you reached 250K in a single process, @cfan8?

Can we share more on this? Which client lib? What keep-alive value? What were the server CPU/RAM when 250K were connected?

You can check https://github.com/mcollina/mosca/issues/263 and I can share more here.

cfan8 commented 9 years ago

@behrad Single process with MQTT.js. The key is to slow down the heartbeat (which I set to 4 minutes) and the rate of connection creation (~500 clients/second). ~8GB RAM, little CPU when idle + 100% CPU for 10s/broadcasting every client.

behrad commented 9 years ago

Single process with MQTT.js. The key is to slow down the heartbeat (which I set to 4 minutes) and the rate of connection creation (~500 clients/second). ~8GB RAM, little CPU when idle

Very nice, @cfan8. Heartbeat was 2 min in my case. Since I was using clustered processes, our memory was higher, around 12G if I remember right, for 240K on a 12-core machine (12 processes), with around 2-3% CPU usage.

100% CPU for 10s/broadcasting every client

I didn't get what you mean by 10s/broadcasting!?

Have you tested different publish rates under such a load: a 1-to-1 publish at QoS 0/1 per second in total? The response to a 1-to-n publish at QoS 0/1 with a very large n?

cfan8 commented 9 years ago

@behrad My case is only a rough test: each client subscribes to the same channel, and broadcasting means sending one message to this single channel. The CPU stays at 100% until all messages are sent. My case is QoS 0; I believe the busy time would double with QoS 1. The reconnect timeout should be extended to keep reconnect requests from filling up the event queue. Reducing the r/w buffer sizes used by the Linux kernel can reduce the memory use.
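
For context, the broker side of such a broadcast test looks roughly like this (a minimal sketch; 'broadcast' is a hypothetical topic name, server.publish per the mosca docs):

var mosca = require('mosca');

var server = new mosca.Server({ port: 2883 });

server.on('ready', function () {
    console.log('broker ready');
});

// Every test client subscribes to the same topic, so one publish fans
// out to all connected clients; QoS 0 means fire-and-forget, which is
// why the CPU pegs at 100% only until the fan-out finishes.
function broadcast() {
    server.publish({
        topic: 'broadcast',
        payload: 'ping',
        qos: 0,
        retain: false
    }, function () {
        console.log('broadcast queued');
    });
}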

My concern is that I observed growing memory usage. I'm not quite sure whether this is caused by V8's memory management strategy, or whether there are leaks somewhere in the code.

behrad commented 9 years ago

My case is only a rough test: each client subscribes to the same channel, and broadcasting means sending one message to this single channel.

That is one of our cases as well.

Reducing the r/w buffer sizes used by the Linux kernel can reduce the memory use.

Can you provide us with the exact parameters and values? You can check mine in the link above.

My concern is that I observed growing memory usage. I'm not quite sure whether this is caused by V8's memory management strategy, or whether there are leaks somewhere in the code

Same here. I haven't run a long-running test to see if memory grows forever. Are you manipulating V8 garbage collector options like use_idle_notification, or turning it off?
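
For example, something like this (old V8 flag spellings from the node 0.x era; broker.js is a placeholder name):

node --nouse-idle-notification --max-old-space-size=8192 broker.js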

I also saw a periodic tick in the number of clients reported by mosca, e.g. every 5 minutes a bunch of clients got disconnected and then reconnected!

How are you organizing your clients, @cfan8? I spawned 4 MQTT.js processes on each machine, up to 60K connections. Have you been able to produce more connections on a single client machine?

cfan8 commented 9 years ago

net.ipv4.tcp_mem = 786432 1697152 1945728
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216

This is the config I'm using (set in /etc/sysctl.conf and applied with sysctl -p); there are plenty of articles talking about this on the internet.

I did nothing to the V8 GC, except that I increased the max-old-space-size to let V8 use more memory than the default ~1.4G.

Have you been able to produce more connections on a single client machine?

Yeah, just wait for the 4*60K connections to be established; when the CPU drops under 10%, you can start new client processes.

I also saw a periodic tick in the number of clients reported by mosca, e.g. every 5 minutes a bunch of clients got disconnected and then reconnected!

Same here, I'm not quite sure what's going wrong. Slowing down the connection rate and reducing the number of occupied ports per IP on the client side seems to alleviate it, but with no guarantee.

behrad commented 9 years ago

This is the one I'm using

About the same config here.

I did nothing to the V8 GC, except that I increased the max-old-space-size to let V8 use more memory.

I forgot to say that I'm running on the latest io.js; it showed performance improvements. Go for it if you are still on node.js.

Yeah, just wait for the 4*60K connections to be established; when the CPU drops under 10%, you can start new client processes.

How are you able to issue 240K connections from a client machine? You must be setting up 4 IP addresses to do that. I was unable to do the same; I got stuck at 60K connections with each of my clients.

Same here, I'm not quite sure what's going wrong. Slowing down the connection rate and reducing the number of occupied ports per IP on the client side seems to alleviate it, but with no guarantee.

@mcollina can comment on this. My suspicion is V8's GC, which can freeze mosca for seconds; that would cause many clients to hit keep-alive timeouts, but after a second they all reconnect. I have no proof of this yet. You can test and see whether this still happens with the GC turned off.

mcollina commented 9 years ago

@behrad you are close. Mosca has a high churn of objects and functions: it allocates a lot of objects and functions just to operate. This is exceptionally bad, but I did not know better when I originally wrote Mosca. I am planning a future release that rewrites Mosca on top of Aedes, which does not have this problem (I know better now).

As far as I know there are no memory leaks, and memory returns to normal once the load decreases. Node 0.10 is strongly discouraged, because its GC is much more delicate and can crash the process under massive load. As @behrad said, io.js is what should be used.

Anyway, the performance numbers you are getting are impressive, folks!

behrad commented 9 years ago

it allocates a lot of objects and functions just to operate. This is exceptionally bad, but I did not know better when I originally wrote Mosca. I am planning a future release that rewrites Mosca on top of Aedes, which does not have this problem (I know better now)

Can you share any concrete examples of these problematic node.js idioms, @mcollina? I'd love to learn, and even to contribute to Aedes if I have enough time.

behrad commented 9 years ago

And are you OK with removing that maxConnections limit @cfan8 has found!?

mcollina commented 9 years ago

I'm OK with removing the maxConnections limit.

If you want to reduce (or probably eliminate) the keepalive issue, we should move Mosca to use retimer. I wrote it for Aedes, but it can be easily backported.
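
For reference, a rough sketch of what a retimer-based keepalive could look like (illustrative names only, e.g. client.close(); this is not mosca's actual code):

var retimer = require('retimer');

// Disconnect the client if no packet arrives within 1.5x the keepalive
// interval (the grace window the MQTT spec allows).
function startKeepalive(client, keepaliveSeconds) {
    client.keepaliveMs = keepaliveSeconds * 1000 * 1.5;
    client.keepaliveTimer = retimer(function () {
        client.close(); // hypothetical close method on the client
    }, client.keepaliveMs);
}

// Call on every received packet: rescheduling a retimer is much cheaper
// than a clearTimeout + setTimeout pair on each of 250K busy connections.
function packetReceived(client) {
    client.keepaliveTimer.reschedule(client.keepaliveMs);
}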

As for those idioms: this block allocates a lot of functions: https://github.com/mcollina/mosca/blob/master/lib/client.js#L85-L138 while this one does not: https://github.com/mcollina/aedes/blob/master/lib/handlers/index.js#L22-L56
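
To make the difference concrete, here is a minimal sketch of the two styles (hypothetical names, not mosca's or Aedes' actual code):

var EventEmitter = require('events').EventEmitter;

// Churn-heavy style: every connection allocates fresh closures for its
// handlers, so hundreds of thousands of clients mean hundreds of
// thousands of short-lived function objects for the GC to track.
function setupChurny(client) {
    client.on('publish', function (packet) {
        client.broker.forward(packet); // closure captures `client`
    });
}

// Low-churn style: the handler is defined once at module scope and the
// client is recovered from `this` (the emitter), so no new function
// object is allocated per connection.
function onPublish(packet) {
    this.broker.forward(packet);
}
function setupShared(client) {
    client.on('publish', onPublish);
}

// Tiny demo: both styles behave identically; only allocation differs.
var client = new EventEmitter();
client.broker = { forward: function (p) { console.log('forwarded', p.topic); } };
setupShared(client);
client.emit('publish', { topic: 'a/b', payload: 'hi' });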

behrad commented 9 years ago

:+1:

cfan8 commented 9 years ago

How are you able to issue 240K connections from a client machine? You must be setting up 4 IP addresses to do that. I was unable to do the same; I got stuck at 60K connections with each of my clients.

@behrad Definitely. Build a small intranet, set up virtual interfaces, and make sure your connection to the server does not go through a NAT.

Mosca has a high churn of objects and functions: it allocates a lot of objects and functions just to operate. This is exceptionally bad, but I did not know better when I originally wrote Mosca. I am planning a future release that rewrites Mosca on top of Aedes, which does not have this problem (I know better now).

@mcollina The same thing happens when I put mosca behind a reverse proxy, e.g. HAProxy: the phenomenon still exists there, so GC alone cannot explain it. I haven't dug into mosca's code much. If the keepalive mechanism is built on a send/ack model, then extending the timeout between send and ack may help with this issue. I forgot, though: that is a client-side feature.

cfan8 commented 9 years ago

In a recent PR this limit has been raised to 10M; so far, so good. I think this issue can be closed.

mcollina commented 9 years ago

Great! :)

behrad commented 9 years ago

I created 10 virtual interfaces with 10 IPs on my CentOS 7 guest machine, but couldn't yet generate more than 60K connections, @cfan8. Can you show me a snippet of how you achieved this with MQTT.js!?

cfan8 commented 9 years ago

@behrad This is the file I'm using.

var mqtt = require('mqtt');
var net = require('net');

var msgCount = 0;
var clientCount = 0;
var reconnect = 0;
var closeCount = 0;
var errorCount = 0;
var serverport = 2883;
var serverip = '192.168.0.100';
var clientip = process.argv[2]; // suffix appended to 192.168.0.1, e.g. 0 -> 192.168.0.10

function randomIntInc(low, high) {
    return Math.floor(Math.random() * (high - low + 1) + low);
}

for (var i = 0; i < 60000; i++) {
    // Spread the connection attempts randomly over ~2 minutes so the
    // broker is not hit by all 60K connects at once.
    setTimeout(function () {
        var client = mqtt.Client(
            function () {
                // Custom stream builder: bind each connection to a specific
                // local IP so one machine can go beyond the ~60K
                // ephemeral-port limit of a single IP.
                return net.connect({host: serverip, port: serverport, localAddress: '192.168.0.1' + clientip});
            },
            {
                reconnectPeriod: 30 * 1000, // slow reconnects to avoid a stampede
                keepalive: 4 * 60           // 4-minute heartbeat
            });

        // 'connect' also fires after a successful reconnect, so it is the
        // only place the counter is incremented ('close' decrements it).
        client.on('connect', function () {
            clientCount++;
        });

        client.on('message', function (topic, message) {
            msgCount++;
        });

        client.on('reconnect', function () {
            reconnect++;
        });

        client.on('close', function () {
            closeCount++;
            clientCount--;
        });

        client.on('error', function (err) {
            errorCount++;
            console.log(err);
        });
    }, randomIntInc(1, 1000 * 120));
}

setInterval(function () {
    console.log('MQTT Connected:' + clientCount + ', ReConnect:' + reconnect + ', Close:' + closeCount + ', ERROR:' + errorCount + ', Received:' + msgCount + '.');
}, 1000);

And use this to start it.

node --max-old-space-size=8192 clienttester.js 0

My client IPs start from 192.168.0.10 (the argument selects the last-octet suffix).

mcollina commented 9 years ago

Nice one! Why don't you use the stock mqtt.connect() call instead of creating a Client directly? Is that the source of the issues?

Would you like to contribute it to MQTT.js? I think that can be quite useful for a lot of people.

behrad commented 9 years ago

Thank you @cfan8, almost the same here. However, when I checked randomIntInc I found that my clients were connecting too fast.

behrad commented 9 years ago

Nice one! Why don't you use the stock mqtt.connect() call instead of creating a Client directly? Is that the source of the issues?

MQTT.js doesn't currently accept localAddress (https://github.com/mqttjs/MQTT.js/issues/298); I will create a PR for this.
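
Something like this is what it would enable (hypothetical until that PR is merged; the option name follows issue #298):

var mqtt = require('mqtt');

// Bind this client to a specific local IP directly, instead of
// hand-building the TCP stream as in the test script above.
var client = mqtt.connect('mqtt://192.168.0.100:2883', {
    localAddress: '192.168.0.11',
    keepalive: 4 * 60,
    reconnectPeriod: 30 * 1000
});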

mcollina commented 9 years ago

@behrad thanks! :)

behrad commented 9 years ago

@cfan8 I'm still not able to produce more connections in my VM; I've also increased local_port_range...

cfan8 commented 9 years ago

@behrad Maybe a VM is not powerful enough compared to my desktop; mine is an i5-4590. I think you should slow down the connection rate further and monitor the CPU usage. By the way, I am looking for a good MQTT Java client to use on Android. I saw your comments on some projects. Have you made a decision yet?

behrad commented 9 years ago

Maybe a VM is not powerful enough compared to my desktop; mine is an i5-4590. I think you should slow down the connection rate further and monitor the CPU usage

The VM has 4 x Xeon E5-2690 3GHz + 4G RAM, so I don't think it's a HW issue; after exhausting my local_port_range I can't issue further client connections from that VM. I'm also currently having a memory problem with my mosca broker when it is not started in cluster mode.

By the way, I am looking for a good MQTT Java client to use on Android. I saw your comments on some projects. Have you made a decision yet?

We found two choices:

  1. Eclipse Paho client (reliable, but you have to build reconnect and the like on top of it)
  2. Fusesource MQTT client, which implements reconnect. We are using it for now (Android > 4); some argue that it has problems, but we are OK with it currently.

cfan8 commented 9 years ago

@behrad I'm not sure whether I stated this clearly. Of course you can have at most ~60K connections from a single IP. You can assign multiple IPs to one machine and start several test instances, each bound to a different IP. Furthermore, make sure there is no NAT on your link to the server.
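
For example, reusing the start command above, one process per bound IP:

node --max-old-space-size=8192 clienttester.js 0
node --max-old-space-size=8192 clienttester.js 1
node --max-old-space-size=8192 clienttester.js 2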

We are using it for now (Android > 4); some argue that it has problems, but we are OK with it currently.

What about WAKE_LOCK on Android? Is your client able to receive messages and send notifications when the device goes to sleep?

behrad commented 9 years ago

Yes, I knew this. I am starting new script instances with different IPs... both machines (client and server) are on the same local network.

What about WAKE_LOCK on Android? Is your client able to receive messages and send notifications when the device goes to sleep?

Yes, you have to handle it yourself. The Fusesource MQTT client knows nothing about Android, and I think neither does Paho know about wake locks :)

cfan8 commented 9 years ago

@behrad I'm using Paho and found a serious problem in their design: callback leaks. My callback is not guaranteed to be called every time.

behrad commented 9 years ago

@cfan8 Seriously? Paho is presumably used in many mobile projects... However, mqtt-client by Fusesource is also used in ActiveMQ and some others... I've also faced a connection problem with SSL on Android > 5.0 using mqtt-client; I'll have to debug it myself :(

cfan8 commented 9 years ago

@behrad Yes. The problem is that they throw exceptions everywhere and then try to catch them in an asynchronous model. However, sometimes they forget to invoke the callbacks when exceptions are caught.

behrad commented 9 years ago

Did you mean mqtt-client by "they"? Have you faced the exact same problem, where the client doesn't seem to call the callback when using the non-blocking API with SSL on Android > 5? On Android 4 it is OK.

cfan8 commented 9 years ago

I mean Paho. :(

raiscui commented 9 years ago

@cfan8 Can you share your fixed Paho?

cfan8 commented 9 years ago

@raiscui I tried to, but didn't fix it completely. There are some obvious leaks that only take a few minutes to fix. However, I can still see some leaks, and I don't have time to trace down the original leak point.