Docker image : out of memory error in scenarios with high input traffic (HTTP transport)

dcalvoalonso commented 5 years ago

When this agent runs from a Docker image under high input traffic, its RAM consumption grows excessively until it ends in an 'Out of memory' error. The bug occurs when HTTP is used as the transport protocol. Please find attached the following files:

JMeter file to simulate traffic and reproduce the bug: scriptJMeterSoporte.zip
Screenshots with the bug, response times and number of requests per second:

dcalvoalonso commented 5 years ago

In our tests, the solution consists of:

Using PM2 to execute the agent.
Running the Docker image with the -it --init options (recommended by https://github.com/nodejs/docker-node/blob/master/docs/BestPractices.md).
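
A minimal sketch of what running the agent under PM2 as a single process could look like, written as a TypeScript object that mirrors a PM2 ecosystem file. The thread does not show the actual configuration from PR #337, so the script path, config file name, and memory limit below are assumptions. `max_memory_restart` is a PM2 option, but nothing in this issue says the PR uses it.

```typescript
// Shape of one app entry in a PM2 ecosystem file. The field names are PM2
// options; the interface itself is only defined here for illustration.
interface Pm2App {
  name: string;
  script: string;
  args?: string;
  instances: number | "max";
  exec_mode: "fork" | "cluster";
  max_memory_restart?: string;
}

// Assumed values: the script path, the config file name and the 512M limit
// are placeholders, not taken from the iotagent-json Dockerfile.
const singleAgent: Pm2App = {
  name: "iotagent-json",
  script: "bin/iotagent-json",
  args: "config.js",
  instances: 1,               // one agent process, no parallelism
  exec_mode: "fork",
  max_memory_restart: "512M", // restart the process if it grows past this
};

// PM2 reads the list of managed processes from the `apps` array.
const ecosystem = { apps: [singleAgent] };

console.log(JSON.stringify(ecosystem, null, 2));
```

In a real `ecosystem.config.js` the same object would be assigned to `module.exports`. The container itself would still be started with `docker run -it --init <image>`, where `<image>` is a placeholder.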

fgalan commented 5 years ago

What exactly does PM2 provide? Is it a kind of watchdog for the node process? How does it compare with other alternatives (such as forever or monit)?

dcalvoalonso commented 5 years ago

PM2 is a production process manager. Basically, it has similar features to forever or monit, but in my opinion, it is a bit more powerful and intuitive. There are many comparisons available on the Internet:

In general, we have had good experiences working with it.

fgalan commented 5 years ago

Once PR #337 has been merged, should this issue be closed?

dcalvoalonso commented 5 years ago

Sure @fgalan !! As always, thanks for your comments! ;)

efviodo commented 5 years ago

Nice work!!

Just a few comments:

Since the PR was accepted, it would be desirable for the agent's installation documentation to be updated, explaining how to deploy and configure it using PM2. Also, performance hints like those other FIWARE components have (e.g. Cygnus, Orion) would be really appreciated.
I was curious, so I opened the Dockerfile of release 1.9.0, which includes this PR. The agent process is configured without parallelism. This means that a single agent process is being set up behind the PM2 process. Is this configuration okay? To obtain more stability in scenarios with high input traffic, you should configure multiple agent instances with PM2 by enabling cluster mode. In this way, the load is distributed among multiple instances of the agent. Maybe this is a good subject to explain in more detail in possible documentation with performance tips.
About the results of the performance tests shown in PR #337 with the new configuration: how many instances of the agent did you configure? Also, in the third image, where the response times of the agent are shown, I see peaks of almost 36000 ms (36 seconds), and the average seems to be around 16000 ms. Are these response times acceptable?
Finally, why use PM2 and not some high-performance load balancer like NGINX in order to scale out the agent?

Thanks and again great work!

dcalvoalonso commented 5 years ago

Hi @efviodo,

First of all, thanks for your inputs.

Since the PR was accepted, it would be desirable for the agent's installation documentation to be updated, explaining how to deploy and configure it using PM2. Also, performance hints like those other FIWARE components have (e.g. Cygnus, Orion) would be really appreciated.

As explained in the issue, the performance problem only affected installations using Docker images. Therefore, I don't think the installation instructions need updating, since the PM2 setup is abstracted by the Docker image. Anyway, as you can see in https://github.com/telefonicaid/iotagent-json/pull/337/files, the instructions on how to use the Docker image were updated.

I was curious, so I opened the Dockerfile of release 1.9.0, which includes this PR. The agent process is configured without parallelism. This means that a single agent process is being set up behind the PM2 process. Is this configuration okay? To obtain more stability in scenarios with high input traffic, you should configure multiple agent instances with PM2 by enabling cluster mode. In this way, the load is distributed among multiple instances of the agent. Maybe this is a good subject to explain in more detail in possible documentation with performance tips.

Regarding this point, our proposal to address high-availability/high-traffic scenarios would be to rely on Docker Swarm or Kubernetes in order to run multiple instances of the agent. However, I agree with you that this topic is not yet covered by the documentation.
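
For reference, the PM2 cluster mode raised in the quoted comment could be sketched as below. This only illustrates that alternative, not the setup adopted in the Docker image. The helper `toClusterMode` and the app values are hypothetical, while `instances` and `exec_mode` are PM2 options.

```typescript
// Shape of one app entry in a PM2 ecosystem file (illustrative subset).
interface Pm2App {
  name: string;
  script: string;
  instances: number | "max";
  exec_mode: "fork" | "cluster";
}

// Hypothetical helper: returns a copy of the app entry switched to cluster
// mode. With "max", PM2 starts one worker per available CPU core.
function toClusterMode(app: Pm2App, instances: number | "max" = "max"): Pm2App {
  return { ...app, instances, exec_mode: "cluster" };
}

// Assumed single-process entry; the script path is a placeholder.
const single: Pm2App = {
  name: "iotagent-json",
  script: "bin/iotagent-json",
  instances: 1,
  exec_mode: "fork",
};

const clustered = toClusterMode(single, 4);
console.log(clustered.exec_mode, clustered.instances);
```

PM2 cluster mode builds on the Node.js cluster module, so the workers share the same HTTP port on a single host. Docker Swarm or Kubernetes replicas, as proposed above, spread the instances across containers instead.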

About the results of the performance tests shown in PR #337 with the new configuration: how many instances of the agent did you configure? Also, in the third image, where the response times of the agent are shown, I see peaks of almost 36000 ms (36 seconds), and the average seems to be around 16000 ms. Are these response times acceptable?

The tests were done with a single instance of the IoT Agent. I agree that the response times are high, but you have to take into account that the tests introduce a continuous average load of over 100 transactions per second. This type of test measures the performance of the agent and verifies its stability over time, but to use it in production under such conditions it would be necessary to scale up the number of instances and probably also the number of context brokers and MongoDB databases.
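
One way to read these figures, assuming the system is in steady state, is Little's Law (not mentioned in the thread): the average number of requests in flight equals the arrival rate multiplied by the average response time. The sketch below applies it to the approximate numbers quoted here (about 100 transactions per second, about 16000 ms average response time); the function name is hypothetical.

```typescript
// Little's Law: L = lambda * W
//   lambda = arrival rate (requests per second)
//   W      = mean time a request spends in the system (seconds)
//   L      = mean number of requests in the system at once
function inFlightRequests(arrivalRatePerSec: number, meanResponseMs: number): number {
  return arrivalRatePerSec * (meanResponseMs / 1000);
}

// Approximate figures taken from the discussion, not measured values.
const averageInFlight = inFlightRequests(100, 16000);

console.log(averageInFlight); // 1600
```

Under these assumptions a single agent instance would be holding on the order of 1600 requests open at any moment, which is consistent with the remark that production use at this load would need more instances.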

Finally, why use PM2 and not some high-performance load balancer like NGINX in order to scale out the agent?

PM2 was introduced not to increase performance but to mitigate the excessive RAM usage when the agent is executed within Docker. As I said before, to address a high-performance scenario, I would opt for Kubernetes/Docker Swarm and a load balancer, as you suggest.

fgalan commented 5 years ago

Just a minor addition to the outstanding answer that @dcalvoalonso has provided.

The tests were done with a single instance of the IoT Agent. I agree that the response times are high, but you have to take into account that the tests introduce a continuous average load of over 100 transactions per second. This type of test measures the performance of the agent and verifies its stability over time, but to use it in production under such conditions it would be necessary to scale up the number of instances and probably also the number of context brokers and MongoDB databases.

ContextBroker typically supports a minimum throughput of around 1000 tps in conventional setups (it can be even more depending on performance tuning, see https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning/index.html), so maybe scaling it is not necessary. Of course, tests would need to be conducted to check whether my guess is correct and where the bottleneck of the system is :)

telefonicaid / iotagent-json

Docker image : out of memory error in scenarios with high input traffic (HTTP transport) #336