telefonicaid / iotagent-json

IoT Agent for a JSON based protocol (with HTTP, MQTT and AMQP transports)
https://fiware-iotagent-json.rtfd.io/
GNU Affero General Public License v3.0

Docker image: out of memory error in scenarios with high input traffic (HTTP transport) #336

Closed dcalvoalonso closed 5 years ago

dcalvoalonso commented 5 years ago

When this agent runs from a Docker image in scenarios with high input traffic, it consumes an excessive amount of RAM and eventually fails with an 'Out of memory' error. This bug happens when HTTP is used as the transport protocol. Please find attached the following files:

dcalvoalonso commented 5 years ago

In our tests, the solution covers:

fgalan commented 5 years ago

What exactly does PM2 provide? Is it a kind of watchdog for the node process? How does it compare with other alternatives (such as forever or monit)?

dcalvoalonso commented 5 years ago

PM2 is a production process manager. Basically, it has similar features to forever or monit, but in my opinion, it is a bit more powerful and intuitive. There are many comparisons available on the Internet:

In general, we have had good experiences working with it.

fgalan commented 5 years ago

Once PR #337 has been merged, should this issue be closed?

dcalvoalonso commented 5 years ago

Sure @fgalan !! As always, thanks for your comments! ;)

efviodo commented 5 years ago

Nice work!!

Just a few comments:

Thanks and again great work!

dcalvoalonso commented 5 years ago

Hi @efviodo,

First of all, thanks for your inputs.

Since the PR was accepted, it would be desirable for the agent's installation documentation to be updated, explaining how to deploy and configure it using PM2. Performance hints like those other FIWARE components have (e.g. Cygnus, Orion) would also be really appreciated.

  • As explained in the issue, the performance problem only affected installations that use Docker images. Therefore, I don't think the installation instructions need to be updated, since the PM2 setup is abstracted by the Docker image. Anyway, as you can see in https://github.com/telefonicaid/iotagent-json/pull/337/files, the instructions on how to use the Docker image were updated.

I was curious, so I opened the Dockerfile of release 1.9.0, which includes this PR. The agent process is configured without parallelism. This means that a single agent process is set up behind the PM2 process. Is this configuration okay? To obtain more stability in scenarios with high input traffic, you should configure multiple agent instances with PM2 by enabling cluster mode. In this way, the load is distributed among multiple instances of the agent. Maybe this is a good subject to explain in more detail in possible documentation with performance tips.
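For illustration, the cluster-mode suggestion above could look roughly like the following PM2 ecosystem file. This is a minimal sketch, not the configuration shipped in the project's Docker image: the app name, the `bin/iotagent-json` entry point, and the memory threshold are assumptions chosen for the example, while `exec_mode`, `instances`, and `max_memory_restart` are standard PM2 options.

```javascript
// ecosystem.config.js (hypothetical file, not taken from the repository)
const config = {
  apps: [
    {
      name: 'iotagent-json',        // assumed app name, for illustration only
      script: 'bin/iotagent-json',  // assumed entry point of the agent
      exec_mode: 'cluster',         // PM2 cluster mode: workers share the HTTP port
      instances: 'max',             // one worker per CPU core; a number also works
      max_memory_restart: '512M',   // illustrative threshold: restart a worker above it
    },
  ],
};

module.exports = config;
```

Inside a container, a file like this would typically be started with `pm2-runtime ecosystem.config.js`. Cluster mode balances incoming HTTP connections across the workers; whether the agent's other transports behave correctly with several instances is not covered in this thread and would need to be checked.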

Regarding the performance test results shown in PR #337 with the new configuration: how many instances of the agent did you configure? Also, in the third image, which shows the agent's response times, I see peaks of almost 36000 ms (36 seconds), and the average seems to be around 16000 ms. Are these response times acceptable?

Finally, why use PM2 rather than a high-performance load balancer like NGINX to scale out the agent?

fgalan commented 5 years ago

Just a minor addition to the outstanding answer that @dcalvoalonso has provided.

The tests were done with a single instance of the IoT Agent. I agree that the response times are high, but you have to take into account that the tests apply a continuous average load of over 100 transactions per second. This type of test makes it possible to measure the performance of the agent and verify its stability over time, but to use it in production under such conditions, it would be necessary to scale up the number of instances and probably also the number of context brokers and MongoDB databases.

ContextBroker typically supports a minimum throughput of around 1000 tps in conventional setups (it can be even more depending on performance tuning, see https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning/index.html), so maybe scaling it is not necessary. Of course, tests would need to be conducted to check whether my guess is correct and where the bottleneck of the system is :)
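As a rough back-of-the-envelope check on the figures quoted in this thread, Little's law (requests in flight = arrival rate × mean response time) gives an idea of how many requests a single agent instance was holding open at once. This is only a sketch: it assumes a steady state, takes the load as exactly 100 transactions per second, and uses the average of about 16000 ms read off the chart; the function name is made up for the example.

```javascript
// Little's law: L = lambda * W
// lambda: arrival rate in requests per second
// W: mean response time, converted here from milliseconds to seconds
function inFlightRequests(arrivalRatePerSecond, meanResponseTimeMs) {
  return arrivalRatePerSecond * (meanResponseTimeMs / 1000);
}

// Figures quoted in the thread: ~100 tps and ~16000 ms average response time.
console.log(inFlightRequests(100, 16000)); // 1600
```

Under those assumptions, roughly 1600 requests would be pending at any moment, each holding some memory until it completes, which is consistent with the suggestion to spread the load over several instances.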