telefonicaid / fiware-orion

Context Broker and CEF building block for context data management, providing NGSI interfaces.
https://github.com/telefonicaid/fiware-orion/blob/master/doc/manuals/orion-api.md
GNU Affero General Public License v3.0

OCB not responding after some time #3684

Open MichelSc opened 4 years ago

MichelSc commented 4 years ago

This is our setup: we have an OPC UA IotAgent that gathers device data and sends them to Orion Context Broker, to be further processed by a Perseo node or a QuantumLeap node.

We use version 1.3.7 of OPCUA IotAgent and version 2.4.0-next of Orion Context Broker. Both are run in docker containers with the standard fiware images.

This is what we observe: after some time, OCB stops handling requests from the iotagent. Nothing useful is found in the logs: OCB stops logging at the moment it stops responding. From that moment on: 1) the OCB container is still running, 2) OCB process is still alive and listening to port 1026, 3) the number of open connections on port 1026 on the OCB side in state ESTAB grows without bound (probably 1 or 2 per unhandled incoming iotagent HTTP request).

In our setup, the OCB container is started by docker-compose as a result of the dependencies declared in the docker-compose file.

So when we do 'docker-compose up iotagent', docker-compose starts the OCB container, then the iotagent container (as expected), and we get the issue described above.

If we instead first run 'docker-compose up -d orion' to start the OCB container, wait a few seconds, and then start the other services, we do NOT get the issue described above. This is thus a workaround for the issue.
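The "wait a few seconds" step of this workaround could be made explicit by polling Orion until it answers before starting the other services. The following is only a minimal sketch, not something taken from the setup described here: the function names `orion_is_ready` and `wait_for_orion`, the retry parameters, and the base URL in the usage comment are all assumptions for illustration. It relies only on `GET /version` on port 1026.

```python
import http.client
import time
import urllib.error
import urllib.request


def orion_is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, or timed out: treat Orion as not ready yet.
        return False


def wait_for_orion(base_url, attempts=30, delay=1.0,
                   probe=orion_is_ready, sleep=time.sleep):
    """Poll until the probe succeeds.

    Returns True as soon as the probe succeeds, or False once all
    attempts are used up. `probe` and `sleep` are injectable for testing.
    """
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Hypothetical usage (the host name "orion" is an assumption):
#   if wait_for_orion("http://orion:1026"):
#       ...start the iotagent...
```

A script like this would run between starting the orion service and starting the iotagent. It only replaces the manual wait; it does not explain why an early connection leads to the hang later on.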

fgalan commented 4 years ago

Question:

2) OCB process is still alive and listening to port 1026

...and still responding to requests on port 1026 (e.g. GET /version, GET /v2/entities/v2)?

MichelSc commented 4 years ago

Orion no longer responds to port 1026.

A "GET /version" gives:

GET http://10.10.10.2:1026/version
Error: read ECONNRESET
Request Headers
User-Agent: PostmanRuntime/7.26.3
Accept: */*
Postman-Token: f6741cf7-e86b-4edd-bac0-f48f9301c4bc
Host: 10.10.10.2:1026
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

On the OCB container, executing "ss -tunp | grep 1026" gives

tcp ESTAB 0 0 192.168.16.3:1026 192.168.16.4:51916 users:((...))

repeated (curiously) 1026 times
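To track how this number grows over time, the `ss` output can be summarized per peer address. This is a rough sketch that assumes the column layout shown in the line above (Netid, State, Recv-Q, Send-Q, local address, peer address); the function name `count_established` is made up for illustration.

```python
from collections import Counter


def count_established(ss_output, local_port=1026):
    """Count ESTAB connections on local_port, grouped by peer IP address."""
    peers = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: Netid State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if len(fields) < 6 or fields[1] != "ESTAB":
            continue
        local, peer = fields[4], fields[5]
        # Keep only connections whose local side is the given port.
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        peers[peer.rsplit(":", 1)[0]] += 1
    return peers


# Possible usage inside the container (requires the `ss` tool):
#   import subprocess
#   out = subprocess.run(["ss", "-tun"], capture_output=True, text=True).stdout
#   print(count_established(out))
```

Running it periodically and logging the totals would show whether the count rises steadily with each request or stops at some ceiling.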

fgalan commented 4 years ago

I'm not fully sure, but the problem seems to be related not to Orion itself but to the underlying Docker layer. The fact that the issue appears or not depending on how you start the whole setup with docker-compose seems to point in that direction.

Maybe some Docker expert could provide more advice on how to debug this and find the root cause.

MichelSc commented 4 years ago

The problem could be related to Orion and not to Docker, as it could be caused by a too-early connection of the iotagent. In our setup, for reproducing the problem, the iotagent is started (by docker) while Orion is still warming up. In the workaround, the iotagent is started (by docker) when Orion is up and running. This might explain the difference.

fgalan commented 4 years ago

The problem could be related to Orion and not to Docker, as it could be caused by a too-early connection of the iotagent. In our setup, for reproducing the problem, the iotagent is started (by docker) while Orion is still warming up. In the workaround, the iotagent is started (by docker) when Orion is up and running. This might explain the difference.

In this case, it is not a problem with Orion itself but with the orchestration of the startup of Orion clients (in this case IOTA). I mean, nothing can be done in Orion code to solve a situation in which a client tries to access Orion before Orion starts.

However, the original issue does not describe this situation, does it? You say "after some time, OCB stops handling requests from the iotagent". I mean, the situation happens "after some time", not at startup.

A bit confusing to me :)

MichelSc commented 4 years ago

Sorry for the confusion. The issue happens after OCB has been working for some time, 5 to 15 minutes after the start of both processes. The exact time depends on the number of messages processed. As if OCB runs out of some resource, like file descriptors. The issue happens only if the iotagent is started too soon after the start of OCB (OCB is started by docker-compose before the iotagent, as a result of the dependencies). In that case, the issue arises after 5 to 15 minutes of healthy operation. If we wait a few seconds after the start of OCB and then start the iotagent, there is no issue.
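One way to check the file-descriptor hypothesis would be to compare the number of descriptors the OCB process has open with its "Max open files" limit. Below is a minimal, Linux-only sketch that reads `/proc/<pid>/fd` and `/proc/<pid>/limits`. The function names are hypothetical, and the PID of the Orion process has to be found separately.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    A limit reported as 'unlimited' is returned as None.
    """
    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Line layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (needs permission to read it)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid):
    """Return (open_fds, soft_limit) for the given process."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    return count_open_fds(pid), soft
```

If `fd_usage` reported the open count approaching the soft limit at the moment OCB stops responding, that would support the resource-exhaustion idea; if not, it would rule this particular resource out.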

fgalan commented 4 years ago

Quite a mystery... :)

Not sure how to help with this. I don't have any clues about the problem but, anyway, I still think it is not an issue in Orion itself but something in the environment where it runs.

What you say about

As if OCB runs out of some resource, like file descriptors

makes me think of the Orion performance tuning document (at https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning/index.html). I'd suggest having a look, in case some of the recommendations provided there help.