telefonicaid / iotagent-json

IoT Agent for a JSON based protocol (with HTTP, MQTT and AMQP transports)
https://fiware-iotagent-json.rtfd.io/
GNU Affero General Public License v3.0

Add feature: Prevention of data loss at IOT agent if n/w failed #397

Open chandradeep11 opened 5 years ago

chandradeep11 commented 5 years ago

If the network between the IoT Agent and Orion (Context Broker) fails, the device measurements sent from the device to the IoT Agent are currently lost. There should be a mechanism that prevents this data loss at the IoT Agent.
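One way to picture such a mechanism is a store-and-forward buffer: measures are queued in memory and only dropped from the queue once the Context Broker has accepted them. The following is a minimal sketch in Python under stated assumptions, not the IoT Agent's actual implementation. The class `StoreAndForwardBuffer`, the `send` callback and the `max_pending` limit are all hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict

Measure = Dict[str, object]


class StoreAndForwardBuffer:
    """Hypothetical helper: keeps measures in memory until the broker accepts them."""

    def __init__(self, send: Callable[[Measure], None], max_pending: int = 10000) -> None:
        # `send` is an assumed callback that raises OSError (for example
        # ConnectionError) when the Context Broker cannot be reached.
        self._send = send
        # A bounded deque: once full, appending silently drops the OLDEST measure.
        self._pending: Deque[Measure] = deque(maxlen=max_pending)

    def submit(self, measure: Measure) -> None:
        """Queue a new measure and try to deliver everything that is pending."""
        self._pending.append(measure)
        self.flush()

    def flush(self) -> int:
        """Deliver pending measures in arrival order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # broker unreachable: keep the measure for a later retry
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        """Number of measures still waiting for delivery."""
        return len(self._pending)
```

In this sketch the buffer lives in memory only, so measures would still be lost if the agent process itself restarts; a persistent queue would be needed to cover that case. The bounded size is a deliberate trade-off between memory use and how long an outage can be bridged.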

onrao commented 3 years ago

Hi

Any update on enabling this feature? We are observing lost socket connections between IoT Agent JSON and Orion when the IoT Agent data rate is >5000 records per minute. Are there any pointers to fix this issue, or what is the limit on the data rate supported by a single IoT Agent pod in a K8s deployment environment? We are looking for a quick solution or fix.

Thanks & Regards ONR

fgalan commented 3 years ago

@Chandradeep-NEC @onrao could you elaborate on which feature you are proposing for the IoT Agent? The original issue description is too broad and we would need to know which mechanism in particular you are proposing to prevent data loss.

onrao commented 3 years ago

@fgalan Our observations are:

  1. Apart from the MEASURES-002 or ORION-ALARM error logging, we are not able to tell what caused this socket error or how to recover from it.
  2. As it comes from the updatecontext() functionality of the IoT Agent Node Lib, the socket error needs to state explicitly whether it is due to the Orion endpoint being unreachable, a timeout, etc.
  3. We would like to know when this ALARM or ERROR is logged: is it due to too many update requests, or to the size of the messages being updated? And what is the limit on the maximum number of updates per second/minute from the IoT Agent to Orion CB?

mapedraza commented 3 years ago

In my humble opinion, what @Chandradeep-NEC is proposing sounds different from what you (@onrao) are describing here.

I understand you are facing issues when running the IoTA on K8s under a certain load (but you are not describing your infrastructure resources, which also impose limits, or the configuration deployed on the IoTA, which really impacts the throughput). Concerning the list of observations in your last comment, I can't understand it well.

  1. You are saying you are not able to tell from the IoTA log when a socket error occurs, but you can see it in the Orion log, right?
  2. Could you elaborate more on this topic?
  3. I cannot understand this point. Would you mind rephrasing it?

onrao commented 3 years ago

@fgalan Sorry for the late reply.

  1. We didn't find any error in the Orion log.
  2. I mean that, as per our understanding, the socket error is handled and captured inside the IoT Agent Node Lib, and the log indicates only "Socket open error". No further details are provided on why it couldn't open the socket.
  3. We would like to know under which conditions this error is thrown: is it due to the number of requests per second/minute, or to a timeout not being respected before each session, etc.?

As per the documentation, it is described as follows:

MEASURES-002: COULDN'T SEND THE UPDATED VALUES TO THE CONTEXT BROKER DUE TO AN ERROR: %S

There was some communication error connecting with the Context Broker that made it impossible to send the measures. If this log appears frequently, it may be a signal of network problems between the IoTAgent and the Context Broker. Check the IoTAgent network connection, and the configured Context Broker port and host.

ORION-ALARM | Critical | Indicates a persistent error accessing the Context Broker

There is no further info on this alarm/error.

mapedraza commented 3 years ago

Could you provide a procedure to reproduce the problem in order to analyze what is happening?

onrao commented 3 years ago

FIWARE Stack used

  1. IoT Agent JSON(1.12.0) with MQTT Binding enabled + MQTT Broker
  2. Orion CB(2.6.0)
  3. Draco(~1.3.0)

All these are deployed as Docker containers, running as services in a Kubernetes cluster with 2 nodes of mid-range VMs. All these services are connected internally through service endpoints in the cluster, with auto-scaling enabled.

step1: Provisioned the 12 devices in the IoT Agent and generated the ACL for each device

step2: Configured VerneMQ to enable the device ACLs and validation accordingly

step4: Simulated 12 devices with a data rate of 5500 data topics/minute

step5: Observed that device data is lost between the IoT Agent and Orion CB, although both are running fine on the VM node, whereas the IoT Agent log indicates a "MEASURES-002 error / ORION-ALARM due to the Socket open error"

When step 4 was run at 5000 records/topics per minute for a 1-hour simulation, no ERROR or ALARM was triggered in the EKS pod logs.
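To put the two reported rates in perspective, the short Python sketch below converts them to messages per second, in total and per device. It assumes the load is spread evenly across the 12 simulated devices, which the description above does not state; the function names are illustrative only.

```python
def per_second(per_minute: float) -> float:
    """Convert a per-minute message rate to a per-second rate."""
    return per_minute / 60.0


def per_device_per_second(per_minute: float, devices: int) -> float:
    """Per-device rate, ASSUMING messages are spread evenly over all devices."""
    return per_second(per_minute) / devices


# 5000/min ran without errors; 5500/min triggered the reported errors.
for rate in (5000, 5500):
    total = per_second(rate)
    each = per_device_per_second(rate, devices=12)
    print(f"{rate}/min = {total:.1f}/s in total, {each:.2f}/s per device")
```

Under that assumption, 5000 records per minute is roughly 83.3 updates per second (about 6.94 per device), and 5500 is roughly 91.7 per second (about 7.64 per device), so the two runs differ by only about 8 updates per second in total.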

We need a quick alternative solution from you, and we need to know of any limitation in the IoT Agent code that is not covered in the documentation.

fgalan commented 3 years ago

Thank you for your feedback, but note that full detail is needed to precisely reproduce your case. Please see my comments inline.

FIWARE Stack used

  1. IoT Agent JSON(1.12.0) with MQTT Binding enabled + MQTT Broker
  2. Orion CB(2.6.0)
  3. Draco(~1.3.0)

All these are deployed as a docker containers deployed as a service in Kubernetes cluster with 2 Nodes of Midrange VM's. All these services are internally connected with service endpoints in the cluster with auto-scaling enabled.

Could you provide the exact Kubernetes deployment configuration you are using (Helm charts or whatever)?

step1: Provisioned the 12 devices in IoT Agent and generated the ACL for each device

How are you provisioning the devices? Could you provide the curl command you are using (or the equivalent in curl to the command you are using)? How are you generating the ACLs? Could you provide the curl command you are using (or the equivalent in curl to the command you are using)?

step2:Configured VerneMQ to enable the devices ACL and validation accordingly

What is VerneMQ?

step4: Simulating 12 devices with data rate @5500 data topics/minute

How do you simulate this? Could you provide the script, program or similar (e.g. JMeter configuration, etc.) you are using?

step5 : Observed Devices data is loss between IoT Agent and Orion CB but both are running fine on VM Node where as IoT Agent log indicates there is an "MEASURE-002 error/ ORION-ALARM due to the Socket open error"

The Step#4 with @5000 records/topics per minute run for 1 hour duration of simulation , then no ERROR or ALARM triggered inside the EKS pod logs.

We need your quick alternative solution and to know any limitation as per the IoT Agent code , where it is not updated in the document

onrao commented 3 years ago

@fgalan please find the required details below. We need a quick confirmation and a way forward for this issue.

Environment Setup: image

Error: image

Device Simulator:

Device_ID | Crane1_1_S
No_of_Sensors | 9
Activity_Duration_mins | 240

Attribute | Type | Active(Y/N) | Range/Value | minvalue | maxvalue | Unit | Set_No | Frequency_secs | Topic
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
liftingHeight | number | N | Range | 0 | 0 | m | 1 | 1 | /iot/Crane1_1_S/attrs
windingSpeed | number | N | Range | 0 | 0 | m/s | 1 | 1 | /iot/Crane1_1_S/attrs
load | number | N | Range | 10.4 | 10.4 | N | 1 | 1 | /iot/Crane1_1_S/attrs
turningAngle | number | N | Range | 10 | 10 | ° | 1 | 1 | /iot/Crane1_1_S/attrs
turningspeed | number | N | Range | 12.5 | 12.5 | °/s | 1 | 1 | /iot/Crane1_1_S/attrs
motorCurrent | number | Y | Range | 0 | 10 | A | 1 | 1 | /iot/Crane1_1_S/attrs
brake | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_1_S/attrs
hoist | text | N | Range | 9.7 | 9.7 |   | 1 | 1 | /iot/Crane1_1_S/attrs
hoistCoolingFan | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_1_S/attrs

Device_ID | Crane1_2_S
No_of_Sensors | 9
Activity_Duration_mins | 240

Attribute | Type | Active(Y/N) | Range/Value | minvalue | maxvalue | Unit | Set_No | Frequency_secs | Topic
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
liftingHeight | number | N | Range | 0 | 0 | m | 1 | 1 | /iot/Crane1_2_S/attrs
windingSpeed | number | N | Range | 0 | 0 | m/s | 1 | 1 | /iot/Crane1_2_S/attrs
load | number | N | Range | 10.4 | 10.4 | N | 1 | 1 | /iot/Crane1_2_S/attrs
turningAngle | number | N | Range | 10 | 10 | ° | 1 | 1 | /iot/Crane1_2_S/attrs
turningspeed | number | N | Range | 12.5 | 12.5 | °/s | 1 | 1 | /iot/Crane1_2_S/attrs
motorCurrent | number | Y | Range | 0 | 10 | A | 1 | 1 | /iot/Crane1_2_S/attrs
brake | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_2_S/attrs
hoist | text | N | Range | 9.7 | 9.7 |   | 1 | 1 | /iot/Crane1_2_S/attrs
hoistCoolingFan | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_2_S/attrs

Device_ID | Crane1_3_S
No_of_Sensors | 9
Activity_Duration_mins | 240

Attribute | Type | Active(Y/N) | Range/Value | minvalue | maxvalue | Unit | Set_No | Frequency_secs | Topic
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
liftingHeight | number | N | Range | 0 | 0 | m | 1 | 1 | /iot/Crane1_3_S/attrs
windingSpeed | number | N | Range | 0 | 0 | m/s | 1 | 1 | /iot/Crane1_3_S/attrs
load | number | N | Range | 10.4 | 10.4 | N | 1 | 1 | /iot/Crane1_3_S/attrs
turningAngle | number | N | Range | 10 | 10 | ° | 1 | 1 | /iot/Crane1_3_S/attrs
turningspeed | number | N | Range | 12.5 | 12.5 | °/s | 1 | 1 | /iot/Crane1_3_S/attrs
motorCurrent | number | Y | Range | 0 | 10 | A | 1 | 1 | /iot/Crane1_3_S/attrs
brake | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_3_S/attrs
hoist | text | N | Range | 9.7 | 9.7 |   | 1 | 1 | /iot/Crane1_3_S/attrs
hoistCoolingFan | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane1_3_S/attrs

Device_ID | Crane2_1_S
No_of_Sensors | 9
Activity_Duration_mins | 240

Attribute | Type | Active(Y/N) | Range/Value | minvalue | maxvalue | Unit | Set_No | Frequency_secs | Topic
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
liftingHeight | number | N | Range | 0 | 0 | m | 1 | 1 | /iot/Crane2_1_S/attrs
windingSpeed | number | N | Range | 0 | 0 | m/s | 1 | 1 | /iot/Crane2_1_S/attrs
load | number | N | Range | 10.4 | 10.4 | N | 1 | 1 | /iot/Crane2_1_S/attrs
turningAngle | number | N | Range | 10 | 10 | ° | 1 | 1 | /iot/Crane2_1_S/attrs
turningspeed | number | N | Range | 12.5 | 12.5 | °/s | 1 | 1 | /iot/Crane2_1_S/attrs
motorCurrent | number | Y | Range | 0 | 10 | A | 1 | 1 | /iot/Crane2_1_S/attrs
brake | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane2_1_S/attrs
hoist | text | N | Range | 9.7 | 9.7 |   | 1 | 1 | /iot/Crane2_1_S/attrs
hoistCoolingFan | text | N | Range | 0 | 0 |   | 1 | 1 | /iot/Crane2_1_S/attrs

fgalan commented 3 years ago

I'm afraid your last comment doesn't correspond to what I asked...

With regard to the Kubernetes configuration, a screenshot is not acceptable. Could you provide the files in textual form, please?

With regard to the simulation information, a similar problem occurs. I don't know what that table means... you don't even mention which simulation tool you are using. Please provide more detail on this.

Finally, we would need an answer (as direct and detailed as you can) to the following questions:

onrao commented 2 years ago

@fgalan We saw that the issue already reported at https://github.com/telefonicaid/iotagent-ul/issues/383 is the same as the one we faced, but it is still open. Are there any suggestions from the FIWARE community?

VerneMQ, mentioned above, is the MQTT broker we used.

fgalan commented 2 years ago

@onrao thanks for the feedback!

We will be more than happy to review and eventually merge any pull request that identifies and solves the problem, if at the end it is confirmed. Thanks! :)

vijapandey commented 1 year ago

I have reviewed all the comments but am not able to conclude what the valid solution is... @fgalan, could you please suggest a solution to the above issue? I am having the same problem.

fgalan commented 1 year ago

I'm afraid I cannot provide any solution because, honestly, I'm a bit lost in this issue and I don't know what the exact problem is :)

If somebody could summarize and explain the exact problem and the proposed solution it would be great. Based on that I could provide better feedback.