smartfog / fogflow

FogFlow is a standard-based IoT fog computing framework that supports serverless computing and edge computing with advanced programming models
https://fogflow.readthedocs.io
BSD 3-Clause "New" or "Revised" License

maintain the worker list according to their heartbeat information #338

Closed smartfog closed 2 years ago

smartfog commented 2 years ago

Problem: the master still keeps inactive workers in the scheduling list.

Solution: the master checks the liveness of each worker based on the last heartbeat message received from that worker.

showersky commented 2 years ago

fixed

naveennec commented 2 years ago

@showersky this is causing a problem: the existing worker gets removed, so the master cannot find a worker. The team tested both the NGSI-V1 and the NGSI-LD use cases, and the problem occurs in both. Below are the master's logs.

DEBUG: 2022/01/24 21:32:21 taskMgr.go:310: {Task.anomaly-detection.Detector.1254725908 Detector anomaly scheduled [{Device.PowerPanel.01 PowerPanel [] {35.7 138}} {Stream.Rule.01 Rule [] {0 0}}] [0xc000166180]}
DEBUG: 2022/01/24 21:32:21 taskMgr.go:311: hashID 1254725908, taskID Task.anomaly-detection.Detector.1254725908
INFO: 2022/01/24 21:32:21 add task {ID:Task.anomaly-detection.Detector.1254725908 ServiceName:anomaly-detection TaskName:Detector OperatorName:anomaly TaskType: FunctionCode: DockerImage: Parameters:[] WorkerID: IsExclusive:false PriorityLevel:50 Status:scheduled Inputs:[{Type:PowerPanel ID:Device.PowerPanel.01 FiwareServicePath: MsgFormat: AttributeList:[]} {Type:Rule ID:Stream.Rule.01 FiwareServicePath: MsgFormat: AttributeList:[]}] Outputs:[{Type:Anomaly StreamID:Anomaly.1254725908.1 Annotations:[]}]}
Locations ** [{35.7 138} {0 0}]
&&&& len(locations) &&&&&&&&& 2
DEBUG: 2022/01/24 21:32:21 master.go:879: points: [{Latitude:35.7 Longitude:138} {Latitude:0 Longitude:0}]
&&&& master.workers &&&&&& map[]
selectedWorkerID **
ERROR: 2022/01/24 21:32:21 taskMgr.go:894: ==NOT ABLE TO FIND A WORKER FOR THIS TASK===
workerID and profile ** Worker.001 &{Worker.001 {35 139} linux amd64 8 0 0 0001-01-01 00:00:00 +0000 UTC}
INFO: 2022/01/24 21:32:27 REMOVE worker Worker.001 from the list

showersky commented 2 years ago

I think this is related to the time interval of checking liveness of workers and the reporting time interval of heartbeat messages from each worker.

See the setting of these two time intervals:

20 MAX_HEARTBEAT_DURATION ==> 60 seconds

https://github.com/smartfog/fogflow/blob/development/master/master.go

10 seconds

w.ticker = time.NewTicker(time.Second * 10)

In addition, please make sure that both master and workers are upgraded to the latest version in your test.

If both are already the latest version, maybe the settings of time intervals are still too sensitive due to the network delay and message buffering time of RabbitMQ.

naveennec commented 2 years ago

I have a small doubt about the code: how is Last_Hearbeat_Update set? I assume it should contain the timestamp of the last heartbeat, but I could not find where Last_Hearbeat_Update gets updated (in the development branch). Please correct me if my understanding is wrong.

showersky commented 2 years ago

Sorry! Yes, you are right. I forgot to commit the other two lines of changes. Now they are committed.

I thought this was just a small change, so I did not create a new PR for this fix. I will avoid this in the future.

Please double-check whether the issue is resolved with the latest master and worker.

naveennec commented 2 years ago

Yes, the use cases are working fine now. Thanks.