opiproject / opi-evpn-bridge

OPI gRPC to EVPN GW FRR bridge
Apache License 2.0

Design and implement a proper way to handle Pending objects when a module is not replying. #388

Open mardim91 opened 4 months ago

mardim91 commented 4 months ago

When an object (a VRF, an LB, or any other object) is created and one of the modules gets stuck and returns neither success nor error, the UpdateStatus function is never called, and as a result the task in the task manager is requeued immediately. This approach has several problems, listed below:

  1. The Publish function uses a blocking channel. If the module cannot drain the channel on which the notification is sent, the task manager cannot publish the next notification when the task expires and is requeued. One solution is to make the Publish function non-blocking; how to do that properly still needs to be designed. This is a serious bug because each side ends up waiting on the other: the task manager is stuck in the Publish function, so it cannot move forward and drain the TaskStatus channel, and the module therefore cannot call the UpdateStatus function and return. This can make opi-evpn-bridge unresponsive, at which point it has to be restarted.

  2. When a pending task expires, we requeue it immediately. If the module is stuck, the task therefore expires and is requeued many times without any exponential backoff timer. This is not good because it can overload the Publish function and the module itself. If we implement an exponential backoff timer for these Pending tasks, we also need to make the TaskStatus channel non-blocking. Otherwise the following can happen: the module unsticks itself after a long time and calls the UpdateStatus function, but the task has not been requeued yet because it is still waiting for the timer to expire. The queue is empty, so the task manager is not reading the TaskStatus channel and never picks up the status the module has sent.