Closed npinaeva closed 2 months ago
IIUC the deadlock is the main problem here and ofc the impact is intensified because we also end up blocking the node handling; out of curiosity when did we introduce the deadlock - was it something that was present always and we failed to hit it for a long time?
IIUC the deadlock is the main problem here and ofc the impact is intensified because we also end up blocking the node handling; out of curiosity when did we introduce the deadlock - was it something that was present always and we failed to hit it for a long time?
Yeah, a long time ago: around 4.7, it's just difficult to hit it
Beginning of the story is in #4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.
@npinaeva basically that's the relationship between all our factory handlers for a given resource. If one deadlocks, it affects all others in the same worker. Are you saying that that this is more sensible specifically for this EF node handler (but how? since I understand the original deadlock was fixed)? Or Is it because we don't want to hold the handlers while a potentially delayed DNS resolution happens (which I would have throught is done on a separate thread than the handler's but maybe not)?
Beginning of the story is in #4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.
@npinaeva basically that's the relationship between all our factory handlers for a given resource. If one deadlocks, it affects all others in the same worker. Are you saying that that this is more sensible specifically for this EF node handler (but how? since I understand the original deadlock was fixed)? Or Is it because we don't want to hold the handlers while a potentially delayed DNS resolution happens (which I would have throught is done on a separate thread than the handler's but maybe not)?
Yeah it is a common problem for factory handlers. Generally I would say if you can do an independent level-driven controller, do it. With egress firewall node events, it iterates over all existing egress firewall objects locking each of them. Which by itself is not very efficient, and depends on both egress firewall and dns resolution locking, but also postpones potential next node events handling, which may be important. As it is simple enough to change this part, I figured it makes sense to do.
Yeah it is a common problem for factory handlers. Generally I would say if you can do an independent level-driven controller, do it. With egress firewall node events, it iterates over all existing egress firewall objects locking each of them. Which by itself is not very efficient, and depends on both egress firewall and dns resolution locking, but also postpones potential next node events handling, which may be important. As it is simple enough to change this part, I figured it makes sense to do.
OK. It just occurred to me that as we do so, we should be careful to check for any potential dependency between the handler we are replacing and any other handler with different priority. This one looks simple enough.
cc @tssurya I think it waits for your approval as you requested changes before
lgtm on last commit + https://github.com/ovn-org/ovn-kubernetes/pull/4385/commits/ab00a4ba29e181356f826d89a56f3bda186d68ee fixes the node delete as well in the old code so maybe I/you can cherry-pick that one instead (I think Jaime can merge this given he has approved it?); thanks nadia
lgtm on last commit + ab00a4b fixes the node delete as well in the old code so maybe I/you can cherry-pick that one instead (I think Jaime can merge this given he has approved it?); thanks nadia
I knew I fixed it somewhere else :D I will update this PR to just add a test then.
Beginning of the story is in https://github.com/ovn-org/ovn-kubernetes/pull/4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.