ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0
818 stars 338 forks source link

Make egress firewall node controller level-driven #4388

Closed npinaeva closed 2 months ago

npinaeva commented 4 months ago

Beginning of the story is in https://github.com/ovn-org/ovn-kubernetes/pull/4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.

tssurya commented 4 months ago

IIUC the deadlock is the main problem here and ofc the impact is intensified because we also end up blocking the node handling; out of curiosity when did we introduce the deadlock - was it something that was present always and we failed to hit it for a long time?

npinaeva commented 4 months ago

IIUC the deadlock is the main problem here and ofc the impact is intensified because we also end up blocking the node handling; out of curiosity when did we introduce the deadlock - was it something that was present always and we failed to hit it for a long time?

Yeah, a long time ago: around 4.7, it's just difficult to hit it

coveralls commented 3 months ago

Coverage Status

coverage: 52.759% (+0.2%) from 52.563% when pulling 1cd2477dd61e5477b42657bf715d0fde506c5045 on npinaeva:ef-node-level-driven into 036f04d88c4d0f01eeb590ffb5474caf0017435b on ovn-org:master.

jcaamano commented 3 months ago

Beginning of the story is in #4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.

@npinaeva basically that's the relationship between all our factory handlers for a given resource. If one deadlocks, it affects all others in the same worker. Are you saying that that this is more sensible specifically for this EF node handler (but how? since I understand the original deadlock was fixed)? Or Is it because we don't want to hold the handlers while a potentially delayed DNS resolution happens (which I would have throught is done on a separate thread than the handler's but maybe not)?

npinaeva commented 3 months ago

Beginning of the story is in #4386 Switch to level-driven controller to avoid dependency between node handler and ef node handler. If EF handler is ever deadlocked (and it can be as node handler iterates all egress firewalls, which in turn, depend on dns resolvers), node handler should not be affected. See a new test for more idea.

@npinaeva basically that's the relationship between all our factory handlers for a given resource. If one deadlocks, it affects all others in the same worker. Are you saying that that this is more sensible specifically for this EF node handler (but how? since I understand the original deadlock was fixed)? Or Is it because we don't want to hold the handlers while a potentially delayed DNS resolution happens (which I would have throught is done on a separate thread than the handler's but maybe not)?

Yeah it is a common problem for factory handlers. Generally I would say if you can do an independent level-driven controller, do it. With egress firewall node events, it iterates over all existing egress firewall objects locking each of them. Which by itself is not very efficient, and depends on both egress firewall and dns resolution locking, but also postpones potential next node events handling, which may be important. As it is simple enough to change this part, I figured it makes sense to do.

jcaamano commented 3 months ago

Yeah it is a common problem for factory handlers. Generally I would say if you can do an independent level-driven controller, do it. With egress firewall node events, it iterates over all existing egress firewall objects locking each of them. Which by itself is not very efficient, and depends on both egress firewall and dns resolution locking, but also postpones potential next node events handling, which may be important. As it is simple enough to change this part, I figured it makes sense to do.

OK. It just occurred to me that as we do so, we should be careful to check for any potential dependency between the handler we are replacing and any other handler with different priority. This one looks simple enough.

npinaeva commented 3 months ago

cc @tssurya I think it waits for your approval as you requested changes before

tssurya commented 3 months ago

lgtm on last commit + https://github.com/ovn-org/ovn-kubernetes/pull/4385/commits/ab00a4ba29e181356f826d89a56f3bda186d68ee fixes the node delete as well in the old code so maybe I/you can cherry-pick that one instead (I think Jaime can merge this given he has approved it?); thanks nadia

npinaeva commented 2 months ago

lgtm on last commit + ab00a4b fixes the node delete as well in the old code so maybe I/you can cherry-pick that one instead (I think Jaime can merge this given he has approved it?); thanks nadia

I knew I fixed it somewhere else :D I will update this PR to just add a test then.