lf-edge / eve

EVE is Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0
478 stars 164 forks source link

Fix bugs introduced when support for multi-port NI was added #4221

Closed milan-zededa closed 2 months ago

milan-zededa commented 2 months ago

As always, new features also introduce new bugs. Here is a PR with patches to address all bugs identified since the support for Local NI with multiple port was merged. Actually, one of them was introduced by configurable flow-logging, which was merged around the same time...

Avoid unecessary re-creation of static NI routes

We may have some discrepancies between the intended and the current state of NI static routes mostly caused by Linux kernel being smart and automatically setting some route fields which we leave undefined. From the reconciler point of view, the current vs. the intended route config therefore differ and it will unecessarily recreate the route. Let's therefore set all those automatically filled route fields already in the intended config (or ignore them in the comparison function) to avoid unecessary route recreations.

Properly sync routes of NI with multiple ports

When we are syncing the current state of the NI routing table, we should list all the routes in the table, even those with an output port no longer used by NI. Otherwise, if user removes port from NI (through a change in shared labels), current state re-sync will incorrectly omit obsolete routes and reconciler will thus not remove them (because they already appear removed from the reconciler point of view).

Allow ICMP and server-initiated DHCP traffic to enter device

Now that the flow logging is configurable, we no longer rely on the connection marking for device INPUT rules. Instead we use ACCEPT/DROP rules inside the filter table. But with this change, we forgot to add ICMP and DHCP rule to the filter/INPUT-device chain. As a consequence, device will for example not reply to ICMP pings... As for DHCP, most of the traffic is initiated by client and will match the "ctstate RELATED,ESTABLISHED" rule with the ACCEPT action, so this is not that much affected.

Avoid returning "< nil >" or zero UUID as route gateway IP/app

IP Route Gateway IP and Gateway App can be each empty (for e.g. connected route). But EVE should avoid publishing "< nil >" or zero UUID inside the NetworkInstanceInfo message.

OhmSpectator commented 2 months ago

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

milan-zededa commented 2 months ago

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

OhmSpectator commented 2 months ago

It's interesting. Why don't I see the running jobs here in the status? Everything is just green. But when I go to actions, I see the test suits are progressing.

OhmSpectator commented 2 months ago

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

milan-zededa commented 2 months ago

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

Actually most of them were found by other people in our company who tried EVE 13.0.0 and reported problems to me (as the default destination for networking issues).

OhmSpectator commented 2 months ago

@milan-zededa do you see how we could avoid merging the code that causes these fixes? Or do you think it's inevitable?

That's a very philosophical question. I don't think that I'm ready to answer it :)

Ok, then, may I ask how you found these issues? =)

Actually most of them were found by other people in our company who tried EVE 13.0.0 and reported problems to me (as the default destination for networking issues).

Tricky... So, we can consider it as in-hause post-merge integration testing...

OhmSpectator commented 2 months ago

@yash-zededa, do you have an idea why the Eden tests are skipped in this PR?...

yash-zededa commented 2 months ago

@yash-zededa, do you have an idea why the Eden tests are skipped in this PR?...

The tests were triggered on my approval. Checking what happened when you had approved.

yash-zededa commented 2 months ago

@OhmSpectator the test was triggered here

yash-zededa commented 2 months ago

@OhmSpectator the tests execution is aborted by GH and it is quite weird.

Seems to be related to concurrency, checking..

image
yash-zededa commented 2 months ago

Details

The concurrency is set up to run based on the PR number, so if there's an existing run for the same PR, the previous job should be canceled.

I'll keep an eye on all the executions of the lf-edge Test Suite.

OhmSpectator commented 2 months ago

Wow, all green