thom311 / libnl

Netlink Library Suite
GNU Lesser General Public License v2.1
419 stars 311 forks source link

route: fix IPv6 ecmp route deleted nexthop matching #382

Closed KanjiMonster closed 4 months ago

KanjiMonster commented 4 months ago

When the kernel sends a ECMP route update with just the deleted nexthop, the nexthop will have no associated weight, and its flags may indicate that it is dead:

    route_update: RTM_DELROUTE
    new route:
    inet6 default table main type unicast <DEAD,>
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 <dead,>
    old route:
    inet6 default table main type unicast
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8344 dev port49 weight 0 <>
        nexthop via fe80::b226:28ff:fe62:d400 dev port3 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8349 dev port54 weight 0 <>

Since we are comparing the nexthops strictly with all attributes, we can never match the deleted nexthop. This causes libnl to fail to remove the deleted nexthop from the route, and consequently send out a nop-update and a desync of the route in the cache and in the kernel.

Fix this by ignoring NH_ATTR_FLAGS (0x1) and NH_ATTR_WEIGHT (0x2) when comparing nexthops to properly match the deleted one.

Fixes: 29b71371e764 ("route cache: Fix handling of ipv6 multipath routes")

KanjiMonster commented 4 months ago

The output were from some strategically places debug prints at the top of route_update(). No idea if this is a changed behavior in recent kernels, or if it was always broken. This is with kernel 6.6 FWIW.

I did not touch adding a nexthop, since I didn't observe this case while testing/debugging, and I was unsure if it is fixable, since we would be then potentially adding a nexthop with no associated weight (since new "added" route isn't a multipath route?).

All added nexthops had routes with all nexthops in new, so I'm not even sure if this is a case that can happen (anymore).

thom311 commented 4 months ago

Fixes: 5c3aea338b37 ("route: fix check for ecmp route nexthop deletion")

I don't find this reference. Do you have a better one?

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

Note that there are many problems with caching of the routes in libnl3. In general, and specifically to IPv6 ECMP routes. Anyway, the patch doesn't make it worse, so ok!

Thank you.

thom311 commented 4 months ago

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

It also would warrant a code comment, about why those are ignored.

KanjiMonster commented 4 months ago

Fixes: 5c3aea338b37 ("route: fix check for ecmp route nexthop deletion")

I don't find this reference. Do you have a better one?

Oops, looks like I copied the wrong commit hash. Updated with the correct one.

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NH_ATTR_* defines into lib/netlink/route/nexhop.h though. Should I do that?

Regardless, I agree it needs a comment explaining the reasoning.

Note that there are many problems with caching of the routes in libnl3. In general, and specifically to IPv6 ECMP routes. Anyway, the patch doesn't make it worse, so ok!

Do you have any specific issues in mind? I'm open to taking a stab at them, since I'm currently working on trying to work with IPv6 ECMP routes, so knowing about the issues (and maybe even fixing them) would be nice.

thom311 commented 4 months ago

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NHATTR* defines into lib/netlink/route/nexhop.h though. Should I do that?

Ah right. nexthop.h is a public header, so better not there. lib/route/nl-route.h could be the right place (and rename to NL_ROUTE_NEXTHOP_ATTR_*).

But arguably, a code comment also suffices explaining what the 3 stands for.

Do you have any specific issues in mind?

for one, that the identity operator of routes is wrong. See for example oo_compare and oo_id_attrs in route_obj_ops. It would be closer to what NetworkManager does here and here.

Then, handling ip route replace (NLM_F_REPLACE) is not correct. The flag only tells that another route was replaced, but we don't immeditely know which one. Theoretically, we would need to keep the routes sorted in the order in which we receive them (ip route order) and make sure that events properly APPEND/PREPEND in that order. For NetworkManager, I gave up on that, it's basically impossible to get this right based on the rtnetlink events. If you get it wrong, the cache is inconsistent. It's better on every replace to assume we don't know the cache's content anymore, and re-sync. It's expensive, but correct.

Finally, as you have noticed, IPv6 ECMP routes are represented quite oddly on the rtnetlink API. In NetworkManger's cache, a IPv6 route is always a singlehop route. The twist is only that a single RTM_NEWROUTE/RTM_DELROUTE can contain multiple next hops, which however are interpreted as as if receiving a list of multiple single-hop events here.

All in all, it is IMO not possible to maintain a consistent cache based on the current events from kernel. NetworkManager gets it mostly right, by triggering expensive sync-all. It also probably won't fail badly in face of a cache inconsistancy.

Patch welcome. But it will require a large effort.

thom311 commented 4 months ago

also, there are cases where kernel will remove a route and not send a RTM_DELROUTE event. Because the user is supposed to understand that from circumstantial events.

NetworkManager doesn't want to to do that (make such circumstantial decisions based on potentially invalid cache content). Instead, in such cases it also triggers a re-sync (here. But that doesn't really work for libnl, because the libnl cache is about routes only. Somehow a even for addresses would trigger a resync for the route cache.

KanjiMonster commented 4 months ago

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NHATTR* defines into lib/netlink/route/nexhop.h though. Should I do that?

Ah right. nexthop.h is a public header, so better not there. lib/route/nl-route.h could be the right place (and rename to NL_ROUTE_NEXTHOP_ATTR_*).

But arguably, a code comment also suffices explaining what the 3 stands for.

I'll add a comment explaining it.

Though I feel moving the _ATTR_ defines to the public API would be arguable, since we expose a _compare() function that takes a mask, but we never provide any definition for the values you can pass there, so ~0 is the only value option you currently pass.

But if we do that, it should be done for all objects, not just nexthops.

Do you have any specific issues in mind?

for one, that the identity operator of routes is wrong. See for example oo_compare and oo_id_attrs in route_obj_ops. It would be closer to what NetworkManager does here and here.

Then, handling ip route replace (NLM_F_REPLACE) is not correct. The flag only tells that another route was replaced, but we don't immeditely know which one. Theoretically, we would need to keep the routes sorted in the order in which we receive them (ip route order) and make sure that events properly APPEND/PREPEND in that order. For NetworkManager, I gave up on that, it's basically impossible to get this right based on the rtnetlink events. If you get it wrong, the cache is inconsistent. It's better on every replace to assume we don't know the cache's content anymore, and re-sync. It's expensive, but correct.

Finally, as you have noticed, IPv6 ECMP routes are represented quite oddly on the rtnetlink API. In NetworkManger's cache, a IPv6 route is always a singlehop route. The twist is only that a single RTM_NEWROUTE/RTM_DELROUTE can contain multiple next hops, which however are interpreted as as if receiving a list of multiple single-hop events here.

All in all, it is IMO not possible to maintain a consistent cache based on the current events from kernel. NetworkManager gets it mostly right, by triggering expensive sync-all. It also probably won't fail badly in face of a cache inconsistancy.

Patch welcome. But it will require a large effort.

Ah, that indeed sounds like a lot of work.

also, there are cases where kernel will remove a route and not send a RTM_DELROUTE event. Because the user is supposed to understand that from circumstantial events.

Right, I already noticed that deleting a nexthop created via the nexthop API will silently delete all (non-ECMP) routes using it.

To fix that, we would need to allow nexthop updates to update the route cache. And probably similar triggers for other events that causes silent route deletion.

So far our main use case (for ECMP routes) is listening to netlink and apply routes created by FRR to a switch ASIC, and FRR is nice enough to delete routes before deleting nexthops, for which the current implementation is sufficient.

thom311 commented 4 months ago

Though I feel moving the ATTR defines to the public API would be arguable, since we expose a _compare() function that takes a mask, but we never provide any definition for the values you can pass there, so ~0 is the only value option you currently pass.

True, but IMO that is more of an API ugliness of the compare functions. I don't think that all ATTR flags should become public API.

It's partly because libnl-3-route.so uses libnl-3.so as a shared library, so it needs to be able to specify those flags. But external users, don't really have anything to do with them.

thom311 commented 4 months ago

To fix that, we would need to allow nexthop updates to update the route cache. And probably similar triggers for other events that causes silent route deletion.

Oh, I wasn't aware that changes to nexthops can cause (silent) changes to the routes. That's seems bad. NetworkManager doesn't watch the nexthops and would miss such changes. Oh well...

KanjiMonster commented 4 months ago

Added a comment, and changed the ~3 to a ~0x3 to make it more clear this is a bitmask.

Oh, I wasn't aware that changes to nexthops can cause (silent) changes to the routes. That's seems bad. NetworkManager doesn't watch the nexthops and would miss such changes. Oh well...

IIRC this only applies to IPv4, not IPv6, but if you create a nexthop, then add a route via this nexthop (via nhid), then delete the nexthop, you will get a netlink notification about then nexthop being deleted, but not the route itself.

True, but IMO that is more of an API ugliness of the compare functions. I don't think that all ATTR flags should become public API.

There's another place where having access to the ATTR flags would be benefical: when using the callbacks for cache changes, libnl passes a diff bitmask to the callback on NL_ACT_CHANGE notifications. Having access to ATTR flags would allow users to know which values changed without having to compare each attribute between the old and new object themselves (again).

thom311 commented 4 months ago

There's another place where having access to the ATTR flags would be benefical

Yes, I am not saying it's useless.

But it would mean to expose (and commit) to quite a large new API (all the attributes), which limits future changes. It's high effort to expose this, and to maintain. Unless there is a strong request, better don't...

KanjiMonster commented 4 months ago

But it would mean to expose (and commit) to quite a large new API (all the attributes), which limits future changes. It's high effort to expose this, and to maintain. Unless there is a strong request, better don't...

Right, so if I would propose doing it, I would only do so with an actual user. Our product currently does all the comparisons itself, and would profit from having access to the meaning of the diff, but we currently don't have the resources for the rather large conversion to change this without an actual gain in functionality, just efficiency.

thom311 commented 4 months ago

merged. Thank you!!!

thom311 commented 4 months ago

Right, so if I would propose doing it, I would only do so with an actual user. Our product currently does all the comparisons itself, and would profit from having access to the meaning of the diff, but we currently don't have the resources for the rather large conversion to change this without an actual gain in functionality, just efficiency.

It might not be that much effort. Just move the defines to public headers, and give them a consistent, good name. E.g. NL_OBJATTR_. The main burden comes from in the future not breaking that API (but that's doable). If you wish, patch welcome (start small).