vishvananda / netlink

Simple netlink library for go.
Apache License 2.0
2.82k stars 737 forks source link

Preserve results when NLM_F_DUMP_INTR is set #1018

Closed robmry closed 2 weeks ago

robmry commented 4 weeks ago

When the kernel detects a change in a set of resources being dumped (walked-over) by a netlink call, it sets flag NLM_F_DUMP_INTR to indicate that the results may be incomplete or inconsistent.

Before https://github.com/vishvananda/netlink/pull/925 (in v1.2.1), the flag was ignored and results were returned without an error. With that change, response handling is aborted, results are discarded, and unix.EINTR is returned.

Depending on what the caller is looking for, the incomplete/inconsistent results may be usable - and that may be preferable to retrying the whole request (which could take unbounded time on a system that's seeing a lot of changes in a lot of data).

This PR introduces a new error, netlink.ErrDumpInterrupted that's specific to the NLM_F_DUMP_INTR case. Response handling completes as normal, and data is returned along with the error. So, the caller can use something like errors.Is(err, netlink.ErrDumpInterrupted), and take appropriate action.

If any callers are checking for errors.Is(err, unix.EINTR), that'll continue to work. But, this is a breaking change for any callers checking the error with a simple err == unix.EINTR.

robmry commented 4 weeks ago

@adrianchiris, @aboch ... could you take a look?

adrianchiris commented 3 weeks ago

@robmry just as an FYI, the check was introduced to align with how libmnl is executing netlink commands. see [1]

can you provide more info on the use-case where its ok and needed to rely partial results instead of retry ? (except the ci failure u linked).

TBH i favor sticking with how iproute2/libmnl implements execution of these cmds unless there is a good reason to deviate

[1] https://github.com/justmirror/libmnl/blob/f14732339a77a2e6ba9ed4ca99b347ce1dc60801/src/callback.c#L70

robmry commented 3 weeks ago

@robmry just as an FYI, the check was introduced to align with how libmnl is executing netlink commands. see [1]

can you provide more info on the use-case where its ok and needed to rely partial results instead of retry ? (except the ci failure u linked).

TBH i favor sticking with how iproute2/libmnl implements execution of these cmds unless there is a good reason to deviate

[1] https://github.com/justmirror/libmnl/blob/f14732339a77a2e6ba9ed4ca99b347ce1dc60801/src/callback.c#L70

Thanks @adrianchiris - the moby (dockerd) maintainers were concerned about unbounded retries on the new EINTR. It's not quite like a retry on a normal EINTR from a system call as it's (perhaps) more likely to happen again - and the kernel interface gives the option of using the partial data.

So, the preference is to be able to retry a limited number of times and, in the unlikely event the retries fail, fall-back to using the partial data - on the basis that it'll be no worse than it was before the change, and no problems have been noticed/reported.

As an example of where partial data might be used in a more considered way - in this package, LinkByName can fall back to using linkByNameDump, which searches a dump for the name. If there's no result with the name, there's no way to tell if the link exists without a retry. But, if there's at-least one entry with the name, it might be what the caller wants ... this package can't tell. Not-discarding the result and returning a Link with the error means the caller can decide - it might want to retry, use the Link as-is, or it might want to get a more up-to-date version using the returned Index in LinkByIndex (which doesn't do a dump).

Many callers are still likely to see the error is non-nil and just take an error path without looking at the data.

iproute2 does something similar here https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/lib/libnetlink.c#L901 ... it doesn't retry on NLM_F_DUMP_INTR, but it's not a daemon and the user can choose to retry when they get an error.

aboch commented 3 weeks ago

Yes, the goal of this library is to be in sync with iproute2 behavior as much as possible. If we start to add exceptions to accommodate this or this other project, we will end up introducing a lot of oddities in the code which will hinder the use of this library in future.

Per go guidelines, but in general a good practice for any language, if the the function returns an non nil error, any other returned variable must not be trusted.

The fact that you have each list method to still return the original INTR error is good, but if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

robmry commented 3 weeks ago

Yes, the goal of this library is to be in sync with iproute2 behavior as much as possible. If we start to add exceptions to accommodate this or this other project, we will end up introducing a lot of oddities in the code which will hinder the use of this library in future.

The iproute2 link I pasted above shows it handling NLM_F_DUMP_INTR as it's handled by this change.

There's nothing moby-specific about the change, or requirement ... it's discussed in my comment because I was responding to the request for a use-case, and moby is one. But, the LinkByName example is based on how it could be used by a function in this package (or its caller).

Per go guidelines, but in general a good practice for any language, if the the function returns an non nil error, any other returned variable must not be trusted.

I can't find anything like that in https://go.dev/doc/effective_go#errors - I also thought it was unusual, but io.ReadAll is a counter-example.

In this case, the error is very explicit that the results aren't to be trusted. The idea is to give the caller the choice (like the kernel and iproute2 do).

The fact that you have each list method to still return the original INTR error is good, but if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

I don't think that's possible - individual results from a collection are copied into the netlink response with locks held, so they'll be fine. The flag indicates that a generation number for the collection changed as those results were being accumulated. (A reference to a position in the collection is stored between netlink responses to a request, but that position can get out of date. Even with a single netlink response message, collections can be changed between reads of elements of the collection.)

corhere commented 3 weeks ago

if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause.

Before v1.2.1, NLM_F_DUMP_INTR flag was not checked when parsing messages. If it was possible for the kernel to return a truncated list, someone would have seen and complained about parsing errors in the same circumstances where we are currently complaining about EINTR. This library itself is very strong evidence that the kernel does not pass ill-formed messages to userspace when a dump is interrupted.

aboch commented 3 weeks ago

I guess you convinced me. @robmry can you please add a comment line on top of each exposed method saying the in case of EINTR returned error the collection of elements is still returned but may not be current?

adrianchiris commented 3 weeks ago

Hi all,

Adding some more points:

  1. kernel recommends to retry inconsistent dump commands [1]
  2. i dont think that iproute makes a habbit of using inconsistent results of dump commands. from grepping a bit in iproute code i found the example[2] listed in above comments used only in ip address flush[3]

ipaddr_flush() --- uses ---> rtnl_dump_filter_nc() --- uses ---> rtnl_dump_filter_l()

  1. NLM_F_DUMP_INTR is relevant only for dump commands (NLM_F_DUMP flag set ) so for non dump commands no need to do the check IMO.
  2. LinkByName in this library does a fallback to dump command only when kernel does not support getting the specific interface according to comments [4]

That said, i have no objection for the change as long as we still fail if results are inconsistent.

Thanks @adrianchiris - the moby (dockerd) maintainers were concerned about unbounded retries on the new EINTR. It's not quite like a retry on a normal EINTR from a system call as it's (perhaps) more likely to happen again - and the kernel interface gives the option of using the partial data.

please note that the kernel refers to this data as "inconsistent" rather then partial.

[1] https://github.com/torvalds/linux/blob/8d8d276ba2fb5f9ac4984f5c10ae60858090babc/Documentation/userspace-api/netlink/intro.rst#dump-consistency [2] https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/lib/libnetlink.c#L901 [3] https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/ip/ipaddress.c#L1996 [4] https://github.com/vishvananda/netlink/blob/a01829657b53687bcde90c030e2013e482783847/link_linux.go#L1871

robmry commented 3 weeks ago

I guess you convinced me. @robmry can you please add a comment line on top of each exposed method saying the in case of EINTR returned error the collection of elements is still returned but may not be current?

Done, except the error is ErrDumpInterrupted. Generally, the comment is ...

// If the returned error is [ErrDumpInterrupted], results may be inconsistent
// or incomplete.
robmry commented 3 weeks ago
  1. kernel recommends to retry inconsistent dump commands [1]

Yes, indeed. But, if the data is returned, the caller has the option of choosing to do something else.

  1. i dont think that iproute makes a habbit of using inconsistent results of dump commands. from grepping a bit in iproute code i found the example[2] listed in above comments used only in ip address flush[3]

ipaddr_flush() --- uses ---> rtnl_dump_filter_nc() --- uses ---> rtnl_dump_filter_l()

Also macro rtnl_dump_filter, which is used to process results for a lot of functions that make NLM_F_DUMP requests.

  1. NLM_F_DUMP_INTR is relevant only for dump commands (NLM_F_DUMP flag set ) so for non dump commands no need to do the check IMO.

Agreed - but this PR already only changes functions that do an NLM_F_DUMP.

  1. LinkByName in this library does a fallback to dump command only when kernel does not support getting the specific interface according to comments [4]

I think that's what I described in the example? ("in this package, LinkByName can fall back to using linkByNameDump, which searches a dump for the name"), and now in the godoc comment describing the ErrDumpInterrupted return. (Ditto for LinkByAlias.)

aboch commented 3 weeks ago

@robmry This project follows a "One change/fix/feature per pull request" and the "Squash and Merge" option is not enabled. Please squash your commits into one and push again.

robmry commented 3 weeks ago

@robmry This project follows a "One change/fix/feature per pull request" and the "Squash and Merge" option is not enabled. Please squash your commits into one and push again.

Oh, sure - done ...

I got rid of the gofmt changes, as they weren't directly related (only really there to stop my editor messing with code in the other commits). The other three commits were split to try to make reviews a bit easier:

(I've still got those changes ... happy to re-raise as separate PRs if preferred?)