Rewrite Allocator - Githubissues

dperny commented 6 years ago

The Allocator code is bad. I'd hazard to say it's irredeemably bad. It's a source of constant bugs and breakages. Things are allocated twice or not allocated at all. The issues don't seem to be present (usually) in the levels deeper than swarmkit (libnetwork). Instead, libnetwork tends to give garbage responses when given garbage input.

Some examples of problems:

A bunch of code for handling multiple allocators, even though only 1 allocator exists and no other allocators are anywhere in the roadmap
Responsibilities muddled. All of the network allocation methods are just implemented straight on the Allocator object, even thought presence of a allocActor type in the Run function suggests they should be separate.
Code is under-commented and tangled, making parsing it often inscrutable. The tangled nature of the code makes piece-wise refactoring almost impossible, because of the fragile nature of the code.
The network allocator keeps a local state from which it makes decisions, which is supposed to be a mirror of the state as committed into raft. However, logic errors can cause this internal state to become irrecoverably inconsistent. If we're lucky, a leadership change fixes this before it becomes a problem. If we're not, bad local state is used to make bad decisions and create inconsistent distributed state. We need to reduce the footprint of the local state as much as feasible.
The allocator requires the local state to be initialized from the distributed state before performing any new allocations. However, the logic for allocation and initialization follow the same code paths, and use a boolean flag to separate them. Errors in the initialization logic cause allocations and deallocations to occur before the local state is fully initialized, resulting in duplicate IP addresses.

We should rewrite the whole thing from scratch. It's not a small project, and there's a lot of risk in a rewrite versus a refactoring. However, a clean slate would let us escape the most ingrained design flaws.

dperny commented 6 years ago

Another example. I happened to notice this while going through the code:

https://github.com/docker/swarmkit/blob/2543c41ec5bb2646d2cebba6869016efe484d687/manager/allocator/allocator.go#L97-L109

You can't take a pointer to a loop variable. the same memory location is reused for each iteration. The value pointed to changes every time. This was a landmine waiting to happen.

anshulpundir commented 6 years ago

Can you please include code snippets for the last two points in the problem statement. I tried reading the code but it wasn't obvious to me. @dperny

dperny commented 6 years ago

The network allocator, like, the IPAM stuff, doesn't live in the raft store. It has its own internal state. Before you can allocate new IP addresses, you have to "allocate" all of the old IP addresses that are assigned in the raft store. This involves iterating through every object (node, network, service, task) and "allocating" their assigned IP addresses. However, it's possible for an object to be committed only partially allocated, where it has some IP addresses assigned and some not. This can happen, for example, when a spec has been updated, but the allocator hasn't run on the object yet.

In the cnmallocator code, eventually you get to something like this, ipam.RequestAddress (in all code paths, slightly different for each object type).

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/cnmallocator/networkallocator.go#L620

The difference between initialization and new allocation is whether or not addr is empty or an ip address. We have tons of bugs where we're passing an empty IP address to this method before we've fully iterated through every object and populated the ipam with the existing IPs. There's nothing stopping this from happening, because this is the same code path for new and old allocations. If you have an object in the above-mentioned half-allocated state, it's really tricky to get this right.

This is in doNetworkInit. The magic boolean flag on each method call control whether or not we're initializing the state, or we're allocating new things. This boolean flag gets passed down through methods and sub-methods, and switches off whether or not we perform allocation.

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/network.go#L155-L175

dperny commented 6 years ago

This is done as part of #2615, and work on it is tracked on the new-allocator branch in this repo.

To complete the new allocator rewrite, we need to merge #2686, which removes the old allocator.

Additionally, there have been changes to the old allocator since the rewrite began, which must be "forward-ported" to the new one. These are:

[ ] #2725 - Only allocate attachments needed by nodes
[ ] #2714 - Global Default Address Pool feature support

olljanat commented 5 years ago

@dperny btw #2714 and #2725 looks to be merged (so they should me marked as done).

Is there any time plan for this one?

moby / swarmkit

Rewrite Allocator #2516