moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0
3.36k stars 615 forks source link

Rewrite Allocator #2516

Open dperny opened 6 years ago

dperny commented 6 years ago

The Allocator code is bad. I'd hazard to say it's irredeemably bad. It's a source of constant bugs and breakages. Things are allocated twice or not allocated at all. The issues don't seem to be present (usually) in the levels deeper than swarmkit (libnetwork). Instead, libnetwork tends to give garbage responses when given garbage input.

Some examples of problems:

We should rewrite the whole thing from scratch. It's not a small project, and there's a lot of risk in a rewrite versus a refactoring. However, a clean slate would let us escape the most ingrained design flaws.

dperny commented 6 years ago

Another example. I happened to notice this while going through the code:

https://github.com/docker/swarmkit/blob/2543c41ec5bb2646d2cebba6869016efe484d687/manager/allocator/allocator.go#L97-L109

You can't take a pointer to a loop variable. the same memory location is reused for each iteration. The value pointed to changes every time. This was a landmine waiting to happen.

anshulpundir commented 6 years ago

Can you please include code snippets for the last two points in the problem statement. I tried reading the code but it wasn't obvious to me. @dperny

dperny commented 6 years ago

The network allocator, like, the IPAM stuff, doesn't live in the raft store. It has its own internal state. Before you can allocate new IP addresses, you have to "allocate" all of the old IP addresses that are assigned in the raft store. This involves iterating through every object (node, network, service, task) and "allocating" their assigned IP addresses. However, it's possible for an object to be committed only partially allocated, where it has some IP addresses assigned and some not. This can happen, for example, when a spec has been updated, but the allocator hasn't run on the object yet.

In the cnmallocator code, eventually you get to something like this, ipam.RequestAddress (in all code paths, slightly different for each object type).

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/cnmallocator/networkallocator.go#L620

The difference between initialization and new allocation is whether or not addr is empty or an ip address. We have tons of bugs where we're passing an empty IP address to this method before we've fully iterated through every object and populated the ipam with the existing IPs. There's nothing stopping this from happening, because this is the same code path for new and old allocations. If you have an object in the above-mentioned half-allocated state, it's really tricky to get this right.

This is in doNetworkInit. The magic boolean flag on each method call control whether or not we're initializing the state, or we're allocating new things. This boolean flag gets passed down through methods and sub-methods, and switches off whether or not we perform allocation.

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/network.go#L155-L175

dperny commented 6 years ago

This is done as part of #2615, and work on it is tracked on the new-allocator branch in this repo.

To complete the new allocator rewrite, we need to merge #2686, which removes the old allocator.

Additionally, there have been changes to the old allocator since the rewrite began, which must be "forward-ported" to the new one. These are:

olljanat commented 5 years ago

@dperny btw #2714 and #2725 looks to be merged (so they should me marked as done).

Is there any time plan for this one?