opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Making systemd StartTransientUnit synchronous (mini post-mortem on that) #1780

Open filbranden opened 6 years ago

filbranden commented 6 years ago

Opening an issue here so we can try to understand the sequence of events, the current situation, and what to do next.

cc @derekwaynecarr @vikaschoudhary16 @sjenning @mrunalp

  1. The original behavior of the libcontainer systemd cgroup driver was to send systemd the StartTransientUnit request via D-Bus, but not wait for a reply.
  2. This caused a race condition, in that runc/libcontainer would create the directories itself using os.MkdirAll (which creates the directories if they don't exist).
  3. systemd (at least v219, the version in RHEL 7) will decide which controllers get created (which might not be the whole set) and will remove controllers that weren't supposed to exist.
  4. Due to this race, a controller directory might get created by runc/libcontainer and later be removed by systemd (which is still in the process of creating the cgroups for the scope).
  5. In order to fix this issue, PR #1683 attempted to make StartTransientUnit synchronous, so that the whole creation of the scope/slice by systemd is completed before runc/libcontainer does any cgroup processing.
     a. I believe this approach is appropriate. Waiting for the operation to complete before proceeding is the correct behavior here.
     b. The problem with PR #1683 was that the StartTransientUnit call breaks on errors only when err != nil && !isUnitExists(err), so in the case where the unit already exists, blocking on the channel is a mistake, since there's no pending operation from systemd. (Personal note: this would have been unlikely to happen in a language with proper exception handling.)
  6. The bug introduced in step 5(b) started causing trouble in kubelet (Kubernetes), since it calls libcontainer on the same cgroup multiple times, and the later calls trigger the isUnitExists(err) situation that makes the libcontainer code block on a channel that will never receive any message. So kubelet startup gets stuck there. (BTW, I spent a couple of days trying to troubleshoot that myself too, seeing kubelet mysteriously fail to initialize when using the systemd cgroup driver.)
  7. To address that bug, PR #1754 introduced a timeout on blocking on the channel. If an answer was not received within one second, we would stop waiting on the channel.
     a. A timeout is probably appropriate; blocking forever is really not good. I'm not sure one second is long enough for a timeout, though, and assuming everything went well in that case is not necessarily appropriate either. Perhaps this can be done better, maybe by retrying the call (which should be idempotent, since creating the same unit multiple times will just hit the isUnitExists case discussed previously).
     b. This PR also introduced a bug: if the timeout occurs, no one is ever reading from that channel again, which means that later, when the job completes, the code writing to it will block forever. (A sketch of this flow follows the list below.)
  8. I found the bug in 5(b) and pushed PR #1772 to fix it. In doing so, I kept the timeout added in step 7; after all, according to 7(a), having a timeout is definitely not inappropriate. At the time, I didn't realize bug 7(b) existed, so I didn't address it.
  9. Trying to backport these into the Kubernetes code base, in PR kubernetes/kubernetes#61926, @derekwaynecarr raised what I believe is the bug in 7(b) (since he added references to startJob(), jobComplete() and the jobListener lock in coreos/go-systemd/dbus).
     a. I think we need to address the actual bug in 7(b).
  10. In that same discussion, @vikaschoudhary16 suggests that the version of systemd in RHEL 7 has been "fixed"; I assume that refers to the behavior described in (3). I imagine this was a backport of an upstream patch on that behavior.
     a. I'd appreciate more details about this, including which upstream patch was used to fix this behavior.
     b. Regardless of the change in behavior in systemd, I still think the intent of the PR in (5) was correct: the call to StartTransientUnit should be blocking, and even if the race is benign, we should avoid any race between systemd and runc/libcontainer manipulating the cgroup tree.
  11. @sjenning suggests reverting all three PRs (here, in libcontainer), since, according to (10), the issue introduced by the race condition no longer exists, so we could revert to the original behavior.
     a. Once again, I don't think that's appropriate: even if it's benign, the race condition is still there and could come back to bite us in the future. Having the "create cgroup" operation be synchronous is the appropriate behavior here. In any other situation where we create a resource and then consume it, I'm sure we would want the same.
     b. Rather than reverting everything, I think we should address the bug in 7(b) and fix it.
  12. Small note on a circumstance that makes the bug even worse: the code that writes to the channel (and gets blocked when the bug in 7(b) triggers) is within a critical section where a lock is being held. That means every other operation that involves that lock will also get stuck.
     a. I think that's a bug too, and it should be addressed.
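To make the sequence above concrete, here is a minimal Go sketch of the flow described in steps 5-7 and 12, assuming the coreos/go-systemd dbus API. The function name, the error check, and the early-return behavior are illustrative simplifications, not the actual libcontainer code:

```go
package systemdsketch

import (
	"strings"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// isUnitExists loosely mirrors the check mentioned in 5(b); the real
// code inspects the D-Bus error name rather than the message text.
func isUnitExists(err error) bool {
	return err != nil && strings.Contains(err.Error(), "already exists")
}

func startTransientUnit(conn *systemdDbus.Conn, unitName string, properties []systemdDbus.Property) error {
	// Unbuffered: this is what makes bug 7(b) possible. go-systemd's
	// jobComplete() sends the job result on this channel while holding
	// its jobListener lock, so if nobody ever reads, that send -- and
	// everything else needing the lock (step 12) -- blocks forever.
	statusChan := make(chan string)

	if _, err := conn.StartTransientUnit(unitName, "replace", properties, statusChan); err != nil {
		if !isUnitExists(err) {
			return err
		}
		// Unit already exists: systemd queued no job, so nothing will
		// ever arrive on statusChan. Blocking here unconditionally was
		// bug 5(b); returning early (as in PR #1772) avoids it.
		return nil
	}

	// PR #1754's timeout: stop waiting after one second. If the timeout
	// fires, the eventual write by jobComplete() has no reader -- bug 7(b).
	select {
	case <-statusChan:
	case <-time.After(time.Second):
	}
	return nil
}
```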

My proposal here is that, instead of rolling back, we fix the actual bugs that still exist. I still assert that keeping StartTransientUnit synchronous is the proper approach.

Regarding the potential fixes (or workarounds) for 7(b):

I think that's all for now... I hope this describes the situation fairly accurately; otherwise, please leave comments (we can come back and edit the description if there are mistakes).

Cheers, Filipe

cyphar commented 6 years ago

I've just skimmed this issue quite lightly. I think that the hotfix of make(chan string, 1) is acceptable, with the caveat that we probably should fix up the CoreOS library because our current handling of these sorts of issues is getting kinda ridiculous.
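For completeness, a hedged sketch of why the make(chan string, 1) hotfix helps; the start callback below is just a stand-in for the StartTransientUnit call, not a real API:

```go
package systemdsketch

import "time"

// startAndWait demonstrates the buffered-channel hotfix. With a buffer
// of one, the single job-result write that go-systemd's jobComplete()
// performs (while holding its jobListener lock) always succeeds
// immediately, even if the reader below has already timed out and gone
// away, so the lock is released promptly.
func startAndWait(start func(ch chan<- string) error) error {
	statusChan := make(chan string, 1) // buffered: the hotfix

	// start stands in for the StartTransientUnit call, which registers
	// statusChan to receive the job result.
	if err := start(statusChan); err != nil {
		return err
	}

	select {
	case <-statusChan:
		// job completed (result string ignored here for brevity)
	case <-time.After(time.Second):
		// timed out; a late write is absorbed by the buffer instead of
		// blocking the sender forever
	}
	return nil
}
```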

However, all of that being said, I have a more important question to ask -- why are people using the systemd cgroup driver so heavily? What is the use-case? Why does the ordinary cgroupfs driver not work for you? The reason I ask this (and I still haven't gotten an answer to this question -- even though I bring it up every time we have to deal with more systemd bugs) is that systemd is very notorious for being an incredibly "bad sport" when trying to use it in the way we do (it has had many bugs in the past -- even with Delegate support -- where it would break cgroup constraints for containers we created just because it felt like it). This is why it's no longer automatically used on systemd-based systems (you need to explicitly opt-in to it and this has been the case for several years), and it's why cgroupfs is used so widely by default.

So it would really help to know what problem the systemd cgroup code fixes that is not fixed by cgroupfs (and it should be noted that the systemd cgroup code actually uses the cgroupfs driver for most of its operations -- because the systemd interfaces are useless for us).

filbranden commented 6 years ago

Hi @cyphar

Some reasons to use the systemd driver:

Regardless, some projects and distros are already using the systemd cgroup driver and that's unlikely to change. I still think we need to support those.

While I agree with some of your points about difficulties with the systemd cgroup driver, I think part of the problem is here in the libcontainer driver. It's quite buggy, as you can see. I'm planning to try to address that situation. I think that could alleviate the problems you describe a lot.

Anyways, the point is we do have users of it, so let's try to at least fix the bugs. I'm open to a larger process of evaluating how architectural changes might make it even more reliable, and am currently planning to perhaps undertake such an effort.

Cheers, Filipe

cyphar commented 6 years ago

> Managing both containers and system services using the same cgroup interfaces. Otherwise you get containers in an "island" using a unique interface.

By "the same cgroup interfaces" are you referring to the systemd helpers (because in reality there is only one set of cgroup interfaces offered by the kernel -- and it's definitely not the systemd ones :wink:)?

> Preparing for upcoming changes to the kernel cgroup interfaces. The kernel cgroup maintainer has been working closely with the systemd developers to ensure they're in sync. (This includes the unified hierarchy and cgroup v2, but in general new cgroup features are being discussed with systemd folks often before making it into the kernel.)

I agree that cgroupv2 is going to be a very large issue, though I guess I have a less optimistic view of how that is going to pan out (the removal of name= cgroups and the unified hierarchy means that you can no longer use cgroups to group processes in different hierarchies -- which means that systemd now tautologically must control the cgroups for the entire system, even though we have a history of systemd doing this very badly). I'm going to be honest, at this point I don't trust systemd to do things sanely if you want it to just manage cgroups (I've dealt with far too many bugs to give them the benefit of the doubt anymore).

(Also systemd broke all container runtimes in one release because they pushed cgroupv2 incorrectly and too early. If that was the opening act, I'm a little worried about what's going to follow it.)

> Regardless, some projects and distros are already using the systemd cgroup driver and that's unlikely to change. I still think we need to support those. [...] Anyways, the point is we do have users of it, so let's try to at least fix the bugs. I'm open to a larger process of evaluating how architectural changes might make it even more reliable, and am currently planning to perhaps undertake such an effort.

We do need to support people using the code we already have, yes. But that doesn't mean we should be ignoring the fact that systemd has caused many more issues when using the systemd cgroup driver over the cgroupfs one -- if people can switch away they should.

> While I agree with some of your points about difficulties with the systemd cgroup driver, I think part of the problem is here in the libcontainer driver. It's quite buggy, as you can see. I'm planning to try to address that situation. I think that could alleviate the problems you describe a lot.

I disagree (though I agree that the libcontainer driver is pretty awful in its current state). The bugs you're referring to in the original post all stem from bugs in systemd or D-Bus which resulted in messages not being passed to us, and then from later attempts to fix the broken fixes. Whether or not we should've handled this edge case before is up for debate, but let's say that was an example of libcontainer brokenness. This bug is quite minor in comparison to the bugs I was referring to.

The bugs I'm referring to are entirely on the systemd side of things. If you like, I can pull up our Bugzilla entries where we had several reproducible cases where creating a TransientUnit would result in systemd moving our container outside of the cgroup path we placed it in (and set limits on) and into a different cgroup that was effectively unconfined. There were several other bugs that had similarly confusing effects, and all of them were due to systemd moving processes around because we told it about them (not to mention that we in general are forced to place processes in the name=systemd cgroup because otherwise systemd might randomly decide to reorganise random processes on the machine). These were effectively security issues in runc because of systemd (maybe we should've gotten a CVE assigned -- because I think it also moved the container process outside of the devices cgroup which is simply an insane thing to do -- though I might be misremembering).

I will happily admit that personally I don't think runc should have any systemd support because of these types of issues (and LXC works totally fine without a systemd driver). But since we already have one, I think any discussion of bugs in it should start with a caution that people shouldn't really be using it unless they need to use it. Other than that, the fix discussed looks fine.