filbranden opened this issue 6 years ago
I've just skimmed this issue quite lightly. I think that the hotfix of make(chan string, 1) is acceptable, with the caveat that we probably should fix up the CoreOS library because our current handling of these sorts of issues is getting kinda ridiculous.
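For anyone skimming, here is a minimal standalone sketch of what the one-slot buffer buys us (the timings and the fake completion goroutine are illustrative, not the actual libcontainer or go-systemd code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	statusChan := make(chan string, 1) // the hotfix: one-slot buffer

	// Stand-in for the dbus handler delivering the completion late.
	go func() {
		time.Sleep(2 * time.Second)
		statusChan <- "done" // lands in the buffer, never blocks
		fmt.Println("writer: completion recorded, goroutine exits cleanly")
	}()

	select {
	case s := <-statusChan:
		fmt.Println("reader: job finished:", s)
	case <-time.After(1 * time.Second):
		fmt.Println("reader: timed out, continuing without the result")
	}

	time.Sleep(2 * time.Second) // long enough to see the late write succeed
}
```

The point is simply that the single pending completion always has somewhere to go, so the sender never blocks regardless of whether the reader is still listening.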
However, all of that being said, I have a more important question to ask -- why are people using the systemd cgroup driver so heavily? What is the use-case? Why does the ordinary cgroupfs driver not work for you? The reason I ask this (and I still haven't gotten an answer to this question -- even though I bring it up every time we have to deal with more systemd bugs) is that systemd is very notorious for being an incredibly "bad sport" when trying to use it in the way we do (it has had many bugs in the past -- even with Delegate support -- where it would break cgroup constraints for containers we created just because it felt like it). This is why it's no longer automatically used on systemd-based systems (you need to explicitly opt-in to it and this has been the case for several years), and it's why cgroupfs is used so widely by default.
So it would really help to know what problem the systemd cgroup code fixes that cgroupfs does not (and it should be noted that the systemd cgroup code actually uses the cgroupfs driver for most of its operations -- because the systemd interfaces are useless for us).
Hi @cyphar
Some reasons to use the systemd driver:

- Managing both containers and system services using the same cgroup interfaces. Otherwise you get containers in an "island" using a unique interface.
- Preparing for upcoming changes to the kernel cgroup interfaces. The kernel cgroup maintainer has been working closely with the systemd developers to ensure they're in sync. (This includes the unified hierarchy and cgroup v2, but in general new cgroup features are often discussed with the systemd folks before making it into the kernel.)
Regardless, some projects and distros are already using the systemd cgroup driver and that's unlikely to change. I still think we need to support those.
While I agree with some of your points about difficulties with the systemd cgroup driver, I think part of the problem is here in the libcontainer driver. It's quite buggy, as you can see. I'm planning to try to address that situation. I think that could alleviate the problems you describe a lot.
Anyways, the point is we do have users of it, so let's try to at least fix the bugs. I'm open to a larger process of evaluating how architectural changes to it might make it even more reliable, and currently planning to perhaps undertake such an effort.
Cheers, Filipe
> Managing both containers and system services using the same cgroup interfaces. Otherwise you get containers in an "island" using a unique interface.
By "the same cgroup interfaces" are you referring to the systemd helpers (because in reality there is only one set of cgroup interfaces offered by the kernel -- and it's definitely not the systemd ones :wink:)?
> Preparing for upcoming changes to the kernel cgroup interfaces. The kernel cgroup maintainer has been working closely with the systemd developers to ensure they're in sync. (This includes the unified hierarchy and cgroup v2, but in general new cgroup features are often discussed with the systemd folks before making it into the kernel.)
I agree that cgroupv2 is going to be a very large issue, though I guess I have a less optimistic view of how that is going to pan out (the removal of name= cgroups and the unified hierarchy means that you can no longer use cgroups to group processes in different hierarchies -- which means that systemd now tautologically must control the cgroups for the entire system, even though we have a history of systemd doing this very badly). I'm going to be honest, at this point I don't trust systemd to do things sanely if you want it to just manage cgroups (I've dealt with far too many bugs to give them the benefit of the doubt anymore).
(Also systemd broke all container runtimes in one release because they pushed cgroupv2 incorrectly and too early. If that was the opening act, I'm a little worried about what's going to follow it.)
> Regardless, some projects and distros are already using the systemd cgroup driver and that's unlikely to change. I still think we need to support those. [...] Anyways, the point is we do have users of it, so let's try to at least fix the bugs. I'm open to a larger process of evaluating how architectural changes to it might make it even more reliable, and currently planning to perhaps undertake such an effort.
We do need to support people using the code we already have, yes. But that doesn't mean we should be ignoring the fact that systemd has caused many more issues with the systemd cgroup driver than with the cgroupfs one -- if people can switch away, they should.
> While I agree with some of your points about difficulties with the systemd cgroup driver, I think part of the problem is here in the libcontainer driver. It's quite buggy, as you can see. I'm planning to try to address that situation. I think that could alleviate the problems you describe a lot.
I disagree (though I agree that the libcontainer driver is pretty awful in its current state). The bugs you're referring to in the original post all stem from bugs in systemd or DBus which resulted in messages not being passed to us, and then later attempts to fix the broken fixes. Whether or not we should've handled this edge case before is up for debate, but let's say that was an example of libcontainer broken-ness. This bug is quite minor in comparison to the bugs I was referring to.
The bugs I'm referring to are entirely on the systemd side of things. If you like, I can pull up our Bugzilla entries where we had several reproducible cases where creating a TransientUnit would result in systemd moving our container outside of the cgroup path we placed it in (and set limits on) and into a different cgroup that was effectively unconfined. There were several other bugs that had similarly confusing effects, and all of them were due to systemd moving processes around because we told it about them (not to mention that we in general are forced to place processes in the name=systemd cgroup because otherwise systemd might randomly decide to reorganise random processes on the machine). These were effectively security issues in runc because of systemd (maybe we should've gotten a CVE assigned -- because I think it also moved the container process outside of the devices cgroup, which is simply an insane thing to do -- though I might be misremembering).
I will happily admit that personally I don't think runc should have any systemd support because of these types of issues (and LXC works totally fine without a systemd driver). But since we already have one, I think any discussion of bugs in it should start with a caution that people shouldn't really be using it unless they need to use it. Other than that, the fix discussed looks fine.
Opening an issue here so we can try to understand the sequence of events, the current situation and what to do next.
cc @derekwaynecarr @vikaschoudhary16 @sjenning @mrunalp
The systemd cgroup driver calls StartTransientUnit and then waits on a channel (statusChan) for systemd to signal that the job completed, so the call is effectively synchronous. The driver also calls os.MkdirAll (which creates the directories, if they don't exist.)

The error check on StartTransientUnit is err != nil && !isUnitExists(err), so in the case where the unit already exists, blocking on the channel is a mistake, since there's no pending operation from systemd. (Personal note: this would have been unlikely to have happened in a language with proper exception handling.)

This broke kubelet (Kubernetes), since it does call libcontainer on the same cgroup multiple times, and the later calls will trigger the isUnitExists(err) situation that will make the libcontainer code block on a channel that will never receive any message. So kubelet startup gets stuck there. (BTW, I spent a couple of days trying to troubleshoot that myself too, seeing kubelet mysteriously fail to initialize when using the systemd cgroup driver.)

7. A PR then added a timeout to the wait on that channel.
a. That fixed the isUnitExists case we've discussed previously.
b. This PR also introduced a bug. Since now, if the timeout occurs, then no one is ever reading from that channel again, which means later on, when the job completes, the code writing to it will block, forever. (See startJob(), jobComplete() and the jobListener lock in coreos/go-systemd/dbus.)
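To make 7(b) concrete, here is a self-contained sketch of the failure mode (simplified names and timings, not the actual libcontainer or go-systemd code): the reader times out and walks away, and the eventual completion write then blocks forever because the channel is unbuffered.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	statusChan := make(chan string) // unbuffered, as before the hotfix

	// Stand-in for go-systemd's signal handler reporting job completion
	// after the caller has already given up.
	go func() {
		time.Sleep(2 * time.Second)
		fmt.Println("writer: job completed, sending result")
		statusChan <- "done"                        // blocks forever: nobody reads anymore
		fmt.Println("writer: sent (never printed)") // unreachable
	}()

	select {
	case s := <-statusChan:
		fmt.Println("reader: job finished:", s)
	case <-time.After(1 * time.Second):
		fmt.Println("reader: timed out, moving on")
	}

	// In the real code the blocked send happens while go-systemd is holding
	// its jobListener lock, so every later dbus job gets stuck behind it.
	time.Sleep(3 * time.Second)
	fmt.Println("main: exiting; the writer goroutine is still blocked")
}
```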
I think we need to address the actual bug in 7(b).

My proposals here are that, instead of rolling back, we fix the actual bugs that still exist here. I still assert that keeping the synchronous behavior of StartTransientUnit is the proper approach.
Regarding the potential fixes (or workarounds) for 7(b):
1. Use statusChan := make(chan string, 1), which buffers one write to this channel. That way, the write to the channel will not block, even if we stopped listening on it due to a timeout. This is very localized and doesn't really have bad side effects, even when timeouts occur. We could also increase the timeout from 1s to something a bit higher. I think this would probably be a good start.
2. Fix coreos/go-systemd/dbus, moving the write to the channel to outside the lock. (There's no harm in doing that outside the lock.) We might further consider doing that write from a go-routine, so if it blocks we don't really get stuck. (The downside there is that we risk leaking go-routines in this situation.) A rough sketch of this idea is at the end of this post.
3. Fix coreos/go-systemd/dbus to add the ability to cancel a blocking channel, in case of a timeout, so no write is done when not needed. But this looks like it might be quite intrusive to add, has some races of its own that have to be dealt with, so not sure whether it's actually worth it...
4. Handle this in coreos/go-systemd/dbus instead of here. (Not really sure, need to investigate.) We probably don't need that right away, but long term, it's something we should probably consider.

I think that's all for now... I hope this describes the situation fairly accurately, otherwise please leave comments (we can come back and edit the description to fix anything if there are mistakes.)
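As a postscript, here is a rough sketch of what option 2 could look like, with illustrative names (dispatcher, jobComplete) rather than the real coreos/go-systemd/dbus internals: the channel is looked up and removed under the lock, but the potentially blocking send happens outside it, from a goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type dispatcher struct {
	mu   sync.Mutex
	jobs map[string]chan string // job path -> completion channel
}

// jobComplete stands in for the dbus signal handler that reports job results.
func (d *dispatcher) jobComplete(path, result string) {
	d.mu.Lock()
	ch, ok := d.jobs[path]
	delete(d.jobs, path)
	d.mu.Unlock() // lock is released before any channel operation

	if ok {
		// Sending from a goroutine means a caller that already timed out
		// cannot block the dispatcher; the cost is one leaked goroutine
		// per abandoned job (the downside noted above).
		go func() { ch <- result }()
	}
}

func main() {
	ch := make(chan string)
	d := &dispatcher{jobs: map[string]chan string{"/job/1": ch}}

	// Simulate systemd finishing the job only after the caller's timeout.
	go func() {
		time.Sleep(2 * time.Second)
		d.jobComplete("/job/1", "done")
	}()

	select {
	case r := <-ch:
		fmt.Println("job finished:", r)
	case <-time.After(1 * time.Second):
		fmt.Println("timed out waiting for systemd; dispatcher stays usable")
	}
	time.Sleep(2 * time.Second) // the late completion still returns promptly
}
```

The trade-off is exactly the one mentioned above: a reader that already gave up can no longer wedge the dispatcher, but each abandoned job can leak one goroutine.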
Cheers, Filipe