davidpelaez commented 10 years ago

Possible slices are hardcoded in the source; any slice should be accepted if installed

PR #179 is a nice addition that was missing. However, If I wanted to have extra slices I'd have to recompile everything because the validation of the --slice flag is made against an array defined in the code:

(https://github.com/jwhonce/geard/blob/f2f109a5f6a1ef5e678655d72b538204ecd88562/containers/jobs/init.go#L54-L58)

Wouldn't it be best to check for the presence of the slice in systemd instead of it being limited to the slices in the source?

I think that for the init job the default slices are great, but according to how many resources a system have, those definitions can fall short and may not reflect a specific case logic organization of containers. e.g: If I wanted to have containers related to cluster admins tasks on one side and public apps in another currently I don't have many alternatives for this (besides patching).

Also, there's a problem with potentially ignored resources (memory in this case). Since slices group resources for all their children, if I have a machine with 8gb of memory, the slice would left a lot of it unused (~7gb - system usage) even if the container-large.slice is used.*\ Makes sense for development, but not for big nodes than can handle many containers.

I can replace the slice file in /etc/systemd/system but that feels very hackish to me. Also I'm wondering if alternative or complementary to slices, individual containers should be the ones having memory limits, like a dyno in Heroku, I know how much a container gets.

Maybe I'm missing something here about why slices were chosen and not container memory accounting. I'd say that a case where I can have a slice without memory limits and containers with specific limits seems more logical to me.

Any thoughts?

*\ this is what I understood from the systemd documentation, but to be honest there aren't that many mentions to slices out there, so please correct me if I'm wrong.

davidpelaez commented 10 years ago

I was running a memory test using a script found here:

http://stackoverflow.com/questions/4964799/write-a-bash-shell-script-that-consumes-a-constant-amount-of-ram-for-a-user-defi

So basically I created a busybox container that uses a lot of memory.

Then I run systemd-cgtop and I think I simply don't understand how the slices work, how container usage of memory is limited, etc. check this ss:

memtest

I would've thought that the containers wouldn't reach beyond the 512mb memory but they clearly did.

This was made in the vagrant vm defined in the repo. I thought it post this here probably to clarify doubts about resource limiting that could make it to the docs

davidpelaez commented 10 years ago

Running the same container with the memory flag on docker does limit the memory access of the container docker run -ti --rm -m 50m memtest

davidpelaez commented 10 years ago

changes the name of the issue because it's now broader!

smarterclayton commented 10 years ago

I missed this before - sorry for the delay. State of cgroups:

Right now the cgroup that "docker run" is executed under isn't inherited by Docker - long term we think the right solution for most docker execution patterns is that the docker container processes actually run in the context they're executed (so they can be proper children of the init system). That's currently work going on upstream (referred to as foreground execution or independent Docker engines)
Until that happens, we wanted to inherit the systemd cgroup by passing it through docker run. There was a patch for this but we're trying to sort out the "--opt" behavior upstream to allow this stuff.
The list of cgroups shouldn't be hardcoded - any child of container.slice should be allowed. An admin could then go and create the slice across a cluster and have it picked up.
There's a bit of debate going on between passing cgroups parameters down to containers (I want you to have X mem, Y cpu, Z network, etc) vs using predefined slices. Both have advantages in different use cases - it seems like we'd want to support both.

openshift / geard

Slice function is unclear and doesn't limit memory #195

Possible slices are hardcoded in the source; any slice should be accepted if installed