projectatomic / oci-systemd-hook

OCI hook to enable running systemd in a container
GNU General Public License v3.0
64 stars 26 forks source link

running prestart hook 0 caused error #71

Open pschiffe opened 7 years ago

pschiffe commented 7 years ago

Hello,

oci-systemd-hook on latest RHEL atomic host doesn't work with runc.

# atomic host status
State: idle
Deployments:
● rhel-atomic-host:rhel-atomic-host/7/x86_64/standard
             Version: 7.3.6 (2017-06-23 16:20:45)
              Commit: e073a47baa605a99632904e4e05692064302afd8769a15290d8ebe8dbfd3c81b
# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-6.gite800860.el7.x86_64

This is the error I'm getting:

runc[15209]: container_linux.go:259: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

Config.json template is here: https://github.com/pschiffe/gce-system-container/blob/master/image/config.json.template

Can you help? How to debug?

rhatdan commented 7 years ago

It might be a mismatch between the version of runc and the version of oci-systemd-hook? Did those two versions ship together? How did you call out to oci-systemd-hook?

pschiffe commented 7 years ago

Those two packages should align correctly as both are from the latest version of rhel atomic host (7.3.6):

# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-6.gite800860.el7.x86_64

This is how I'm calling hooks in the config.json.template:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ]
},
rhatdan commented 7 years ago

The hooks expect the first option to the hook to be the prestart/poststart. Add those arguments and it should work.

Hooks can not tell which phase they are running unless you pass in the argv[1]

pschiffe commented 7 years ago

Unfortunately, even with the args I see the same error:

runc[16416]: container_linux.go:259: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

I've updated the config.json hooks to look like this:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "prestart" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "prestart" ]
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "poststop" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "poststop" ]
            }
        ]
    },

BTW, my previous configuration (without the args) worked fine on rhel atomic host 7.3.2.

pschiffe commented 7 years ago

I've tried the same image on centos atomic host continuous:

# atomic host status
State: idle
Deployments:
● centos-atomic-continuous:centos-atomic-host/7/x86_64/devel/continuous
                   Version: 7.2017.490 (2017-07-19 16:05:22)
                    Commit: a948637e77018755831659c791a8cf8595f6d267d4fddba24a6b42f0ec6f1bd7

# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-9.git6394544.el7.x86_64

But I see the same error:

runc[12188]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""
rhatdan commented 7 years ago

@mrunalp WDYT?

@pschiffe Are you seeing anything in the journal? Are you seeing the same thing in Fedora?

pschiffe commented 7 years ago

Hmm, I see something more in journal:

Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting Linux Guest Environment for Google Compute Engine...
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12188-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12188-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Created slice libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Removed slice libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Started libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal prestart[12201]: systemdhook <error>: root not found in state: Success
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopped libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal poststop[12204]: systemdhook <error>: root not found in state: Success
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal runc[12188]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service: main process exited, code=exited, status=1/FAILURE
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.
giuseppe commented 7 years ago

@pschiffe you should specify args as:

        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "prestart" ]
            },

as they are passed as they are to exec. args[0] can really be anything as it is not used.

It would be nice if it still works without any arg as it used to work before, I am taking a look right now.

pschiffe commented 7 years ago

Didn't help, still the same error:

Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting Linux Guest Environment for Google Compute Engine...
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12731-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12731-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Created slice libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Removed slice libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Started libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal oci-systemd-hook[12744]: systemdhook <error>: root not found in state: Success
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopped libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal oci-systemd-hook[12747]: systemdhook <error>: root not found in state: Success
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal runc[12731]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service: main process exited, code=exited, status=1/FAILURE
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service holdoff time over, scheduling restart.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: start request repeated too quickly for gce-agents.service
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Failed to start Linux Guest Environment for Google Compute Engine.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.
    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "prestart" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "oci-register-machine", "prestart" ]
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "poststop" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "oci-register-machine", "poststop" ]
            }
        ]
    },
rhatdan commented 7 years ago

This means that the config file you generated does not indicate with the "root" is, oci-systemd-hook and oci-umount need to go to the root of the container to mount or umount content, can you add this to your config?

rhatdan commented 7 years ago

@giuseppe we can not rely on the fact that he pid=0 or not, since runc specifies three ways to run a hook. prestart poststart poststop We can't tell the difference betwee a prestart and a poststart call. Since we are building this feature into cri-o, we don't want to ignore the specification.

pschiffe commented 7 years ago

@rhatdan do you have an example how to specify the "root"? I can't find what you mean. I have in the config the following:

    "root": {
        "path": "rootfs",
        "readonly": true
    },

and in the process section, there is "cwd": "/",

giuseppe commented 7 years ago

We can decide if be backward compatible and handle only prestart and poststop when the type is not specified. On the other hand there are not probably many users out there so we can just enforce it and not worry in the future of supporting this additional case.

Anyway the current development version support the case where the hook is not specified. The issue reported here depend on the hook version that is too old. The issue is fixed upstream with:

commit 69858facb2597d4bc0df3ed405876d6c167b943f Author: Daniel J Walsh dwalsh@redhat.com Date: Fri Jun 23 19:31:05 2017 +0000

Needs this change to work directly with runc and cri-o

The path bundle path passed to a container is called bundle, not bundlepath.
Also root is not in the state file, but can be retrieved from the config.json.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>

Also this other commit is required, otherwise oci-systemd-hook will just segfault:

commit 40ab578cc3550b25c52e64e43d82a65d23e3365e Author: Jason Wessel jason.wessel@windriver.com Date: Wed Jul 12 09:21:44 2017 -0700

Allow container definitions where rootfs is not an absolute path

In earlier versions of the runc frame work the rootfs path was passed
as a key with the initial json that was passed on the stdin and it was
automatically computed to be an absolute path.

This translation to an absolute path must be done in the hook based on
the bundlePath.  This allows the config.json to be relocated by the
container hosting system storage without modifying the config.json.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
rhatdan commented 7 years ago

So we need to get this updated in RHEL7 package. @lsm5 Can you create a new oci-systemd-hook package for RHEL?

fkluknav commented 7 years ago

rebased to 1e84754 in rhel 7.4 branch

pschiffe commented 7 years ago

Would it be possible to also get it to the rhel atomic host 7.3?

rhatdan commented 7 years ago

There is only on stream of extras, so if you install oci-umount after it gets shipped, it wil be placed into RHEL7.3

wking commented 6 years ago

Some of the discussion here is about how the hook decides which stage its running in. The most portable approach to that is to use status in stdin's state JSON (kubernetes-incubator/cri-o#1360). I have an open PR for that against the very-similar oci-umount: projectatomic/oci-umount#35. If/when that lands I'll cherry-pick it over here.

caoruidong commented 6 years ago

something maybe useful: https://github.com/intel/cc-oci-runtime/issues/270

runc is passing root rather than bundlePath to the hooks. oci-systemd-hook is compounding the problem by requiring the erroneous root value.