hyperhq / runv

Hypervisor-based Runtime for OCI
Apache License 2.0
828 stars 129 forks source link

notify event when start container failed #370

Closed Crazykev closed 7 years ago

Crazykev commented 7 years ago

add a event notify event when start container failed, fix #366

Signed-off-by: Crazykev crazykev@zju.edu.cn

WeiZhang555 commented 7 years ago

This can be one solution, but not enough. here: https://github.com/hyperhq/runv/blob/master/supervisor/hyperpod.go#L415

func (hp *HyperPod) createContainer(container, bundlePath, stdin, stdout, stderr string, spec *specs.Spec) (*Container, error) {
...
    c.run(p)
    return c, nil
}

Indeed c.run(p) failed but createContainer didn't check the error, it return successfully and docker will regard the container as running successfully, but it's not true.

Your patch will let container be running then exited but it's not good enough.

What's worse, the c.start(p) can sometimes hang, and the codes you added won't be reached. To fully solve this problem:

  1. we have to find why the start() function hangs, or the bug will always be there.
  2. we should let c.run(p) return error/nil when container start() failed/succeed, we can't simply ignore the error from run(p), that makes no sense.
Crazykev commented 7 years ago

@WeiZhang555 Actuall, c.run(p) will(and should) feedback the result through Event mechanism, client will get feedback from here: https://github.com/hyperhq/runv/blob/master/start.go#L266 , After c.CreateContainer at L252. The reason why it hangs(in my test) is that if c.start(p) failed with error from here: https://github.com/hyperhq/runv/blob/master/supervisor/container.go#L35 will reap and return, without feedback event. So event loop from client(start.go#L266) will never get any event, that is hang forever.

The patch here just solve this problem, if start container failed in c.start(p), make sure client get notified. But there is still something bad, like we can't get the real reason why start failed.

What's worse, the c.start(p) can sometimes hang,

Can you provide more information about this, log or something ?

WeiZhang555 commented 7 years ago

Can you provide more information about this, log or something ?

This happens during my pressure test, see #366. Debug log is too long, and I can't find useful error message from the log.

I can see you are running your test with runv native command (e.g. runv start), what I was talking about is calling runv from docker daemon, which means docker daemon ==> runv-containerd. You added an exit event, which will notify docker daemon to mark the container as exited, it's good and improved the leaking container problem a little bit. But INDEED the container is not exited, it was never started successfully, so this mechanism is not quite right.

Besides that, c.start() hangs will make this situation worse, we need to find why it hangs.

Sorry again that I can't provide the log, because it's truly toooooo long and I can't find useful info for this bug.

wrouesnel commented 7 years ago

I don't think sending exit before start-container makes sense. From a libcontainerd API consumer perspective, nothing happened because a start failure implies the VM itself failed to bring the container up - it never actually entered it's normal runtime process.

Crazykev commented 7 years ago

@wrouesnel Yes, I totally agree with you that we should not

sending exit before start-container

This should be a new state/singal, ie internal failure, to be sent to client to explain why we start container failed instead of hang forever for now.

@WeiZhang555 And also, There are more things to change in runv, so maybe close this or consider again whether this issue still exist in "new" runv. Sorry for the delay.

laijs commented 7 years ago

@Crazykev @wrouesnel @WeiZhang555 runv cli is being improved continuously and had just been refactored at #537 . As a result, it seams that the changes in this pr is outdated. Thank you for your contribution.