Assigned to team/foundation for further triage.
@matthewavery I'm looking to you to run with this...
@cgtexmex Alright I will triage and investigate today.
Note: these are some preliminary notes from my initial triage of this failure.
Container of Interest: d2c46697a1b36066acec660a3676cd69be9534bfa592f17713ea1846edb7ae78
Looks like in the portlayer we are hitting the sub-case below our poweroff check, and because of that we never look in `etasks` when perhaps we should. Either the runtime was nil or the VM changed state right after we checked (which shouldn't really happen, to my knowledge). Because of this we never switch to checking whether the `etasks` list has our ID, since `task` is never pointed at `etasks`.
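To make the failing branch concrete, here is a minimal, self-contained sketch of the kind of fallback logic described above. This is not the actual portlayer source; the types and names are made up for illustration. The point is that the fallback to the exec task map only happens on one branch, so when the poweroff sub-case is hit the lookup never consults it:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the portlayer types; names are
// illustrative only and do not match the real source.
type SessionConfig struct{ ID string }

type ExecConfig struct {
	Sessions map[string]*SessionConfig
	Execs    map[string]*SessionConfig
}

// lookupTask mirrors the flawed branching described above: the fallback to
// the Execs map only happens when the power-state check passes, so if the
// runtime is nil or the state changed under us we never switch `tasks` to
// point at Execs and the exec task is reported as unknown.
func lookupTask(cfg *ExecConfig, poweredOn bool, id string) (*SessionConfig, error) {
	tasks := cfg.Sessions
	if _, ok := tasks[id]; !ok {
		if poweredOn {
			// only this branch ever consults the exec task list
			tasks = cfg.Execs
		}
		// powered off / runtime nil: fall through without checking Execs
	}
	if t, ok := tasks[id]; ok {
		return t, nil
	}
	return nil, fmt.Errorf("unknown task ID: %s", id)
}

func main() {
	cfg := &ExecConfig{
		Sessions: map[string]*SessionConfig{},
		Execs:    map[string]*SessionConfig{"e1": {ID: "e1"}},
	}
	// even though "e1" is present in Execs, the lookup fails once the
	// poweroff sub-case is hit
	_, err := lookupTask(cfg, false, "e1")
	fmt.Println(err) // unknown task ID: e1
}
```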
related to #6370
To resolve this we will need to add more structured error returns to the exec paths. Here are some things that came up during investigation:
Per a long analysis with @hickeng:
THINGS TO DO:
1. Add a `TaskInspect` call to the container proxy and update it to use the same handle as everything else in Container Exec Start. The task inspection should go inside the retry with everything else, and the entire operation should be retried rather than just part of it: we need to obtain a fresh handle and perform the config additions again.
2. Make sure everything in start is moved inside the retry. If we get a concurrent modification error we must retry everything from scratch with a new handle (a rough sketch of this retry flow follows the list).
3. We need to change `TaskInspect` from the portlayer perspective to behave like this:
```go
func Inspect(op *trace.Operation, h interface{}, id string) (*executor.SessionConfig, error) {
	defer trace.End(trace.Begin(id))

	handle, ok := h.(*exec.Handle)
	if !ok {
		return nil, fmt.Errorf("type assertion failed for %#+v", h)
	}

	stasks := handle.ExecConfig.Sessions
	etasks := handle.ExecConfig.Execs

	op.Debugf("target task ID: %s", id)
	op.Debugf("session tasks during inspect: %s", stasks)
	op.Debugf("exec tasks during inspect: %s", etasks)

	if task, ok := stasks[id]; ok {
		return task, nil
	}

	if task, ok := etasks[id]; ok {
		return task, nil
	}

	// this error should actually be typed so that we can detect a 404 situation in the handlers
	return nil, fmt.Errorf("cannot find task %s", id)
}
```
4. IMPORTANT: if `TaskInspect` gives us a 404 we must interpret it properly in the personality: ExecTaskInspect --> 404 --> most likely the container was powered off. That should be the order of interpretation.
5. We need to handle concurrent modification errors properly in the exec path... We cannot just throw the entire packaged error into a `500` error string.
6. more to come... from notes on my desk...
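Not the real code, just a minimal, self-contained sketch of how items 1, 2, 4 and 5 above could fit together. All of the helper names (`getHandle`, `inspectTask`, `addStartConfig`, `commitHandle`) and error values here are hypothetical stand-ins, not the actual proxy/portlayer API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical typed errors; the real change would define these in the
// portlayer/proxy so callers can distinguish the cases instead of parsing
// a generic 500 string.
var (
	ErrTaskNotFound     = errors.New("task not found")
	ErrConcurrentModify = errors.New("concurrent modification of handle")
)

// stand-ins for the proxy operations discussed in the list above
type handle string

func getHandle(cid string) (handle, error)      { return handle(cid), nil }
func inspectTask(h handle, eid string) error    { return nil }
func addStartConfig(h handle, eid string) error { return nil }
func commitHandle(h handle) error               { return nil }

// startExecTask sketches items 1, 2 and 5: everything, including the task
// inspection, runs against one handle, and the whole sequence is retried
// from a fresh handle when the commit reports a concurrent modification.
func startExecTask(cid, eid string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		h, err := getHandle(cid)
		if err != nil {
			return err
		}
		if err := inspectTask(h, eid); err != nil {
			// item 4: a typed not-found here most likely means the
			// container was powered off; surface that, not a raw 500
			if errors.Is(err, ErrTaskNotFound) {
				return fmt.Errorf("container %s is not running: %w", cid, err)
			}
			return err
		}
		if err := addStartConfig(h, eid); err != nil {
			return err
		}
		if err := commitHandle(h); err != nil {
			if errors.Is(err, ErrConcurrentModify) {
				// retry the whole operation with a new handle
				continue
			}
			return err
		}
		return nil
	}
	return fmt.Errorf("exec start for %s failed after %d retries", eid, maxRetries)
}

func main() {
	fmt.Println(startExecTask("someContainerID", "someExecID", 3))
}
```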
What happened with this issue? I see the PR but I'm not sure why this went back to ToDo. Is this still expected in 22?
@mlh78750 it was determined that this was not included in 1.3. Since I have been working on other high-priority issues and that PR is not fully tested (and involves a lot of code overhaul), it has been tabled for the moment in order to get 1.3 squared away.
This problem has existed since 1.1 and is not a regression. The PR that is out against it is too risky at this point, so it is being removed from the release. If you try the exec again you will find that the container is off.
This ticket will be closed by PR #6787, as it has fixed many issues with exec against powered-off containers and with concurrent execs.
This issue ended up encompassing many more problems than it first appeared to. Each unique error involved several different problems being fixed on the exec path, and some of the errors came from different paths but yielded the same message. If you are interested in seeing what some of these were, please check out the above PR (warning: it is very large). As such this ticket will be closed, and #7410 has been made to track what is left after #6787 is merged.
Seen in build https://ci.vcna.io/vmware/vic/14606 in the 1-38-Docker-Exec suite.
Logs: Test-Cases.Group1-Docker-Commands.1-38-Docker-Exec-VCH-14606-7800-container-logs.zip