vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

1-38-Docker-Exec failure in Exec During PowerOff test #6744

Closed anchal-agrawal closed 6 years ago

anchal-agrawal commented 6 years ago

Seen in build https://ci.vcna.io/vmware/vic/14606 in the 1-38-Docker-Exec suite:

------------------------------------------------------------------------------
Exec During PowerOff                                                  | FAIL |
' Error response from daemon: Server error from portlayer: Cannot find task d313fb6d4b5b3021b8692ce78c44395d183e7e365bad4725889a9e96fc7ac74a
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Server error from portlayer: The operation is not allowed in the current state.
Error response from daemon: Server error from portlayer: Cannot find task 441203c3924ede3a6da11582b16b873f3bb80d508bc8af27ffcb46d6fbc0d348
Error response from daemon: Server error from portlayer: Cannot find task 20146b554a2e17500ecfe5cf0d697274030bc08eb5e5843b71a671aa47c8953b
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Server error from portlayer: Cannot find task 6834083aabf4f65385d4d6a719a7c25f3c84a7bca459973bf0cb6e31a9657961
Error response from daemon: Server error from portlayer: The operation is not allowed in the current state.
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
' does not contain 'Cannot complete the operation, container 2982075f4b15a8509b418f26e64ce204e7d5cba2da752e7f3883b5d1a694b043 has been powered off during execution'
------------------------------------------------------------------------------

Logs: Test-Cases.Group1-Docker-Commands.1-38-Docker-Exec-VCH-14606-7800-container-logs.zip

anchal-agrawal commented 6 years ago

Assigned to team/foundation for further triage.

cgtexmex commented 6 years ago

@matthewavery I'm looking to you to run with this...

matthewavery commented 6 years ago

@cgtexmex Alright I will triage and investigate today.

matthewavery commented 6 years ago

Note: these are some preliminary notes from my initial triage of this failure.

Analysis of Exec during poweroff failure

Container of Interest: d2c46697a1b36066acec660a3676cd69be9534bfa592f17713ea1846edb7ae78

Portlayer Investigation:

It looks like in the portlayer we are hitting the sub-case below our poweroff check. Because of this we are still not looking in etasks when perhaps we should be: either the runtime was nil, or the VM changed state right after we checked (which shouldn't really happen, to my knowledge). Since we never change `task` to point to etasks, we never actually check whether the etasks list has our ID.

matthewavery commented 6 years ago

related to #6370

matthewavery commented 6 years ago

To resolve this we will need to add more structured error returns to the exec paths. Here are some things the investigation turned up as needed:

matthewavery commented 6 years ago

per a long analysis with @hickeng :

THINGS TO DO:

1. Add a `TaskInspect` call to the container proxy and update the call to use the same handle as everything else in Container Exec Start. Task inspection should go in the retry with everything else, and the entire operation should be retried instead of just part of it. We need to seek a new handle and perform the config additions again.

2. Make sure everything in start is moved inside the retry. If we get a concurrent modification error then we must try everything from scratch with a new handle.

3. We need to change TaskInspect from the portlayer perspective to behave like this: 

```go
func Inspect(op *trace.Operation, h interface{}, id string) (*executor.SessionConfig, error) {
	defer trace.End(trace.Begin(id))

	handle, ok := h.(*exec.Handle)
	if !ok {
		// report the value we were actually given, not the zero-valued handle
		return nil, fmt.Errorf("Type assertion failed for %#+v", h)
	}

	stasks := handle.ExecConfig.Sessions
	etasks := handle.ExecConfig.Execs

	op.Debugf("target task ID: %s", id)
	op.Debugf("session tasks during inspect: %s", stasks)
	op.Debugf("exec tasks during inspect: %s", etasks)

	if _, ok := stasks[id]; ok {
		return stasks[id], nil
	}

	if _, ok := etasks[id]; ok {
		return etasks[id], nil
	}

	// this error should actually be typed, so that we can determine a 404 situation in the handlers
	return nil, fmt.Errorf("Cannot find task %s", id)
}
```

4. IMPORTANT: if `TaskInspect` gives us a 404 we must properly interpret this in the personality: `ExecTaskInspect` --> 404 --> likely the container was powered off. That should be the order of interpretation.

5. We need to handle concurrent modification errors properly in the exec path; we cannot just throw the entire packaged error into a `500` error string.

6. more to come... from notes on my desk... 

mlh78750 commented 6 years ago

What happened with this issue? I see the PR but I'm not sure why this went back to ToDo. Is this still expected in 22?

matthewavery commented 6 years ago

@mlh78750 it was determined that this was not included in 1.3. Since I have been working on other high-priority issues, and that PR is not fully tested (and involves a lot of code overhaul), it has been tabled for the moment in order to get 1.3 squared away.

mdubya66 commented 6 years ago

This problem has existed since 1.1 and is not a regression. The PR that is out against it is too risky at this point. Removing it from the release. Try the exec again and you will find the container is off.

matthewavery commented 6 years ago

#7410 has been created to follow some of the issues that we are still seeing.

This ticket will be closed by PR #6787, as it has fixed many issues with poweroff exec and concurrent execs.

This issue ended up encompassing many more problems than it initially appeared to. Each unique error involved several different problems being fixed on the exec path, and some of the errors came from different paths but yielded the same message. If you are interested in seeing what some of these were, please check out the above PR (warning: it is very large). As such, this ticket will be closed, and #7410 has been made to track what is left after #6787 is merged.
This issue ended up encompassing many many more issues that it appeared. As each unique error involved several different problems being fixed on the exec path, and some of the errors came from different paths but yielded the same error. If you are interested in seeing what some of these were please check out the above PR(warning! it is very large). As such this ticket will be close and #7410 has been made to track what is left after #6787 is merged.