vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

containerRemoveInternalServerError in 5-4-High-Availability nightly test #5198

Closed: chengwang86 closed this issue 7 years ago

chengwang86 commented 7 years ago

For bug reports, please include the information below:

VIC version:

Build 10681

Deployment details:

Run the 5-4-High-Availability test manually; the error occurs at this step: https://github.com/vmware/vic/blob/master/tests/manual-test-cases/Group5-Functional-Tests/5-4-High-Availability.robot#L201

Actual behavior:

docker -H 10.160.72.57:2375 ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
db8038f571f0        busybox             "ls"                5 minutes ago       Exited (0) 5 minutes ago                       tender_poitras
b8c0f179726e        busybox             "ls"                6 minutes ago       Exited (0) 6 minutes ago                       priceless_borg
2fd02be27c66        busybox             "ls"                7 minutes ago       Exited (0) 6 minutes ago                       gifted_meitner
e28263285d83        busybox             "sh"                8 minutes ago       Up 8 minutes                                   sad_visvesvaraya
f0270ca3b265        busybox             "sh"                8 minutes ago       Up 8 minutes                                   thirsty_shockley
7fce212a0a3f        busybox             "sh"                9 minutes ago       Up 9 minutes                                   jolly_kalam

docker -H 10.160.72.57:2375 rm -f 7f
Error response from daemon: Server error from portlayer: [DELETE /containers/{id}][500] containerRemoveInternalServerError

Expected behavior: docker rm -f should finish successfully.

Logs: The original logs from the nightly run where this error occurred are not available because we restarted the VCH. However, I was able to reproduce the error by running the test manually; see the attached log bundle. container-logs (1).zip

Note: This error does not appear in the same test on build 9992 or any earlier build. The test consistently fails starting from build 10634. No nightly tests were run with builds between 9992 and 10634.


chengwang86 commented 7 years ago

#1823 tracks all occurrences of this error in our CI and nightly tests. I created this ticket separately because it is reproducible, and I suspect it is related to our recent changes that might affect HA.

chengwang86 commented 7 years ago

Here is the PL log:

May 19 2017 19:43:43.506Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/store.Init:67] ds://nfsDatastore/vch-3
May 19 2017 19:43:43.506Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:104] apiKV
May 19 2017 19:43:43.506Z DEBUG op=259.1 (delta:8.061µs): [NewOperation] op=259.1 (delta:1.511µs) [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:115]
May 19 2017 19:43:43.530Z INFO  Creating directory [nfsDatastore] vch-3/kvStores
May 19 2017 19:43:43.548Z INFO  Datastore path is [nfsDatastore] vch-3/kvStores
May 19 2017 19:45:50.895Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:104] [2m7.388458551s] apiKV
May 19 2017 19:45:50.901Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/store.Init:67] [2m7.394549999s] ds://nfsDatastore/vch-3
May 19 2017 19:45:50.901Z FATAL could not initialize port layer: unable to create apiKV datastore backed store: 500 Internal Server Error

No more logs after the last line.

chengwang86 commented 7 years ago

After restarting the VCH, the portlayer initialized successfully. However, I got the same persona error when running docker rm -f, and now the PL log shows:

May 19 2017 20:09:21.863Z DEBUG [BEGIN] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).RemoveContainerHandler:236] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:21.863Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.(*Container).Remove:482] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:21.863Z DEBUG Setting container f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823 state: Removing
May 19 2017 20:09:21.863Z DEBUG get vm properties [config.files.vmPathName] of vm VirtualMachine:vm-326
May 19 2017 20:09:21.863Z DEBUG properties: [config.files.vmPathName summary.runtime.connectionState]
May 19 2017 20:09:22.533Z DEBUG unexpected soap fault on task retry : types.MethodDisabled{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}, Source:"VIC"}
May 19 2017 20:09:22.543Z DEBUG unexpected error on task retry : soap.soapFaultError{fault:(*soap.Fault)(0xc420600190)}
May 19 2017 20:09:22.549Z DEBUG Do not fix non invalid state error
May 19 2017 20:09:22.549Z DEBUG Setting container f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823 state: Stopped
May 19 2017 20:09:22.549Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.(*Container).Remove:482] [685.905147ms] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:22.549Z DEBUG [ END ] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).RemoveContainerHandler:236] [685.985508ms] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823

container-logs (2).zip
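
The Source:"VIC" on the MethodDisabled fault above suggests the Destroy method was disabled on the containerVM by VIC itself and was not re-enabled after the HA failover. A minimal govmomi sketch for confirming that from outside the VCH follows; it is an illustration only, and the vCenter URL, credentials, and containerVM name are placeholders, not values from this test bed:

package main

import (
	"context"
	"fmt"
	"log"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25/mo"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter endpoint and credentials.
	u, err := url.Parse("https://user:pass@vcenter.example.com/sdk")
	if err != nil {
		log.Fatal(err)
	}
	c, err := govmomi.NewClient(ctx, u, true)
	if err != nil {
		log.Fatal(err)
	}

	finder := find.NewFinder(c.Client, true)
	dc, err := finder.DefaultDatacenter(ctx)
	if err != nil {
		log.Fatal(err)
	}
	finder.SetDatacenter(dc)

	// Placeholder containerVM name; in this issue it would be the VM
	// backing container f0270ca3b265...
	vm, err := finder.VirtualMachine(ctx, "containerVM-name")
	if err != nil {
		log.Fatal(err)
	}

	// disabledMethod lists the vSphere API methods currently disabled on
	// this managed entity; Destroy_Task showing up here would explain the
	// MethodDisabled fault in the portlayer log.
	var moVM mo.VirtualMachine
	err = property.DefaultCollector(c.Client).RetrieveOne(
		ctx, vm.Reference(), []string{"disabledMethod"}, &moVM)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("disabled methods:", moVM.DisabledMethod)
}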

jzt commented 7 years ago

Seen in 5-4-High-Availability.zip

chengwang86 commented 7 years ago

5-4-High-Availability.zip from functional_logs_05_20_18_09_vic_10681

@fdawg4l @mhagen-vmware This is blocking us from debugging #4858

In the attached log bundle, docker rm -f container_id failed, and the container with that ID was a busybox container. Later, at the end of the regression test, when docker rmi busybox was run (the point where #4858 should occur), we observed Failed to remove image "busybox": image 82c7ed4295eafe827b1ab2915f345e40d95dff508cdcf4eb772a48e427838eed in use by b90e56be3a448b069f179697d47d307cfeb545922e058d4f97d47147b, which is different from the error message we expected in #4858

I believe something in our latest code changes caused this issue, since (1) it is reproducible (after HA) and (2) it never happened in builds up to and including 9992 (9992 is OK).

chengwang86 commented 7 years ago

From hostd log,

2017-05-21T02:44:43.764Z info hostd[12E97B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] Event 251 : Reconfigured hopeful_wozniak-a7f5750e4a7b on sc-rdops-vm05-dhcp-136-158.eng.vmware.com in ha-datacenter.
-->
--> Modified:
-->
-->  Added:
-->
-->  Deleted:
-->
-->
2017-05-21T02:44:43.764Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: disabledMethod, 3. Sent notification immediately.
2017-05-21T02:44:43.764Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: capability, 3. Sent notification immediately.
2017-05-21T02:44:43.765Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: summary, 3. Sent notification immediately.
2017-05-21T02:44:43.765Z info hostd[12E97B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] Send config update invoked
2017-05-21T02:44:43.803Z info hostd[12E97B70] [Originator@6876 sub=Libs opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] OBJLIB-LIB: ObjLib_GetNameSpaceObjectUniqueIdFromPath : failed backend does not store object unique id in path
2017-05-21T02:44:43.803Z warning hostd[12E97B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] File - failed to get objectId, '/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx': One of the parameters supplied is invalid.

Here a7f575... is the ID of the container on which docker rm -f was performed and for which the above error was thrown.

jzt commented 7 years ago

@chengwang86 Not sure if this is useful or just adding to the noise, but just in case: this is from the 5/19 nightly build. 5-4-High-Availability.zip

chengwang86 commented 7 years ago

@jzt Thanks. The one you posted is the same as the one I posted :)

chengwang86 commented 7 years ago

Update:

I tried running the test manually with build 10251 and it succeeded. In the hostd log, I found the same error messages as above, so the hostd log snippet above might not be related to the failure.

jzt commented 7 years ago

D'OH. I'm making my way through this weekend's logs; if I find any new ones I'll let you know, although I suspect they won't show anything new or useful.

chengwang86 commented 7 years ago

Update:

I did a binary search over the previous builds. It turns out that build 10290 is the first build in which this error occurs, which points to PR https://github.com/vmware/vic/pull/4965.

@dougm Any thoughts?

dougm commented 7 years ago

@chengwang86 I haven't seen a MethodDisabled error when using the API. We may just need to call AuthorizationManager.EnableMethods before destroying the VM. Can you try that?
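
A minimal sketch of what that might look like, assuming govmomi's object.AuthorizationManager exposes an EnableMethods helper with this shape; the method name "Destroy_Task" and the "VIC" source ID are assumptions based on the Source:"VIC" fault earlier in this thread, not values lifted from the VIC code:

package main

import (
	"context"
	"log"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// removeContainerVM re-enables the previously disabled Destroy method on a
// containerVM and then destroys it. Calling Destroy while the method is
// still disabled is what produces the MethodDisabled fault seen in the
// portlayer log.
func removeContainerVM(ctx context.Context, c *vim25.Client, vm *object.VirtualMachine) error {
	authMgr := object.NewAuthorizationManager(c)

	// Assumed method name and source ID: VIC disables Destroy_Task on its
	// containerVMs using the source "VIC" (per the fault's Source field).
	err := authMgr.EnableMethods(ctx,
		[]types.ManagedObjectReference{vm.Reference()},
		[]string{"Destroy_Task"},
		"VIC")
	if err != nil {
		return err
	}

	task, err := vm.Destroy(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}

func main() {
	// Client setup and containerVM lookup are omitted here; see the
	// property-collector sketch earlier in this thread for one way to
	// obtain c and vm before calling removeContainerVM.
	log.Println("wire up a vim25.Client and VirtualMachine, then call removeContainerVM")
}

If VIC disables more than just Destroy_Task on the remove path, the same call would need to list each of those methods; the change discussed later in this thread (https://github.com/vmware/vic/pull/5219) is the authoritative place to look.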

chengwang86 commented 7 years ago

Update:

Calling AuthorizationManager.EnableMethods before removing the containerVM can fix the issue, although we still don't understand why this is needed after HA.

chengwang86 commented 7 years ago

More explanations: https://github.com/vmware/vic/pull/5219#issuecomment-303453228

chengwang86 commented 7 years ago

Seen again in the nightly test functional_logs_05_23_18_09_vic_10681. Logs attached here: 5-4-High-Availability.zip

sgairo commented 7 years ago

When Cheng previously closed this, the fix was in place but had not yet made its way into the nightly cycle, so the test showed up as a failure the next day. After the fix made it into the nightly rotation, this bug no longer appeared. Closing unless seen again.