Status: Closed (closed by chengwang86 7 years ago)
Here is the PL log:
May 19 2017 19:43:43.506Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/store.Init:67] ds://nfsDatastore/vch-3
May 19 2017 19:43:43.506Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:104] apiKV
May 19 2017 19:43:43.506Z DEBUG op=259.1 (delta:8.061µs): [NewOperation] op=259.1 (delta:1.511µs) [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:115]
May 19 2017 19:43:43.530Z INFO Creating directory [nfsDatastore] vch-3/kvStores
May 19 2017 19:43:43.548Z INFO Datastore path is [nfsDatastore] vch-3/kvStores
May 19 2017 19:45:50.895Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/store.NewDatastoreKeyValue:104] [2m7.388458551s] apiKV
May 19 2017 19:45:50.901Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/store.Init:67] [2m7.394549999s] ds://nfsDatastore/vch-3
May 19 2017 19:45:50.901Z FATAL could not initialize port layer: unable to create apiKV datastore backed store: 500 Internal Server Error
No more logs after the last line.
After restarting the VCH, the portlayer got initialized. However, I got the same persona error when performing `docker rm -f`. But the PL log error is:
May 19 2017 20:09:21.863Z DEBUG [BEGIN] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).RemoveContainerHandler:236] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:21.863Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.(*Container).Remove:482] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:21.863Z DEBUG Setting container f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823 state: Removing
May 19 2017 20:09:21.863Z DEBUG get vm properties [config.files.vmPathName] of vm VirtualMachine:vm-326
May 19 2017 20:09:21.863Z DEBUG properties: [config.files.vmPathName summary.runtime.connectionState]
May 19 2017 20:09:22.533Z DEBUG unexpected soap fault on task retry : types.MethodDisabled{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}, Source:"VIC"}
May 19 2017 20:09:22.543Z DEBUG unexpected error on task retry : soap.soapFaultError{fault:(*soap.Fault)(0xc420600190)}
May 19 2017 20:09:22.549Z DEBUG Do not fix non invalid state error
May 19 2017 20:09:22.549Z DEBUG Setting container f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823 state: Stopped
May 19 2017 20:09:22.549Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/exec.(*Container).Remove:482] [685.905147ms] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
May 19 2017 20:09:22.549Z DEBUG [ END ] [github.com/vmware/vic/lib/apiservers/portlayer/restapi/handlers.(*ContainersHandlersImpl).RemoveContainerHandler:236] [685.985508ms] f0270ca3b265e57c1c62db4246f88742b82117457faed642638c00e5edcd0823
Seen in 5-4-High-Availability.zip from functional_logs_05_20_18_09_vic_10681.
@fdawg4l @mhagen-vmware This is blocking us debugging #4858
In the attached log bundle, `docker rm -f container_id` failed, and the container with `container_id` was a `busybox` container. Later, at the end of the regression test when `docker rmi busybox` was performed (where #4858 should occur), we observed `Failed to remove image "busybox": image 82c7ed4295eafe827b1ab2915f345e40d95dff508cdcf4eb772a48e427838eed in use by b90e56be3a448b069f179697d47d307cfeb545922e058d4f97d47147b`, which is different from the error message we expected in #4858.
I believe a recent code change caused this issue, since (1) it is reproducible (after HA) and (2) it never happened in builds prior to 9992 (build 9992 itself is OK).
From the hostd log:
2017-05-21T02:44:43.764Z info hostd[12E97B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] Event 251 : Reconfigured hopeful_wozniak-a7f5750e4a7b on sc-rdops-vm05-dhcp-136-158.eng.vmware.com in ha-datacenter.
-->
--> Modified:
-->
--> Added:
-->
--> Deleted:
-->
-->
2017-05-21T02:44:43.764Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: disabledMethod, 3. Sent notification immediately.
2017-05-21T02:44:43.764Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: capability, 3. Sent notification immediately.
2017-05-21T02:44:43.765Z verbose hostd[12E97B70] [Originator@6876 sub=PropertyProvider opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] RecordOp ASSIGN: summary, 3. Sent notification immediately.
2017-05-21T02:44:43.765Z info hostd[12E97B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] Send config update invoked
2017-05-21T02:44:43.803Z info hostd[12E97B70] [Originator@6876 sub=Libs opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] OBJLIB-LIB: ObjLib_GetNameSpaceObjectUniqueIdFromPath : failed backend does not store object unique id in path
2017-05-21T02:44:43.803Z warning hostd[12E97B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx opID=1fc7986e-28-c0c6 user=vpxuser:VSPHERE.LOCAL\Administrator] File - failed to get objectId, '/vmfs/volumes/05f0e78b-41d55e6a/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26/a7f5750e4a7bbd24957d60f0db233d23e9c5a0630ed1a51ac8cb8833f67c5e26.vmx': One of the parameters supplied is invalid.
Here `a7f575...` is the id of the container on which `docker rm -f` was performed when the above error was thrown.
@chengwang86 Not sure if this is useful or just adding to the noise, but just in case: This is from 5/19 nightly build. 5-4-High-Availability.zip
@jzt Thanks. The one you posted is the same as the one I posted :)
Update:
I tried running the test manually using build 10251 and it succeeded. In the hostd log, I found the same error messages as above, so the hostd log snippet above might not be related to the failure.
D'oh. I'm making my way through this weekend's logs; if I find any new ones I'll let you know, although I suspect they won't show anything new or useful.
Update:
I did a binary search on the previous builds. It turns out that build 10290 is the first build in which this error occurs, which points to PR https://github.com/vmware/vic/pull/4965.
@dougm Any thoughts?
@chengwang86 I haven't seen a MethodDisabled error when using the API. We may just need to call AuthorizationManager.EnableMethods before destroying the VM. Can you try that?
Update:
Calling `AuthorizationManager.EnableMethods` before removing the containerVM fixes the issue, although we still don't understand why this is needed after HA.
More explanations: https://github.com/vmware/vic/pull/5219#issuecomment-303453228
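The fix described above can be sketched as follows. This is a minimal stdlib-only illustration of the idea, with stub types (`fakeVM`, `enableMethods`, `removeContainerVM` are hypothetical names) standing in for govmomi's `AuthorizationManager` and the containerVM; it is not VIC's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

var errMethodDisabled = errors.New("MethodDisabled fault")

// fakeVM stands in for a containerVM whose methods VIC has disabled
// via AuthorizationManager.DisableMethods (source "VIC").
type fakeVM struct {
	disabledMethods map[string]bool
}

func (vm *fakeVM) Destroy() error {
	if vm.disabledMethods["Destroy_Task"] {
		return errMethodDisabled
	}
	return nil
}

// enableMethods mimics AuthorizationManager.EnableMethods: it clears the
// per-VM method blacklist so the subsequent Destroy_Task is allowed.
func enableMethods(vm *fakeVM, methods []string) {
	for _, m := range methods {
		delete(vm.disabledMethods, m)
	}
}

// removeContainerVM re-enables Destroy_Task before destroying the VM.
// After an HA failover the disabled-method list can persist on the
// restarted host, so Destroy would otherwise fail with MethodDisabled.
func removeContainerVM(vm *fakeVM) error {
	enableMethods(vm, []string{"Destroy_Task"})
	return vm.Destroy()
}

func main() {
	vm := &fakeVM{disabledMethods: map[string]bool{"Destroy_Task": true}}
	fmt.Println(vm.Destroy())          // MethodDisabled fault
	fmt.Println(removeContainerVM(vm)) // <nil>
}
```

The design point is simply ordering: re-enable the disabled method first, then destroy, rather than retrying a call that is permanently blocked.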
Seen again in the nightly test functional_logs_05_23_18_09_vic_10681. Logs attached here 5-4-High-Availability.zip
Previously, when Cheng closed this, the issue was fixed, but the fix hadn't yet made its way into the nightlies cycle, so it showed up as a failure the next day. After the fix made it into the nightly rotation, this bug did not appear again. Closing unless seen again.
For bug reports, please include the information below:
VIC version:
Build 10681
Deployment details:
Run the 5-4-High-Availability test manually. The error occurs at this step: https://github.com/vmware/vic/blob/master/tests/manual-test-cases/Group5-Functional-Tests/5-4-High-Availability.robot#L201
Actual behavior:
`docker rm -f` fails after an HA failover.
Expected behavior:
`docker rm -f` should finish successfully.
Logs:
The original logs for the operations during which this error occurred cannot be found in our nightly test, since we restarted the VCH. However, I was able to reproduce the error by running the test manually; see the attached log bundle: container-logs (1).zip
Note: This error does not appear in the same test under build 9992 or earlier builds. It consistently fails starting from build 10634. There are no nightly test runs using builds between 9992 and 10634.