vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic
Other
640 stars 173 forks source link

Failure in docker kill test: signal container with default signal #1946

Closed mhagen-vmware closed 8 years ago

mhagen-vmware commented 8 years ago

Build 3922 failed with the following issue:

time="2016-08-12T15:53:27Z" level=debug msg="[BEGIN] [github.com/vmware/vic/lib/apiservers/engine/backends.(*Container).ContainerKill] 00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113" 
time="2016-08-12T15:53:27Z" level=debug msg="[ END ] [github.com/vmware/vic/lib/apiservers/engine/backends.(*Container).ContainerKill] [331.580167ms] 00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113" 
time="2016-08-12T15:53:27Z" level=error msg="Handler for POST /v1.21/containers/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/kill returned error: Cannot kill container 00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113: [POST /containers/{id}/signal][500] containerSignalInternalServerError  &{Code:<nil> Message:ServerFaultCode: A general system error occurred: vix error codes = (3016, 0).\n}" 

integration_logs_3922_57a4eb202703eeb7bb9e386dc590256b22fa1ee9_1471018933.zip

mdubya66 commented 8 years ago

Adding @hmahmood as @dougm is out this week and we can't go another week of flakey CI

caglar10ur commented 8 years ago

3016 according to https://www.vmware.com/support/developer/vix-api/vix17_reference/errors/errors.html VIX_E_TOOLS_NOT_RUNNING Guest tools is not running.

mdubya66 commented 8 years ago

It's also sporadic

dougm commented 8 years ago

Likely another case where we need more time in between start and kill (or stop). Either a sleep 1 or poll for tools state with govc vm.infohttp://vm.info -json, we do similar in .robot checking for power state. Note that tether start waits until toolbox is initialized before start returns, but may not mean the vmx state considers tools running quite yet.

Sent from my iPhone

On Aug 12, 2016, at 9:40 AM, Mikkel Hagen notifications@github.com<mailto:notifications@github.com> wrote:

Assigned #1946https://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_vmware_vic_issues_1946&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=KCJOWd0_2WCDwy3W6tntBhOQzG6JaZo3Lly-6lDiMKc&e= to @dougmhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_dougm&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=sNmmk-HPdAKU2QFpU_oO_P3NKeYzNesblIwJOQKv_xk&e=.

You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vmware_vic_issues_1946-23event-2D754234848&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=OUDIPpCJd5A3tocdxMdYhZ4KEQiTK25Y0gP0Mche47o&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAB12yJ7vI6Kpb58wObPJ5s1SsG0lzUwks5qfKHmgaJpZM4JjQVa&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=K0SZY7OCAf4kNjBTzLgHUWp1Sb5STucPU4VuyK92s9c&e=.

caglar10ur commented 8 years ago

Yep, that's also what I suspected but unfortunately clocks between host/guest and drone was out sync so it's hard to make sure about ordering of events. Looking at the other logs with the hope of spotting something, at worst case we can sleep/poll

caglar10ur commented 8 years ago
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:27.924Z| vmx| I120: Vix: [3038315 mainDispatch.c:1082]: VMAutomation_PowerOn. Powering on.
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1872, success=1 additionalError=0
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=0, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:36.110Z| vcpu-1| I120: Vix: [3038323 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:37.332Z| vmx| I120: Vix: [3038315 mainDispatch.c:1188]: VMAutomationPowerOff: Powering off.
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.865Z| vmx| I120: Vix: [3038463 mainDispatch.c:1082]: VMAutomation_PowerOn. Powering on.
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1872, success=1 additionalError=0
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=0, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:47.908Z| vcpu-1| I120: Vix: [3038471 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:50.990Z| vmx| I120: Vix: [3038463 mainDispatch.c:1188]: VMAutomationPowerOff: Powering off.
caglar10ur commented 8 years ago

@dougm still reading the code but can't we use the ping trick that you used in tests to make sure that we at least receive one ping request before returning from start? I think I'll try that - unless pings are not issues by esxi

dougm commented 8 years ago

Tether waits past when the first ping happens, waits for the PowerOn callback. Not sure if we should have tether start wait even longer. You can see all rpc traffic with toolbox.Trace = true. Could also have PL kill wait for tools running state if needed. But probably simpler to just do so in the test for now, to see if that is indeed the issue.

Sent from my iPhone

On Aug 12, 2016, at 10:51 AM, S.?a?lar Onur notifications@github.com<mailto:notifications@github.com> wrote:

@dougmhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dougm&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=3cepcBPeLcdTjZClTKPQqizu2vOyPzeMKuVoZYFePd0&e= still reading the code but can't we use the ping trick that you used in tests to make sure that we at least receive one ping request before returning from start? I think I'll try that - unless pings are not issues by esxi

You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vmware_vic_issues_1946-23issuecomment-2D239514745&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=QFwbNP40vSYACojtL4p2cG9Kd_Kmkxh7X_OOUAEbfJE&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAB1269yiI5vprqQ-2DNyatmCcUjWfclrQks5qfLKpgaJpZM4JjQVa&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=WFhtOrGeQuSdjYUutOMMAXOWO8rsPvoIf4TxXHbtZEI&e=.

caglar10ur commented 8 years ago
2255 2016-08-12T15:52:36.051Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2256 2016-08-12T15:52:36.051Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2257 2016-08-12T15:52:36.053Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2258 2016-08-12T15:52:36.053Z| vcpu-1| I120: TOOLS setting legacy tools version to '2147483647', manifest status is 8
2259 2016-08-12T15:52:36.053Z| vcpu-1| I120: ToolsSetVersionWork: Updating the Tools Version in the disk
2260 2016-08-12T15:52:36.096Z| vcpu-1| I120: ToolsSetVersionWork: Done updating the Tools Version in the disk
2261 2016-08-12T15:52:36.096Z| vcpu-1| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'unmanaged', 'unmanaged', install possible
2262 2016-08-12T15:52:36.110Z| vcpu-1| I120: TOOLS state change 3 returned status 1
2263 2016-08-12T15:52:36.110Z| vcpu-1| I120: Vix: [3038323 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
2264 2016-08-12T15:52:36.225Z| vmx| A115: ConfigDB: Setting extendedConfigFile = "00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.vmxf"
2265 2016-08-12T15:52:36.291Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2266 2016-08-12T15:52:36.648Z| vcpu-1| A115: ConfigDB: Setting guestinfo..sessions|00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.started = "true"
2267 2016-08-12T15:52:36.863Z| vmx| I120: VigorTransportProcessClientPayload: opID=90618614 seq=634: Receiving GuestOps.StartProgram request.
2268 2016-08-12T15:52:36.998Z| vcpu-1| A115: ConfigDB: Setting guestinfo..sessions|00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.status = "-1"
2269 2016-08-12T15:52:37.108Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2270 2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.
2271 2016-08-12T15:52:37.108Z| vcpu-1| I120: VigorTransport_ServerSendResponse opID=90618614 seq=634: Completed GuestOps request.
2272 2016-08-12T15:52:37.224Z| vcpu-0| I120: PIIX4: PM Soft Off.  Good-bye.
2273 2016-08-12T15:52:37.224Z| vcpu-0| I120: Chipset: Issuing power-off request...

another snippet. This shows that we register just fine. Following events are interesting though

2269 2016-08-12T15:52:37.108Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2270 2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.

cough internal code cough shows that this is happening because of a VMX reset, VMX is flushing all pending rpc requests and signals the clients to send them again - if needed.

caglar10ur commented 8 years ago

I keep running the tests, this time it failed with

Installer completed successfully: VCH-3950-8217...
Signal a container with default kill signal                           | PASS |
------------------------------------------------------------------------------
Signal a container with SIGHUP                                        | FAIL |
Keyword 'Assert Kill Signal' failed after retrying 5 times. The last error was: '' does not contain 'KillSignalHUP'
------------------------------------------------------------------------------
Signal a non-existent container                                       | PASS |
------------------------------------------------------------------------------
Gathering logs from the test server...
Deleting the VCH appliance...
1-14-Docker-Kill :: Test 1-14 - Docker Kill                           | FAIL |
3 critical tests, 2 passed, 1 failed
3 tests total, 2 passed, 1 failed
==============================================================================
hmahmood commented 8 years ago

@caglar10ur did you take a look at the logs?

caglar10ur commented 8 years ago

@hmahmood https://console.cloud.google.com/m/cloudstorage/b/vic-ci-logs/o/integration_logs_3950_646c4317f3d7dad2776f8af8e6fc3598be3244bb_1471031058.tar

hmahmood commented 8 years ago

From the logs it looks like we gave up too soon. Maybe 5x is not enough?

caglar10ur commented 8 years ago

Another failure, this time ESXi shows a protocol error caused by toolbox's reply

https://console.cloud.google.com/m/cloudstorage/b/vic-ci-logs/o/integration_logs_4003_d8aeeabe4215bb364546b7e39c13766670f6bbe2_1471283409.tar

2016-08-15T17:48:25.526Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2016-08-15T17:48:25.526Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2016-08-15T17:48:25.529Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2016-08-15T17:48:25.529Z| vcpu-1| I120: TOOLS setting legacy tools version to '2147483647', manifest status is 8
2016-08-15T17:48:25.530Z| vcpu-1| I120: ToolsSetVersionWork: Updating the Tools Version in the disk
2016-08-15T17:48:25.651Z| vcpu-1| I120: ToolsSetVersionWork: Done updating the Tools Version in the disk
2016-08-15T17:48:25.651Z| vcpu-1| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'unmanaged', 'unmanaged', install possible
2016-08-15T17:48:25.660Z| vcpu-1| I120: TOOLS state change 3 returned status 1
2016-08-15T17:48:25.660Z| vcpu-1| I120: Vix: [3078056 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
2016-08-15T17:48:25.830Z| vmx| A115: ConfigDB: Setting extendedConfigFile = "3b89711432c0e312a4e6645390170e08c9d51b7e8b054e342c5e14913cc71c4c.vmxf"
2016-08-15T17:48:25.858Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Channel 0, protocol error: expected an OK or ERROR reply message from the guest application toolbox
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2016-08-15T17:48:55.603Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2016-08-15T17:48:55.603Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2016-08-15T17:48:55.603Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2016-08-15T17:48:55.603Z| vcpu-1| I120: ToolsSetVersionWork did nothing; new tools version (2147483647) matches old Tools version
2016-08-15T17:48:55.603Z| vcpu-1| I120: Vix: [3078056 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
hmahmood commented 8 years ago

@caglar10ur is that just a timeout?

caglar10ur commented 8 years ago

@hmahmood not sure, this time it looks like start failed. We don't have all the logs (like debug) and vmware.log shows toolbox timeouts but it manages to register itself so that means tether was also running.

hmahmood commented 8 years ago

Sorry I should have put this in my previous comment. I asked because the timestamps around the error differ by ~30s:

2016-08-15T17:48:25.858Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Channel 0, protocol error: expected an OK or ERROR reply message from the guest application toolbox
dougm commented 8 years ago

In trying to reproduce the kill test failures, I ran into #2327

dougm commented 8 years ago

With the kill tests enabled, I am able to sporadically reproduce the issue with #2327 where tether fails to initialize due to the eth0 having not yet presented its PCI path. I have a band-aid in place to retry the PCI patch and continuing to hammer away to see if these tests uncover any other issues.