Closed mhagen-vmware closed 8 years ago
Adding @hmahmood as @dougm is out this week and we can't go another week of flakey CI
3016 according to https://www.vmware.com/support/developer/vix-api/vix17_reference/errors/errors.html VIX_E_TOOLS_NOT_RUNNING Guest tools is not running.
It's also sporadic
Likely another case where we need more time in between start and kill (or stop). Either a sleep 1 or poll for tools state with govc vm.infohttp://vm.info -json, we do similar in .robot checking for power state. Note that tether start waits until toolbox is initialized before start returns, but may not mean the vmx state considers tools running quite yet.
Sent from my iPhone
On Aug 12, 2016, at 9:40 AM, Mikkel Hagen notifications@github.com<mailto:notifications@github.com> wrote:
Assigned #1946https://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_vmware_vic_issues_1946&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=KCJOWd0_2WCDwy3W6tntBhOQzG6JaZo3Lly-6lDiMKc&e= to @dougmhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_dougm&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=sNmmk-HPdAKU2QFpU_oO_P3NKeYzNesblIwJOQKv_xk&e=.
You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vmware_vic_issues_1946-23event-2D754234848&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=OUDIPpCJd5A3tocdxMdYhZ4KEQiTK25Y0gP0Mche47o&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAB12yJ7vI6Kpb58wObPJ5s1SsG0lzUwks5qfKHmgaJpZM4JjQVa&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=ngcxGMoOE3j4fNiES7HgVXY0PFY3dy0ijQpYsCa-tnI&s=K0SZY7OCAf4kNjBTzLgHUWp1Sb5STucPU4VuyK92s9c&e=.
Yep, that's also what I suspected but unfortunately clocks between host/guest and drone was out sync so it's hard to make sure about ordering of events. Looking at the other logs with the hope of spotting something, at worst case we can sleep/poll
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:27.924Z| vmx| I120: Vix: [3038315 mainDispatch.c:1082]: VMAutomation_PowerOn. Powering on.
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1872, success=1 additionalError=0
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=0, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:28.045Z| vcpu-0| I120: Vix: [3038322 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:36.110Z| vcpu-1| I120: Vix: [3038323 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.
./VCH-3922-9624-container-logs/00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113/vmware.log:2016-08-12T15:52:37.332Z| vmx| I120: Vix: [3038315 mainDispatch.c:1188]: VMAutomationPowerOff: Powering off.
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.865Z| vmx| I120: Vix: [3038463 mainDispatch.c:1082]: VMAutomation_PowerOn. Powering on.
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1872, success=1 additionalError=0
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=0, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:39.928Z| vcpu-0| I120: Vix: [3038470 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:47.908Z| vcpu-1| I120: Vix: [3038471 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
./VCH-3922-9624-container-logs/42bc9117fb161a6d85cd548d58ef216f40827891941cb54453b6b54f8ce19e6a/vmware.log:2016-08-12T15:52:50.990Z| vmx| I120: Vix: [3038463 mainDispatch.c:1188]: VMAutomationPowerOff: Powering off.
@dougm still reading the code but can't we use the ping trick that you used in tests to make sure that we at least receive one ping request before returning from start? I think I'll try that - unless pings are not issues by esxi
Tether waits past when the first ping happens, waits for the PowerOn callback. Not sure if we should have tether start wait even longer. You can see all rpc traffic with toolbox.Trace = true. Could also have PL kill wait for tools running state if needed. But probably simpler to just do so in the test for now, to see if that is indeed the issue.
Sent from my iPhone
On Aug 12, 2016, at 10:51 AM, S.?a?lar Onur notifications@github.com<mailto:notifications@github.com> wrote:
@dougmhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dougm&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=3cepcBPeLcdTjZClTKPQqizu2vOyPzeMKuVoZYFePd0&e= still reading the code but can't we use the ping trick that you used in tests to make sure that we at least receive one ping request before returning from start? I think I'll try that - unless pings are not issues by esxi
You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vmware_vic_issues_1946-23issuecomment-2D239514745&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=QFwbNP40vSYACojtL4p2cG9Kd_Kmkxh7X_OOUAEbfJE&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAB1269yiI5vprqQ-2DNyatmCcUjWfclrQks5qfLKpgaJpZM4JjQVa&d=CwMCaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=ZCsmHgNnc_bAmUHuR5kzTw&m=axkhUU9yFQuXUwAqfOwmsQkAETsIHVcN-LtcU-j4nt4&s=WFhtOrGeQuSdjYUutOMMAXOWO8rsPvoIf4TxXHbtZEI&e=.
2255 2016-08-12T15:52:36.051Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2256 2016-08-12T15:52:36.051Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2257 2016-08-12T15:52:36.053Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2258 2016-08-12T15:52:36.053Z| vcpu-1| I120: TOOLS setting legacy tools version to '2147483647', manifest status is 8
2259 2016-08-12T15:52:36.053Z| vcpu-1| I120: ToolsSetVersionWork: Updating the Tools Version in the disk
2260 2016-08-12T15:52:36.096Z| vcpu-1| I120: ToolsSetVersionWork: Done updating the Tools Version in the disk
2261 2016-08-12T15:52:36.096Z| vcpu-1| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'unmanaged', 'unmanaged', install possible
2262 2016-08-12T15:52:36.110Z| vcpu-1| I120: TOOLS state change 3 returned status 1
2263 2016-08-12T15:52:36.110Z| vcpu-1| I120: Vix: [3038323 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
2264 2016-08-12T15:52:36.225Z| vmx| A115: ConfigDB: Setting extendedConfigFile = "00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.vmxf"
2265 2016-08-12T15:52:36.291Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2266 2016-08-12T15:52:36.648Z| vcpu-1| A115: ConfigDB: Setting guestinfo..sessions|00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.started = "true"
2267 2016-08-12T15:52:36.863Z| vmx| I120: VigorTransportProcessClientPayload: opID=90618614 seq=634: Receiving GuestOps.StartProgram request.
2268 2016-08-12T15:52:36.998Z| vcpu-1| A115: ConfigDB: Setting guestinfo..sessions|00df61b827972916bda8adbc9eb698597289a16644bd1c515630994253214113.status = "-1"
2269 2016-08-12T15:52:37.108Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2270 2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.
2271 2016-08-12T15:52:37.108Z| vcpu-1| I120: VigorTransport_ServerSendResponse opID=90618614 seq=634: Completed GuestOps request.
2272 2016-08-12T15:52:37.224Z| vcpu-0| I120: PIIX4: PM Soft Off. Good-bye.
2273 2016-08-12T15:52:37.224Z| vcpu-0| I120: Chipset: Issuing power-off request...
another snippet. This shows that we register just fine. Following events are interesting though
2269 2016-08-12T15:52:37.108Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2270 2016-08-12T15:52:37.108Z| vcpu-1| I120: Vix: [3038323 guestCommands.c:1985]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): A guest cmd cannot be sent because the connection to the guest was closed; giving up, op=185.
cough internal code cough shows that this is happening because of a VMX reset, VMX is flushing all pending rpc requests and signals the clients to send them again - if needed.
I keep running the tests, this time it failed with
Installer completed successfully: VCH-3950-8217...
Signal a container with default kill signal | PASS |
------------------------------------------------------------------------------
Signal a container with SIGHUP | FAIL |
Keyword 'Assert Kill Signal' failed after retrying 5 times. The last error was: '' does not contain 'KillSignalHUP'
------------------------------------------------------------------------------
Signal a non-existent container | PASS |
------------------------------------------------------------------------------
Gathering logs from the test server...
Deleting the VCH appliance...
1-14-Docker-Kill :: Test 1-14 - Docker Kill | FAIL |
3 critical tests, 2 passed, 1 failed
3 tests total, 2 passed, 1 failed
==============================================================================
@caglar10ur did you take a look at the logs?
From the logs it looks like we gave up too soon. Maybe 5x is not enough?
Another failure, this time ESXi shows a protocol error caused by toolbox's reply
2016-08-15T17:48:25.526Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2016-08-15T17:48:25.526Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2016-08-15T17:48:25.529Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2016-08-15T17:48:25.529Z| vcpu-1| I120: TOOLS setting legacy tools version to '2147483647', manifest status is 8
2016-08-15T17:48:25.530Z| vcpu-1| I120: ToolsSetVersionWork: Updating the Tools Version in the disk
2016-08-15T17:48:25.651Z| vcpu-1| I120: ToolsSetVersionWork: Done updating the Tools Version in the disk
2016-08-15T17:48:25.651Z| vcpu-1| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'unmanaged', 'unmanaged', install possible
2016-08-15T17:48:25.660Z| vcpu-1| I120: TOOLS state change 3 returned status 1
2016-08-15T17:48:25.660Z| vcpu-1| I120: Vix: [3078056 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
2016-08-15T17:48:25.830Z| vmx| A115: ConfigDB: Setting extendedConfigFile = "3b89711432c0e312a4e6645390170e08c9d51b7e8b054e342c5e14913cc71c4c.vmxf"
2016-08-15T17:48:25.858Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Channel 0, protocol error: expected an OK or ERROR reply message from the guest application toolbox
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2016-08-15T17:48:55.603Z| vcpu-1| W110: GuestRpc: application toolbox, changing channel 65535 -> 0
2016-08-15T17:48:55.603Z| vcpu-1| I120: GuestRpc: Channel 0, guest application toolbox.
2016-08-15T17:48:55.603Z| vcpu-1| I120: TOOLS Received tools.set.version rpc call, version = 2147483647.
2016-08-15T17:48:55.603Z| vcpu-1| I120: ToolsSetVersionWork did nothing; new tools version (2147483647) matches old Tools version
2016-08-15T17:48:55.603Z| vcpu-1| I120: Vix: [3078056 mainDispatch.c:4210]: VMAutomationReportPowerStateChange: Reporting power state change (opcode=2, err=0).
@caglar10ur is that just a timeout?
@hmahmood not sure, this time it looks like start failed. We don't have all the logs (like debug) and vmware.log shows toolbox timeouts but it manages to register itself so that means tether was also running.
Sorry I should have put this in my previous comment. I asked because the timestamps around the error differ by ~30s:
2016-08-15T17:48:25.858Z| vcpu-1| I120: VMXNET3 user: Ethernet0 Driver Info: version = 17040640 gosBits = 2 gosType = 1, gosVer = 0, gosMisc = 0
2016-08-15T17:48:55.591Z| vcpu-1| I120: GuestRpc: Channel 0, protocol error: expected an OK or ERROR reply message from the guest application toolbox
In trying to reproduce the kill test failures, I ran into #2327
With the kill tests enabled, I am able to sporadically reproduce the issue with #2327 where tether fails to initialize due to the eth0 having not yet presented its PCI path. I have a band-aid in place to retry the PCI patch and continuing to hammer away to see if these tests uncover any other issues.
Build 3922 failed with the following issue:
integration_logs_3922_57a4eb202703eeb7bb9e386dc590256b22fa1ee9_1471018933.zip