vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

Concurrent Exec tests intermittently fail with various errors #7410

Open matthewavery opened 6 years ago

matthewavery commented 6 years ago

After PR #6787, exec has become more stable in the concurrent and container shutdown paths. However, it still suffers from some race-condition-style failures. These have mainly been seen in the CI environment; the tests have passed locally (run consecutively) against a high-resource, complex VSAN deployment. Below is a short catalog of the failures that have been seen to cause intermittent failures:

Simple Concurrent Exec:

Exec During Poweroff Of A Container Performing A Long Running Task:

Exec During Poweroff Of A Container Performing A Short Running Task:

REFERENCE LOGS:

Exec-CI-Failure-1-16518.zip Exec-CI-Failure-2-16515.zip Exec-Failure-3-FROM-FULL-CI.html.zip Exec-Failure-3-FROM-FULL-CI-16510.zip

Currently using the following for basic concurrency testing:

c=1; date; id=$(docker ps -q | awk '{print $1}'); for i in $(seq 1 $c); do docker exec "$id" /bin/echo /tmp/$i & done; for i in $(seq 1 $c); do wait %$i 2>/dev/null; done; date

TODO:

mdubya66 commented 6 years ago

Making this P2 for now. If it appears more frequently we can raise it again.

mhagen-vmware commented 6 years ago

We will not see this fail because the tests are currently disabled: https://github.com/vmware/vic/blob/master/tests/test-cases/Group1-Docker-Commands/1-38-Docker-Exec.robot

hickeng commented 6 years ago

Now blocking a product go-live.

My suggestion for addressing the collisions is to add operation batching: the first vSphere operation is dispatched, and if it is slow the others batch up behind it. The next operation issued is then the composite of the blocked operations.
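
For illustration only, a minimal sketch of that batching idea in Go (hypothetical names, not the actual vic implementation): the first caller dispatches immediately, callers that arrive while a dispatch is in flight queue their changes, and the queued changes are merged into a single composite dispatch once the in-flight operation completes. A real implementation would also need to report the composite's result back to each waiting caller.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Change mutates a shared (hypothetical) reconfigure spec.
type Change func(spec *[]string)

// Batcher coalesces changes that arrive while a dispatch is in flight
// into a single composite dispatch. Names are illustrative only.
type Batcher struct {
	mu       sync.Mutex
	pending  []Change
	inflight bool
	dispatch func(spec []string) error // e.g. a slow vSphere reconfigure
}

// Apply queues a change. If no dispatch is in flight it dispatches
// immediately; otherwise the change rides along with the next batch.
// (A real implementation would signal completion back to these callers.)
func (b *Batcher) Apply(c Change) error {
	b.mu.Lock()
	b.pending = append(b.pending, c)
	if b.inflight {
		b.mu.Unlock()
		return nil
	}
	b.inflight = true
	b.mu.Unlock()

	for {
		b.mu.Lock()
		if len(b.pending) == 0 {
			b.inflight = false
			b.mu.Unlock()
			return nil
		}
		batch := b.pending
		b.pending = nil
		b.mu.Unlock()

		// Merge all queued changes into one composite spec, then dispatch once.
		var spec []string
		for _, change := range batch {
			change(&spec)
		}
		if err := b.dispatch(spec); err != nil {
			return err
		}
	}
}

func main() {
	b := &Batcher{dispatch: func(spec []string) error {
		fmt.Println("dispatching composite spec:", spec)
		time.Sleep(100 * time.Millisecond) // simulate a slow vSphere operation
		return nil
	}}

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.Apply(func(spec *[]string) { *spec = append(*spec, fmt.Sprintf("exec-%d", i)) })
		}()
	}
	wg.Wait()
}
```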

hickeng commented 6 years ago

The batching mechanism is written and now needs extracting from its in-place location into a distinct package, as this also applies to volume create, disk pull, and host affinity. The outstanding piece for a functional demonstration is the merging of the specs. I tried a general approach to begin with, but it got too large for this specific issue, so it now needs to be made exec-specific.

hickeng commented 6 years ago

I have batching code that is somewhat effective. We still legitimately see concurrent modifications at the port layer level, but we should not be seeing them propagate all the way back to the user under comparatively light loads, yet we are.

The following is needed to determine what's occurring:

  1. add logging of the handle to the port layer rest handlers so the full operation can be tracked (or provide cross component opid which would be better)
  2. determine whether the operations are taking so long that this is a legitimate issue, or
  3. figure out why a random backoff with increasing window is resulting in such frequent collisions - possible error in window implementation?
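
For reference on item 3 just above, a randomised backoff with a growing window typically looks like the following sketch (illustrative only, not the actual retry code under suspicion). If the window fails to grow, or the sleep is drawn from too narrow a range, concurrent reconfigure attempts will keep colliding.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op, sleeping a random duration drawn from a
// window that doubles after each failure. If the window never grows,
// concurrent callers tend to collide repeatedly.
func retryWithBackoff(op func() error, maxAttempts int) error {
	window := 10 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return err
		}
		// Sleep for a random fraction of the current window, then widen it.
		time.Sleep(time.Duration(rand.Int63n(int64(window))))
		window *= 2
	}
	return errors.New("unreachable")
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("concurrent modification")
		}
		return nil
	}, 5)
	fmt.Printf("finished after %d attempts, err=%v\n", attempts, err)
}
```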

Also noted from inspecting the logs - there are far too many task.Inspect operations happening. It may be that @matthewavery's work on #7969 will address this, but if not then I should track it down.

Also need to add the group serialization logic, or we will rarely actually hit the batching path. Currently exercising it via a hardcoded delay but that's not viable for anything except testing.

hickeng commented 6 years ago

I've added #7370 as a dependency because there's something in that space needed to correctly render the opid parent/child relationship in addition to the parentID being passed via swagger.

These two issues are added as dependencies because untangling the cross-component concurrent operations is too unwieldy for effective progress in its current form.

I am going to:

hickeng commented 6 years ago

Having paused most of this work to complete #7969, I've just picked up some testing again and observed a failure to detect the exit of an exec with the palliative locking in place.

It looks as though we're performing a reload at the same time as handleSessionExit. We have the following, which I presume is a mix of two threads (the initializing session blocking on the lock held by handleSessionExit):

May 29 2018 20:35:28.484Z DEBUG [ END ]  [vic/lib/tether.establishNonPty:282] [22.034718ms] initializing nonpty handling for session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.496Z DEBUG Launched command with pid 480
May 29 2018 20:35:28.497Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:28.501Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [4.728902ms]
May 29 2018 20:35:28.518Z DEBUG Encoding result of launch for session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d under key: guestinfo.vice./execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc5
May 29 2018 20:35:28.496Z DEBUG Writer goroutine for stderr returned: %!s(<nil>)
May 29 2018 20:35:28.542Z DEBUG Writer goroutine for stderr exiting
May 29 2018 20:35:28.496Z DEBUG Inspecting children with status change
May 29 2018 20:35:28.553Z DEBUG Reaped process 480, return code: 0
May 29 2018 20:35:28.558Z DEBUG Removed child pid: 480
May 29 2018 20:35:28.535Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.createtime@non-persistent, value: "0"
May 29 2018 20:35:28.502Z DEBUG [0xc420269d70] write "1\n" to 1 writers (err: <nil>)
May 29 2018 20:35:28.583Z DEBUG Writer goroutine for stdout returned: %!s(<nil>)
May 29 2018 20:35:28.589Z DEBUG Writer goroutine for stdout exiting
May 29 2018 20:35:28.578Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.starttime@non-persistent, value: "1527626128"
May 29 2018 20:35:28.611Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.stoptime@non-persistent, value: "0"
May 29 2018 20:35:28.627Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.diagnostics.resurrections@non-persistent, value: "0"
May 29 2018 20:35:28.679Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.status@non-persistent, value: "0"
May 29 2018 20:35:28.698Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.started@non-persistent, value: "true"
May 29 2018 20:35:28.675Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Read:100] [368.928273ms]
May 29 2018 20:35:28.762Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Read:100]
May 29 2018 20:35:28.862Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.runblock@non-persistent, value: "false"
May 29 2018 20:35:28.875Z DEBUG [ END ]  [vic/lib/tether.(*tether).launch:748] [1.670847819s] launching session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.913Z INFO  Loading main configuration
May 29 2018 20:35:28.913Z DEBUG [BEGIN]  [vic/lib/tether.(*tether).handleSessionExit:678] handling exit of session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:28.913Z DEBUG Waiting on session.wait
May 29 2018 20:35:28.914Z DEBUG Wait on session.wait completed
May 29 2018 20:35:28.915Z DEBUG Calling wait on cmd
May 29 2018 20:35:28.915Z DEBUG Wait returned waitid: no child processes
May 29 2018 20:35:28.916Z DEBUG Calling close on writers
May 29 2018 20:35:28.918Z DEBUG [0xc420269d70] Close on writers
May 29 2018 20:35:28.923Z DEBUG [0xc420269d70] Closing writer &{chanType:attach extraData:[101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 48 49
May 29 2018 20:35:28.949Z DEBUG [0xc420269d70] is a CloseWriter%!(EXTRA *ssh.channel=&{attach [101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 4
May 29 2018 20:35:28.977Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:28.941Z DEBUG GuestInfoSource: key: guestinfo.vice./common/name, value: "", error: key not found
May 29 2018 20:35:28.983Z DEBUG GuestInfoSource: key: guestinfo.vice./common/id, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2", error: %!s(<nil>)
May 29 2018 20:35:28.985Z DEBUG GuestInfoSource: key: guestinfo.vice./diagnostics/debug, value: "2", error: %!s(<nil>)
May 29 2018 20:35:28.987Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2", error: %!s(<nil>)
May 29 2018 20:35:28.988Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [10.961665ms]
May 29 2018 20:35:28.989Z DEBUG [0xc420269e00] Close on writers
May 29 2018 20:35:28.991Z DEBUG [0xc420269e00] Closing writer &{code:1 ch:0xc4200a5500}
May 29 2018 20:35:28.993Z DEBUG Calling close on reader
May 29 2018 20:35:28.994Z DEBUG [0xc4203d0880] Close on readers
May 29 2018 20:35:28.995Z DEBUG [0xc4203d0880] Closing reader &{chanType:attach extraData:[101 51 48 100 100 98 97 98 54 102 57 53 54 97 50 100 100 48 52 55 54 52 52 49 57 102 51 48 55 53 55 49 50 54 102 97 48 49
May 29 2018 20:35:29.000Z DEBUG [BEGIN]  [vic/pkg/serial.(*RawConn).Write:147]
May 29 2018 20:35:29.000Z DEBUG [ END ]  [vic/pkg/serial.(*RawConn).Write:147] [14.044µs]                                                                                                                         )
May 29 2018 20:35:29.000Z DEBUG Unblocking session: e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:29.002Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.createtime@non-persistent, value: "0"
May 29 2018 20:35:28.992Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/ExecutionEnvironment, value: "", error: %!s(<nil>)
May 29 2018 20:35:29.006Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/id, value: "70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e00
May 29 2018 20:35:29.009Z DEBUG GuestInfoSource: key: guestinfo.vice./sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/name, value: "", error: %!s(<nil>)
May 29 2018 20:35:29.010Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.starttime@non-persistent, value: "1527626128"
May 29 2018 20:35:29.011Z DEBUG GuestInfoSource: key: sessions|70a90488d7e0e4413fc56dbb769e978cc47ac7132fdf50e009d4abeb4ba52fa2/common/notes, value: "", error: key not found
<snip>
May 29 2018 20:35:34.592Z DEBUG Initializing session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d
May 29 2018 20:35:34.624Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.detail.stoptime@non-persistent, value: "0"
May 29 2018 20:35:34.626Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.diagnostics.resurrections@non-persistent, value: "0"
May 29 2018 20:35:34.628Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.status@non-persistent, value: "0"
May 29 2018 20:35:34.630Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.started@non-persistent, value: "true"
May 29 2018 20:35:34.632Z DEBUG GuestInfoSink: setting key: guestinfo.vice..execs|e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d.runblock@non-persistent, value: "false"
May 29 2018 20:35:34.634Z DEBUG Calling t.ops.HandleSessionExit
May 29 2018 20:35:34.635Z DEBUG [ END ]  [vic/lib/tether.(*tether).handleSessionExit:678] [5.722602311s] handling exit of session e30ddbab6f956a2dd04764419f30757126fa018ccfc58c76cece877e5b43024d

Updating the child reaper to take t.config.Lock() while writing the completion state avoids the StopTime being overwritten, but we still have the problem of the Reload overwriting live session state - the most problematic being the started field.
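
A hedged sketch of that locking approach (type and field names are illustrative, not the actual tether code): the child reaper and the reload path serialise on the same config lock, so the exit record cannot be half-written when a reload runs - though, as noted, a naive reload that replaces the whole session map can still clobber live fields such as started.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// session holds the live state that both the reaper and Reload touch.
// Field names here are illustrative, not the actual tether structures.
type session struct {
	Started  string
	StopTime int64
}

type config struct {
	sync.Mutex
	sessions map[string]*session
}

// reapChild records the exit of a child process while holding the config
// lock, so a concurrent reload cannot overwrite StopTime mid-update.
func (c *config) reapChild(id string, exitTime int64) {
	c.Lock()
	defer c.Unlock()
	if s, ok := c.sessions[id]; ok {
		s.StopTime = exitTime
	}
}

// reload re-applies configuration from the source of truth. Because it
// takes the same lock it serialises against reapChild, but as noted above
// it can still clobber live fields such as Started unless those are merged
// rather than blindly overwritten.
func (c *config) reload(fresh map[string]*session) {
	c.Lock()
	defer c.Unlock()
	c.sessions = fresh
}

func main() {
	c := &config{sessions: map[string]*session{"e30d": {Started: "true"}}}
	go c.reapChild("e30d", time.Now().Unix())
	go c.reload(map[string]*session{"e30d": {Started: ""}})
	time.Sleep(50 * time.Millisecond)
	fmt.Printf("%+v\n", c.sessions["e30d"])
}
```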

hickeng commented 6 years ago

Moving this out of scope for 1.4.1 now as the mitigation work has merged in the branches noted above (#8101). I've put it into 1.4.2 simply to force discussion about whether there is additional work that can/should be done in a patch release.

zjs commented 6 years ago

@lcastellano: Is this completely fixed, or just better mitigated? Is there additional work planned for this? (If not, perhaps we should lower the priority and move back to the backlog.)

renmaosheng commented 6 years ago

https://ci-vic.vmware.com/vmware/vic/19877/7 failed in 1-38-Docker-Exec -> Concurrent Simple Exec. It seems the exec succeeds, but the output is not copied to stdout.

renmaosheng commented 6 years ago

The failure re-occurred in build 19947: https://ci-vic.vmware.com/vmware/vic/19947/7.

hickeng commented 6 years ago

@zjs it's just better mitigated. No additional work planned at this time for 1.4.3. Moving to backlog.

@renmaosheng I'm going to defer any investigation for the prior two comments as we were seeing odd network level issues over this timeframe. If it occurs again we will open a dedicated issue for the output as it's not part of the same symptom set covered by this issue.

renmaosheng commented 6 years ago

@hickeng thanks, we will file a new issue if we see it happen again in the future.

renmaosheng commented 6 years ago

Hi George, we have observed several failures of the 1-38-Docker-Exec -> Concurrent Simple Exec test (https://ci-vic.vmware.com/vmware/vic/20001/7): some of the 'exec ls' calls do not return any content, although the return value is zero. Is this the same issue you are investigating in the PR? The current test starts 50 processes and the failure happens intermittently. Do you think it is OK to move the test to the scenario suite rather than ci-integration, so that it does not block build generation?

Concurrent Simple Exec                                                | FAIL |
'' does not contain 'bin'
luwang-vmware commented 5 years ago

Intermittently hit the '' does not contain 'bin' failure in the 1-38-Docker-Exec -> Concurrent Simple Exec test. The latest result is in https://ci-vic.vmware.com/vmware/vic/20267/.

hickeng commented 5 years ago

Triage of the prior comment, https://github.com/vmware/vic/issues/7410#issuecomment-435311219

Starting point: log.html shows that the target container is Exec-simple-20267-2329 and that it is the 37th iteration of the exec that failed - from this log output we do not know whether subsequent iterations also failed.

Locating the opid: The test does not do anything useful for differentiating between iterations, such as packing the index into an environment variable or similar. We have the Start/End/Elapsed times: 20181101 22:03:25.277 / 20181101 22:03:25.323 / 00:00:00.046 - a quick check of the VCH logs shows these times are not useful to us (the minutes do not align with the operation in question; the hours are UTC-7 at a glance).

As such we can do the following:

  1. locate the 37th ExecCreate to be dispatched to the portlayer from personality.log
  2. locate either:
    1. the 37th exec to be created in the portlayer log
    2. the first exec that returned 0 bytes of output from the portlayer log
  3. try to locate the 37th iteration in tether to get the exec ID from tether.debug
  4. look for an exec that did not return data from the portlayer (as this has effective logging for the purpose)

From tether.debug - found exec ID a4b931bcace3bd6e9e463fd42727985d0ad4f65aef1e331e12fa521fb2824532 as the 37th via the regexp Encoding result of launch for session .* under key: guestinfo.vice./execs (an expression identified simply by looking at tether.debug for a string unique to the launch of exec processes).

From the portlayer, looking for a 0-byte return:

  1. find the time range we're interested in - search for cID 84ce478f049f and extract all operations for that container into a different file.
  2. we expect to see the following pair of lines for stdout/stderr when a session is closed (as an example) - the ID here is the exec ID, not the container ID:
    DEBUG Finished streaming stdout for f6f2ab82587f230f35deb1c2464c3818b249c175e982c74484930fec39dec143 (unwrapped bytes: 71)
    DEBUG Finished streaming stderr for f6f2ab82587f230f35deb1c2464c3818b249c175e982c74484930fec39dec143 (unwrapped bytes: 0)
    • I only find 96 matches, but given that we launched 49 sessions, each with a stdout/stderr pair, I expected to get 98. We need to confirm how many execs were run - using tether.debug as a sanity check for how many processes were launched, I search for Launched command with pid and find 50 matches (one container process, 49 exec processes) as expected. To figure out which exec session is missing, we compare the exec sessions in tether against the portlayer log. We take a fragment of the portlayer log from the container creation through to the last reported update for the container (extraconfig id: 84ce478f0) and the final inspect after that:
      $ cat tether.debug | awk -F '|' '/Encoding result of launch for session .* under key: guestinfo.vice.\/execs/ {print$2}' | sed 's/@.*//' | sort -u > execs.list
      $ cat pl-fragment.log | awk '/DEBUG Finished streaming stdout for/ {print$10}' | sort -u > finished.list
      $ diff execs.list finished.list
      27d26
      < 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79

      Checking tether.debug (which is awkward because it doesn't report opIDs for differentiating between threads) we see the following IDs for the stdout/err/in MultiReader/MultiWriters - we expect streams from an SSH session to be added to them when the attach succeeds:

      Nov  2 2018 05:03:43.156Z DEBUG Initializing session 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79
      Nov  2 2018 05:03:43.159Z DEBUG [BEGIN]  [main.(*operations).SessionLog:85] configure session log writer
      Nov  2 2018 05:03:43.160Z WARN  unable to log more than one session concurrently to persistent logging
      Nov  2 2018 05:03:43.162Z DEBUG [0xc4202b3260] created multiwriter
      Nov  2 2018 05:03:43.164Z DEBUG [0xc4202b3320] created multiwriter
      Nov  2 2018 05:03:43.165Z DEBUG [ END ]  [main.(*operations).SessionLog:85] [6.139973ms] configure session log writer
      Nov  2 2018 05:03:43.166Z DEBUG [0xc420481740] created multireader

      Searching for the MultiWriters we get the following for stdout - if there are no writers present then pkg/dio/writer skips writing entirely, leading to the log message write "" to 0 writers. This logs what was relayed to the added writers, not what was passed into the Write call (see the sketch after these excerpts):

      Line 5135: Nov  2 2018 05:03:43.162Z DEBUG [0xc4202b3260] created multiwriter
      Line 10755: Nov  2 2018 05:04:07.367Z DEBUG [0xc4202b3260] write "" to 0 writers (err: <nil>)
      Line 10779: Nov  2 2018 05:04:07.443Z DEBUG [0xc4202b3260] added writer - now 1 writers
      Line 10780: Nov  2 2018 05:04:07.444Z DEBUG [0xc4202b3260] Writer 0 [0xc42038a9c0]
      Line 10808: Nov  2 2018 05:04:07.500Z DEBUG [0xc4202b3260] Close on writers

      whereas, in contrast, we have the following for a successful exec (stdout):

      Line 834: Nov  2 2018 05:03:26.809Z DEBUG [0xc4202b3170] created multiwriter
      Line 891: Nov  2 2018 05:03:28.130Z DEBUG [0xc4202b3170] added writer - now 1 writers
      Line 892: Nov  2 2018 05:03:28.131Z DEBUG [0xc4202b3170] Writer 0 [0xc42038b980]
      Line 927: Nov  2 2018 05:03:28.195Z DEBUG [0xc4202b3170] write "bin\ndev\netc\nhome\nlib\nlost+found\nmnt\nproc\nroot\nrun\nsbin\nsys\ntmp\nusr\nvar\n" to 1 writers (err: <nil>)
      Line 951: Nov  2 2018 05:03:28.246Z DEBUG [0xc4202b3170] Close on writers
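
To illustrate the write "" to 0 writers behaviour referenced above, here is a minimal dynamic multi-writer sketch (an illustrative stand-in, not the actual pkg/dio implementation): anything written before a writer is attached is dropped, which is consistent with the exec output never reaching the client when the attach loses the race against process completion.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

// dynamicMultiWriter relays writes to whatever writers are currently
// attached. Illustrative only - not the actual pkg/dio implementation.
type dynamicMultiWriter struct {
	mu      sync.Mutex
	writers []io.Writer
}

func (m *dynamicMultiWriter) Add(w io.Writer) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.writers = append(m.writers, w)
}

// Write drops the data entirely if no writers are attached yet - the
// source of a "write \"\" to 0 writers" style log line when the attach
// loses the race against process output.
func (m *dynamicMultiWriter) Write(p []byte) (int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	relayed := ""
	for _, w := range m.writers {
		if _, err := w.Write(p); err != nil {
			return 0, err
		}
		relayed = string(p)
	}
	fmt.Printf("write %q to %d writers\n", relayed, len(m.writers))
	return len(p), nil
}

func main() {
	mw := &dynamicMultiWriter{}
	mw.Write([]byte("bin\ndev\netc\n")) // dropped: no writers attached yet
	mw.Add(os.Stdout)
	mw.Write([]byte("late output\n")) // relayed: one writer attached
}
```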

Checking the portlayer we see that we timed out waiting for the interactive connection to complete:

Nov  2 2018 05:04:49.905Z DEBUG [BEGIN]  [vic/lib/portlayer/attach/communication.(*interaction).Unblock:227]
Nov  2 2018 05:04:49.905Z ERROR failed to unblock the other side: EOF
...
Nov  2 2018 05:04:53.522Z ERROR attach connector: Connection not found error for id:8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79: context deadline exceeded

The EOF error when trying to unblock the connection is unusual - normally this would be due to the SSH connection having dropped, but there's no evidence of that having happened and no renegotiation of the connection.

Checking in the tether to see whether we received the unblock request: we did, but again there's the unexpected EOF error:

Nov  2 2018 05:04:07.610Z INFO  Received UnblockReq for 8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79
Nov  2 2018 05:04:07.612Z DEBUG Sending channel request reply true back
Nov  2 2018 05:04:07.613Z WARN  Failed replying to a channel request: EOF

Following this logic, we look for where we block waiting for the unblock request, and we find that we do not! There is no message about it for the 8adc6 session. We check whether the session was configured to block, and it was not - runblock is set to false:

    Line 9403: Nov  2 2018 05:04:02.384Z DEBUG GuestInfoSource: key: guestinfo.vice..execs|8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79.runblock@non-persistent, value: "false", error: %!s(<nil>)
    Line 9404: Nov  2 2018 05:04:02.387Z DEBUG GuestInfoSource: key: guestinfo.vice./execs|8adc6e0173de182343cd6d991187173f04e30f67e82a7734f62dd863765d2b79/active@non-persistent, value: "true", error: %!s(<nil>)

The runblock value is set by a call to portlayer attach.Bind() during the same set of calls which activates the session (and we can see the session is set to active in the keys just above):

Nov  2 2018 05:04:02.435Z DEBUG [BEGIN]  [vic/lib/portlayer/attach.Bind:26]
Nov  2 2018 05:04:02.435Z DEBUG Found a device with desired backing: &types.VirtualSerialPort{VirtualDevice:types.VirtualDevice{DynamicData:types.DynamicData{}, Key:9000, DeviceInfo:(*types.Description)(0xc4231e3560), Backing:(*types.VirtualSerialPortURIBackingInfo)(0xc422731d70), Connectable:(*types.VirtualDeviceConnectInfo)(0xc4231e3a00), SlotInfo:types.BaseVirtualDeviceBusSlotInfo(nil), ControllerKey:400, UnitNumber:(*int32)(0xc4227b5cdc)}, YieldOnPoll:true}
Nov  2 2018 05:04:02.435Z DEBUG Already in the desired state, (connected: true, proxyURI: telnet://10.158.214.94:2377)
Nov  2 2018 05:04:02.435Z DEBUG [ END ]  [vic/lib/portlayer/attach.Bind:26] [113.678µs] 

I attempted to confirm the value was set in vmware.log, but that file has been throttled.

I suspect that this is the only exec that failed to return output - that's the behaviour expected based on what's observed in the logs, but the test needs updating to allow explicit confirmation.

Tasks:

luwang-vmware commented 5 years ago

@hickeng for the following request, I would like to double-check with you: where do you mean to inject the index? Do you mean to inject the index into the docker exec process, such as using "docker exec -v index=** sh -c 'echo $index; time /bin/ls'" instead of the existing command?

hickeng commented 5 years ago

@luwang-vmware I think you can use docker exec -e index=${idx} <container_id> /bin/ls - all we need is a way of correlating the ContainerCreate call with the failing attempt in the docker personality log. An environment variable should be effective without actually changing the behaviour of the test.

luwang-vmware commented 5 years ago

@hickeng Updated the test scripts per the request above, and the issue was hit again in CI testing. The detailed log is at https://ci-vic.vmware.com/vmware/vic/20326/7. We can see that only the 29th iteration of the exec hit the issue.

luwang-vmware commented 5 years ago

The issue was hit again in recent CI testing; detailed logs are at https://ci-vic.vmware.com/vmware/vic/20509/9. From the logs, only the 36th iteration of the exec has the issue.

renmaosheng commented 5 years ago

Moving to 1.5.1 for continued investigation.