Closed kubrat closed 5 years ago
Which version are you using?
C:\Users\rpeter>rush -V rush v0.3.0
Checking new version... You are using the latest version of rush
Here is a standalone repro which can be copy pasted into a windows cmd prompt
(
echo 1
echo 10
) | rush -e -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"
It'll show a ping
still going after rush
exits. Remove the -e
and everything is clean as expected
With -e
C:\Users\rpeter>(
More? echo 1
More? echo 10
More? ) | rush -e -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"
8.8.8.8 && exit 1: exit status 1
[ERRO] stop on first error(s)
Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Ping statistics for 8.8.8.8:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 6ms, Maximum = 6ms, Average = 6ms
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
PING.EXE 16000 Console 1 3,276 K
Without -e
C:\Users\rpeter>(
More? echo 1
More? echo 10
More? ) | rush -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"
8.8.8.8 && exit 1: exit status 1
8.8.8.8 && exit 1: exit status 1
Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Ping statistics for 8.8.8.8:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 6ms, Maximum = 6ms, Average = 6ms
Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=8ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=7ms TTL=120
Reply from 8.8.8.8: bytes=32 time=15ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Ping statistics for 8.8.8.8:
Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 6ms, Maximum = 15ms, Average = 7ms
INFO: No tasks are running which match the specified criteria.
I think that rush
should be doing a similar reaping of the child processes on windows in the case of a -e
failure as it does like receiving a ctrl+c
. What do you think @shenwei356 @kubrat?
If you agree I can work on this and put a PR together?
Totally agreed, PR is welcome!
pwr22:
Thanks for your contribution. But can you put the new behavior behind a command line arg? I have some workflows where I want the old -e behavior. For cases where I want to fail on the first error but not kill processes, so those processes can do clean up.
@shenwei356 I've seen that -e
doesn't work on linux either - seems to leave the old processes hanging around
@bburgin @shenwei356 putting this behind a different flag is fine with me. The current behaviour of -e
is to exit as soon as a child process fails but it leaves the others untouched. I'll call this exit-early
for now
The behaviour I need is that as soon as any child process exits with failure that the entire batch is considered failed and terminated. I'll call this fail-fast
for now
@bburgin I'd be interested to hear what workflows you have where you want to run a batch of stuff, have part of them fail, know that part failed but then have the rest of the batch running unmonitored in the background?
I'm not imaginitive enough 😆 to think of a scenario where running without -e
wouldn't be better, since it leaves the other processes running but waits for them to complete and reports if any subsequent ones fail etc
For cases where I want to fail on the first error but not kill processes, so those processes can do clean up.
I'm not really familiar with Go's command execution stuff but doesn't it give processes a graceful signal and then some time to clean up before force killing? If it doesn't then that doesn't sound particularly well behaved...
pwr22:
The go exec library does not correctly kill processes on Windows. Thus, I added some code to rush to call taskkill on child processes, if you pass rush --kill-on-ctrl-c
if isWindows {
childProcess, err := psutil.NewProcess(int32(command.Process.Pid))
if err != nil {
Log.Error(err)
}
killWindowsProcessTreeRecursive(childProcess)
} else {
command.Process.Kill()
}
The recursive method will kill leaf child processes first and proceed up the child process tree from there.
I would still like to have a different arg for the kill case. How about a --kill-on-error arg ? This is so I can use the existing -e behavior (that doesn't kill), for cases like this:
Like you said, if any of these operations is short enough, I could just use the normal rush behavior, without the -e. But some of the above operations take awhile and I would like to exit early if I can. To help do this and still cleanup, you may want to look into a more graceful -e logic. You could make rush send sigint up the child process tree from leaf children, and then wait for the children to exit, without killing them. But I am unsure if the go signal library can send sigint to child processes on Windows.
If you want, you could also apply the sigint logic you make to --kill-on-ctrl-c and --kill-on-error. Rush could send sigint, wait for some timeout time, and then call the kill recursive logic I mentioned above. But I would like an additional arg for the timeout, so the user can specify the timeout time to wait before killing. Here is an example of a wait loop: https://stackoverflow.com/a/11886829
Thoughts?
I'll play with exec
and things a bit and see what I can come up with
At the moment my usage of rush
is through Jenkins which seems to clean up the process tree well enough at the end of a run but something graceful within rush
would still probably be better
As far as I can tell the Windows signalling story is not good, especially for console applications. Outside of the equivalent of SIGKILL
there doesn't really seem to be a standard signalling mechanism beyond CTRL+C
keyboard events (and a few others but that'll be the best one to aim for)
So that's what I'll try to send from Go
Sounds good. Maybe something like this code?: https://github.com/mackerelio/mackerel-agent/blob/master/wix/wrapper/wrapper_windows.go
The go exec library does not correctly kill processes on Windows.
@bburgin could you clarify a bit more on this?
My guess based on the implementation of killWindowsProcessTreeRecursive
is that killing something like cmd.exe
didn't result in killing its children? If so then this is not a Go problem per se but a peculiarity of Windows
Not sure why there's any code to do killing on ctrl+c
at all though - wouldn't it be better to just leave it to signal propagation to let children close up gracefully?
Edit: a guess on why this could be - does rush
need to do some housekeeping before stopping (flushing output etc) and does catching the signal stop it propagating or something?
Sounds good. Maybe something like this code?: https://github.com/mackerelio/mackerel-agent/blob/master/wix/wrapper/wrapper_windows.go
Looks quite similar to this I've found in the tests of Go itself 😆. I tested it and it seems to do the job 👍
I've spent the last hour or so trying to find the appropriate place in the codebase to insert these changes but I think its a bit beyond me at this point 😕. Any chance someone more familiar with the codebase would like to take this up now that we have the "how"?
I guess the semantics I'd ideally like for -e
would be for rush to detect the first failure, signal gracefully all the other processes (this ctrl+c
is the best we can do on Windows but it should probably be SIGTERM
on *NIX) and then exit itself. I don't really care too much about rush chasing up after misbehaving processes and I'd probably suggest that be behind some sort of --force
type flag anyway
pwr22:
I am guessing that, yes, handling ctrl+c in rush means that no child process get ctrl+c unless rush explicitly sends it to the child processes. I have not tested this.
I added killWindowsProcessTreeRecursive because some child processes do not respond to ctrl+c and must be killed to be aborted. With this, I can avoid having to launch process explorer and manually clean up children, each time I want to abort rush.
For who to implement the fix, I don't have the bandwidth to work on it. And it is more work than just adjusting the existing -e arg. We would need a new arg for the new behavior. Also, there would need to be a timeout arg, to allow the user to control the timeout time for the wait loop: the wait in between sending ctrl+c, waiting for the children to exit, and then killing them if they have not exited.
I am guessing that, yes, handling ctrl+c in rush means that no child process get ctrl+c unless rush explicitly sends it to the child processes. I have not tested this.
I tested exec
in isolation to confirm this and didn't see this behaviour 🤷♂️
For who to implement the fix, I don't have the bandwidth to work on it. And it is more work than just adjusting the existing -e arg
That's fair enough. I took at look at how easy it would be to bolt it into the existing -e
and given the flow of the program it'd be non-trivial for me to work out where to put it
Example:
seq 1 10 | rush -j 10 "sleep {} && exit 1" -e
With
-e
rush will exit immediately as expected as soon as the first error is encountered but the rest of the jobs will keep running until completion which is a problem in my use case.Note that this a Windows specific issue and the example uses the familiar
seq
andsleep
which I have obtained through MSYS2.