shenwei356 / rush

A cross-platform command-line tool for executing jobs in parallel
https://github.com/shenwei356/rush
MIT License
846 stars 63 forks source link

rush -e does not kill jobs #17

Closed kubrat closed 5 years ago

kubrat commented 5 years ago

Example: seq 1 10 | rush -j 10 "sleep {} && exit 1" -e

With -e rush will exit immediately as expected as soon as the first error is encountered but the rest of the jobs will keep running until completion which is a problem in my use case.

Note that this a Windows specific issue and the example uses the familiar seq and sleep which I have obtained through MSYS2.

shenwei356 commented 5 years ago

Which version are you using?

pwr22 commented 5 years ago

C:\Users\rpeter>rush -V rush v0.3.0

Checking new version... You are using the latest version of rush

Here is a standalone repro which can be copy pasted into a windows cmd prompt

(
  echo 1
  echo 10
) | rush -e -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"

It'll show a ping still going after rush exits. Remove the -e and everything is clean as expected

With -e

C:\Users\rpeter>(
More?   echo 1
More?   echo 10
More? ) | rush -e -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"
 8.8.8.8 && exit 1: exit status 1
[ERRO] stop on first error(s)

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120

Ping statistics for 8.8.8.8:
    Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 6ms, Maximum = 6ms, Average = 6ms

Image Name                     PID Session Name        Session#    Mem Usage
========================= ======== ================ =========== ============
PING.EXE                     16000 Console                    1      3,276 K

Without -e

C:\Users\rpeter>(
More?   echo 1
More?   echo 10
More? ) | rush -- "ping -n {} 8.8.8.8 && exit 1" || tasklist -fi "Imagename eq ping.exe"
 8.8.8.8 && exit 1: exit status 1
 8.8.8.8 && exit 1: exit status 1

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120

Ping statistics for 8.8.8.8:
    Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 6ms, Maximum = 6ms, Average = 6ms

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=8ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=7ms TTL=120
Reply from 8.8.8.8: bytes=32 time=15ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120
Reply from 8.8.8.8: bytes=32 time=6ms TTL=120

Ping statistics for 8.8.8.8:
    Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 6ms, Maximum = 15ms, Average = 7ms
INFO: No tasks are running which match the specified criteria.
pwr22 commented 5 years ago

I think that rush should be doing a similar reaping of the child processes on windows in the case of a -e failure as it does like receiving a ctrl+c. What do you think @shenwei356 @kubrat?

If you agree I can work on this and put a PR together?

shenwei356 commented 5 years ago

Totally agreed, PR is welcome!

bburgin commented 5 years ago

pwr22:

Thanks for your contribution. But can you put the new behavior behind a command line arg? I have some workflows where I want the old -e behavior. For cases where I want to fail on the first error but not kill processes, so those processes can do clean up.

pwr22 commented 5 years ago

@shenwei356 I've seen that -e doesn't work on linux either - seems to leave the old processes hanging around

@bburgin @shenwei356 putting this behind a different flag is fine with me. The current behaviour of -e is to exit as soon as a child process fails but it leaves the others untouched. I'll call this exit-early for now

The behaviour I need is that as soon as any child process exits with failure that the entire batch is considered failed and terminated. I'll call this fail-fast for now

@bburgin I'd be interested to hear what workflows you have where you want to run a batch of stuff, have part of them fail, know that part failed but then have the rest of the batch running unmonitored in the background?

I'm not imaginitive enough 😆 to think of a scenario where running without -e wouldn't be better, since it leaves the other processes running but waits for them to complete and reports if any subsequent ones fail etc

pwr22 commented 5 years ago

For cases where I want to fail on the first error but not kill processes, so those processes can do clean up.

I'm not really familiar with Go's command execution stuff but doesn't it give processes a graceful signal and then some time to clean up before force killing? If it doesn't then that doesn't sound particularly well behaved...

bburgin commented 5 years ago

pwr22:

The go exec library does not correctly kill processes on Windows. Thus, I added some code to rush to call taskkill on child processes, if you pass rush --kill-on-ctrl-c

if isWindows {
    childProcess, err := psutil.NewProcess(int32(command.Process.Pid))
        if err != nil {
            Log.Error(err)
        }
        killWindowsProcessTreeRecursive(childProcess)
    } else {
        command.Process.Kill()
}

The recursive method will kill leaf child processes first and proceed up the child process tree from there.

I would still like to have a different arg for the kill case. How about a --kill-on-error arg ? This is so I can use the existing -e behavior (that doesn't kill), for cases like this:

  1. Build tools that modify the file system The tools know how to clean themselves up on exit. But if they get killed, I don't know what files the tools touched or left in a bad state. I could run a general cleanup script at the start of every run of the tools, but this adds time to the build and I have to maintain a list of files / folders to cleanup.
  2. Client apps that send operations to a server and then wait on them to be done If the clients get killed, the server will just continue executing the operations, using up server resources. If rush's -e could send a signit to child processes, that would help. The sigint could allow the client to signal the server to cancel its operation, freeing up server resources. But a kill would cause the server resources to be used until the operation completes.
  3. Test tools that report results to a network server, but only at end of the test run I still want the results posted to the server, even on error.
  4. A hardware test app that turns voltage output on and off, during a test sequence If the app gets killed, it could leave the voltage output on for an indeterminate amount of time. This would be a user hazard, if the voltage is high.

Like you said, if any of these operations is short enough, I could just use the normal rush behavior, without the -e. But some of the above operations take awhile and I would like to exit early if I can. To help do this and still cleanup, you may want to look into a more graceful -e logic. You could make rush send sigint up the child process tree from leaf children, and then wait for the children to exit, without killing them. But I am unsure if the go signal library can send sigint to child processes on Windows.

If you want, you could also apply the sigint logic you make to --kill-on-ctrl-c and --kill-on-error. Rush could send sigint, wait for some timeout time, and then call the kill recursive logic I mentioned above. But I would like an additional arg for the timeout, so the user can specify the timeout time to wait before killing. Here is an example of a wait loop: https://stackoverflow.com/a/11886829

Thoughts?

pwr22 commented 5 years ago

I'll play with exec and things a bit and see what I can come up with

At the moment my usage of rush is through Jenkins which seems to clean up the process tree well enough at the end of a run but something graceful within rush would still probably be better

pwr22 commented 5 years ago

As far as I can tell the Windows signalling story is not good, especially for console applications. Outside of the equivalent of SIGKILL there doesn't really seem to be a standard signalling mechanism beyond CTRL+C keyboard events (and a few others but that'll be the best one to aim for)

So that's what I'll try to send from Go

bburgin commented 5 years ago

Sounds good. Maybe something like this code?: https://github.com/mackerelio/mackerel-agent/blob/master/wix/wrapper/wrapper_windows.go

pwr22 commented 5 years ago

The go exec library does not correctly kill processes on Windows.

@bburgin could you clarify a bit more on this?

My guess based on the implementation of killWindowsProcessTreeRecursive is that killing something like cmd.exe didn't result in killing its children? If so then this is not a Go problem per se but a peculiarity of Windows

Not sure why there's any code to do killing on ctrl+c at all though - wouldn't it be better to just leave it to signal propagation to let children close up gracefully?

Edit: a guess on why this could be - does rush need to do some housekeeping before stopping (flushing output etc) and does catching the signal stop it propagating or something?

pwr22 commented 5 years ago

Sounds good. Maybe something like this code?: https://github.com/mackerelio/mackerel-agent/blob/master/wix/wrapper/wrapper_windows.go

Looks quite similar to this I've found in the tests of Go itself 😆. I tested it and it seems to do the job 👍

I've spent the last hour or so trying to find the appropriate place in the codebase to insert these changes but I think its a bit beyond me at this point 😕. Any chance someone more familiar with the codebase would like to take this up now that we have the "how"?

I guess the semantics I'd ideally like for -e would be for rush to detect the first failure, signal gracefully all the other processes (this ctrl+c is the best we can do on Windows but it should probably be SIGTERM on *NIX) and then exit itself. I don't really care too much about rush chasing up after misbehaving processes and I'd probably suggest that be behind some sort of --force type flag anyway

bburgin commented 5 years ago

pwr22:

I am guessing that, yes, handling ctrl+c in rush means that no child process get ctrl+c unless rush explicitly sends it to the child processes. I have not tested this.

I added killWindowsProcessTreeRecursive because some child processes do not respond to ctrl+c and must be killed to be aborted. With this, I can avoid having to launch process explorer and manually clean up children, each time I want to abort rush.

For who to implement the fix, I don't have the bandwidth to work on it. And it is more work than just adjusting the existing -e arg. We would need a new arg for the new behavior. Also, there would need to be a timeout arg, to allow the user to control the timeout time for the wait loop: the wait in between sending ctrl+c, waiting for the children to exit, and then killing them if they have not exited.

pwr22 commented 5 years ago

I am guessing that, yes, handling ctrl+c in rush means that no child process get ctrl+c unless rush explicitly sends it to the child processes. I have not tested this.

I tested exec in isolation to confirm this and didn't see this behaviour 🤷‍♂️

For who to implement the fix, I don't have the bandwidth to work on it. And it is more work than just adjusting the existing -e arg

That's fair enough. I took at look at how easy it would be to bolt it into the existing -e and given the flow of the program it'd be non-trivial for me to work out where to put it