mtkennerly / shawl

Windows service wrapper for arbitrary commands
MIT License
520 stars 15 forks source link

Integrate with Window Service Recovery functions for "failure" not only for "stop with error" #45

Closed adiroiban closed 11 months ago

adiroiban commented 11 months ago

Exit the service with a failure when the process exit code is not 0 and using --no-restart

It would help to somehow forward the wrapped process exit code when configured with --no-restart

In this way you can use the Windows Serive "Recovery" helpers

image


To reproduce, I have this shawl run

C:\test\shawl.exe run --name shawl-test --no-restart  --cwd "\\?\C:\test" --log-dir "\\?\C:\test" -- ping -n 1000 127.0.0.1

Start the service. It will start ok and and in the logs you will see

2023-12-07 18:48:01 [DEBUG] Entering main service loop
2023-12-07 18:48:01 [INFO] Launching command
2023-12-07 18:48:01 [DEBUG] stdout: "Pinging 127.0.0.1 with 32 bytes of data:"
2023-12-07 18:48:01 [DEBUG] stdout: "Reply from 127.0.0.1: bytes=32 time<1ms TTL=128"
2023-12-07 18:48:02 [DEBUG] stdout: "Reply from 127.0.0.1: bytes=32 time<1ms TTL=128"

Open Windows Task Manager and end the TCP Ping process.

In the logs, you will see

2023-12-07 18:49:06 [DEBUG] stdout: "Reply from 127.0.0.1: bytes=32 time<1ms TTL=128"
2023-12-07 18:49:07 [DEBUG] stdout: "Reply from 127.0.0.1: bytes=32 time<1ms TTL=128"
2023-12-07 18:49:08 [INFO] Command exited with code 1
2023-12-07 18:49:08 [DEBUG] Exited main service loop
2023-12-07 18:49:08 [DEBUG] Finished successfully

If you check Windows Event logs, in the "Windows Logs -> System" section , you will see info event ID 7036 "The shawl-test service entered the stopped state."


If instead of ending the TCP ping process, you end the shawl-exe process, you will see error event ID

7031 "The shawl-test service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service."


Expected result: If --no-restart is configured and wrapped process exit code is not 0, trigger a Windows Service failure.

Actual result: Even if the wrapped process exitcode is 1, the shawl Windows server is stopped like a normal succesfull stop

adiroiban commented 11 months ago

I think that the automatic process restart implemented in shawl is nice... but can lead to hiding some important bugs.

The process is restarted in the background ... with only a log in shawl log file.

But I guess most sysadmin are monitoring the Windows Service Event Log events. And if you check the Windows Event Log, there is nothing to hint about the restarts.

mtkennerly commented 11 months ago

Hi! Strangely, I'm not seeing the same behavior that you have. Do you have "enable actions for stops with errors" checked? On my end, it won't run the recovery action unless I check that option. Either way, I do see an event log for the error, though.

When I end the TCP Ping process, I see these event logs:

image image

This is my service Shawl invocation:

C:\git\shawl\target\debug\shawl.exe run --name shawl-test --no-restart --cwd \\?\C:\tmp --log-dir \\?\C:\tmp -- ping -n 1000 127.0.0.1

The recovery options:

image

Logger script:

import sys

with open("C:/tmp/service-logger.log", "a") as f:
    f.write("Args: {}\n".format(sys.argv[1:]))
adiroiban commented 11 months ago

Many thanks for looking into this.

I have updated the title of this ticket.

If I have "Enable actions for stops with errors", I get the recovery action triggered.

I now understand what is going on here.

I was thinking that Enable actions for stops with errors for designed to be used for the case when the process stop fails , for example if stops times out.

But it is to be used for any stop with failure.


The regular "First failure" or "Second failure" are when the main service process is killed.

But shawl.exe is always up and running and it does a clean shutdown with an error code

image

This is why the regular failure events can't be used for shawl.


For reference, in the case someone is looking into this in the future.

I am using Windows 2022 server , with shawl.exe 1.4.0 release

If I don't have "Enable actions for stops with errors" I get this.

I am trying to get this triggers

In the screeshot the last event is for ending the "shawl.exe" proces, the other 2 are for ending the ping process.

image

For the ping process I have event 7024, but it is not followed by 7031

image


If I have" Enable actions for stops with errors" the python script is called

image

adiroiban commented 11 months ago

If we want to use the regular failues it would mean that shawl.exe should end with non zero exit code

I see that shawl is sending the Windows Service stop with error status, but then swawl.exe itself end with OK.

https://github.com/mtkennerly/shawl/blob/88f3a5949640b352b80a14ceef81c5e4be438095/src/service.rs#L374-L384

I think that here, it might help to have non zero exit code

But then you will have both stop with failure and service end with failure ... and if you have "Enable actions for stops with errors" both will be called.


One options is not to report the status via status_handle.set_service_status(ServiceStatus { but just exit with non zero code.

But this will break backward compatibilty... so we might need (yet :( ) another flag.

adiroiban commented 11 months ago

So in Windows we have 3 service end conditions

Condition A will never trigger any recovery action.

Condition B will only rigger the recovery if Enable actions for stops with errors is checked

Condition C is not currently supported in shawl ... it is only trigger if somehow shawl itself crashes.


With the current code if we send Win API stop servie with error event and also end the shawl.exe process with non zero we might end up with the recovery action called twice.


I am going to close this as I don't have a good enought solution for this.

The Enable actions for stops with errors can be a workaround.

If others need this functionality, we can reopen this and see how it can be done.


Many thanks again for your excelent work.

mtkennerly commented 11 months ago

Thanks for the extra detail :+1:

For now, I've added a note to the README about how this works: https://github.com/mtkennerly/shawl#recovery