Closed adiroiban closed 11 months ago
I think that the automatic process restart implemented in shawl is nice... but can lead to hiding some important bugs.
The process is restarted in the background ... with only a log in shawl log file.
But I guess most sysadmin are monitoring the Windows Service Event Log events. And if you check the Windows Event Log, there is nothing to hint about the restarts.
Hi! Strangely, I'm not seeing the same behavior that you have. Do you have "enable actions for stops with errors" checked? On my end, it won't run the recovery action unless I check that option. Either way, I do see an event log for the error, though.
When I end the TCP Ping process, I see these event logs:
This is my service Shawl invocation:
C:\git\shawl\target\debug\shawl.exe run --name shawl-test --no-restart --cwd \\?\C:\tmp --log-dir \\?\C:\tmp -- ping -n 1000 127.0.0.1
The recovery options:
Logger script:
import sys
with open("C:/tmp/service-logger.log", "a") as f:
f.write("Args: {}\n".format(sys.argv[1:]))
Many thanks for looking into this.
I have updated the title of this ticket.
If I have "Enable actions for stops with errors", I get the recovery action triggered.
I now understand what is going on here.
I was thinking that Enable actions for stops with errors
for designed to be used for the case when the process stop fails , for example if stops times out.
But it is to be used for any stop with failure.
The regular
"First failure" or "Second failure" are when the main service process is killed.
But shawl.exe is always up and running and it does a clean shutdown with an error code
This is why the regular
failure events can't be used for shawl.
For reference, in the case someone is looking into this in the future.
I am using Windows 2022 server , with shawl.exe 1.4.0 release
If I don't have "Enable actions for stops with errors" I get this.
I am trying to get this triggers
In the screeshot the last event is for ending the "shawl.exe" proces, the other 2 are for ending the ping process.
For the ping process I have event 7024, but it is not followed by 7031
If I have" Enable actions for stops with errors" the python script is called
If we want to use the regular
failues it would mean that shawl.exe should end with non zero exit code
I see that shawl is sending the Windows Service stop with error status, but then swawl.exe itself end with OK.
I think that here, it might help to have non zero exit code
But then you will have both stop with failure and service end with failure ... and if you have "Enable actions for stops with errors" both will be called.
One options is not to report the status via status_handle.set_service_status(ServiceStatus {
but just exit with non zero code.
But this will break backward compatibilty... so we might need (yet :( ) another flag.
So in Windows we have 3 service end conditions
Condition A will never trigger any recovery action.
Condition B will only rigger the recovery if Enable actions for stops with errors
is checked
Condition C is not currently supported in shawl ... it is only trigger if somehow shawl itself crashes.
With the current code if we send Win API stop servie with error event and also end the shawl.exe process with non zero we might end up with the recovery action called twice.
I am going to close this as I don't have a good enought solution for this.
The Enable actions for stops with errors
can be a workaround.
If others need this functionality, we can reopen this and see how it can be done.
Many thanks again for your excelent work.
Thanks for the extra detail :+1:
For now, I've added a note to the README about how this works: https://github.com/mtkennerly/shawl#recovery
Exit the service with a failure when the process exit code is not 0 and using --no-restart
It would help to somehow forward the wrapped process exit code when configured with
--no-restart
In this way you can use the Windows Serive "Recovery" helpers
To reproduce, I have this shawl run
Start the service. It will start ok and and in the logs you will see
Open Windows Task Manager and end the TCP Ping process.
In the logs, you will see
If you check Windows Event logs, in the "Windows Logs -> System" section , you will see info event ID 7036 "The shawl-test service entered the stopped state."
If instead of ending the TCP ping process, you end the shawl-exe process, you will see error event ID
7031 "The shawl-test service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service."
Expected result: If
--no-restart
is configured and wrapped process exit code is not 0, trigger a Windows Service failure.Actual result: Even if the wrapped process exitcode is 1, the shawl Windows server is stopped like a normal succesfull stop