Open elondaits opened 6 years ago
I imagine it would be possible for OMD to check system logs for messages from hilbert-station
. For that OMD would require a custom server-side check, which would SSH to specified station (will need access to ~/.ssh/
like Dashboard) and inspect the system logs... i will try to enable log-monitoring via the current omd_agent
and see how that may work. Note that Dashboard will still need to force OMD to perform status refresh on specific station!
Note that I list two different "possible improvements". One is related to the information that OMD gives me and the other does not involve OMD but instead the output from the hilbert cli process, which I already receive and could parse to show more info to the user.
yes, it is already possible to monitor the progress of app_change
and VERY briefly of poweroff
calls.
the logging output from hilbert-station
to console looks as follows (but colored according to message type):
DEBUG [hilbert-station:???] DEBUG_MESSAGE
INFO [hilbert-station:cmd_???] INFO_MESSAGE
WARNING [hilbert-station:cmd_???] WARNING_MESSAGE
ERROR [hilbert-station:cmd_???] ERROR_MESSAGE
The log-level is increased by adding-v
switches to hilbert-station
. Current default: only ERROR and WARNING are visible, -v
adds INFO and -vv
outputs all log messages.
There is a way to control the log-level of remotely called hilbert-station
from server-side hilbert
tool: same verbosity switches will be forwarded to hilbert-station
.
Note that logging output over SSH has no colors but special prefixes instead:
=== DEBUG [hilbert-station:???] DEBUG_MESSAGE
INFO [hilbert-station:cmd_???] INFO_MESSAGE
!!! WARNING [hilbert-station:cmd_???] WARNING_MESSAGE
??? ERROR [hilbert-station:cmd_???] ERROR_MESSAGE
@elondaits If existing logging is not exactly what you wanted - it is possible either to adjust the current logging or to add analogous logging specifically for auto-tracking the execution progress by the Dashboard.
@porst17 BTW what about configuring all stations to immediately auto-send the logged messages to the server (if network if present)? That way we will easily be able to track the progress of startup / shutdown procedures...
maybe it is something to consider for the future.
@porst17 @elondaits please note that i changed/extended my comment above (https://github.com/hilbert/hilbert-cli/issues/83#issuecomment-373185546)
Regular logging messages are for system admins, who still have the log if they need it. What we need is markers at specific points (any points before a significant delay or after a big error) so we can give feedback to the user. Just like the status bar of a browser that tells you "Resolving host X..." and then "Connecting to X...", "Waiting for X...", etc. so you have an approximate idea of what is taking time and whether the operation is progressing or got stuck at some point.
hilbert-station
already sends custom messages to system logs. Thus it make any observer (e.g. OMD) able to track what exactly is it doing almost instantly.
Looks like you also want some output from hilbert
server-side tool regarding its remote execution actions. That should also be visible if log-level is verbose enough.
No.
I already receive output from the hilbert cli. It's the one I show when you click on the "monitor" icon in the dashboard.
I can't parse that, because it's output from the scripts which might change with updates and it's a lot.
What I want is specific outputs every time the script will perform some blocking / long action that might take a significant amount of time. A very short text that indicates which step is about to be performed so I can show that in the main screen of the dashboard. E.g.
||info||Opening ssh connection
||info||Waking up station
||info||Launching app
||info||Downloading new image
etc.
Can someone please post the current output of one of the commands, e.g. for changing an app? This can serve as a base for discussion. Because right now, @malex984 says proper progress reporting is already in place and @elondaits says it is not sufficient/inappropriate.
@porst17 actually i have suggested to extend the existing progress reporting with more data for Dashboard to show. See the gist for console outputs.
i also suppose that Dashboard should be able to filter out the specially formatted INFO/DEBUG/WARNING/ERROR messages and show them according to the level of verbosity required by user
I suppose, @malex984 and @elondaits should collect a list of useful progress messages for each hilbert
command here. @malex984 knows best what happens in the background and @elondaits has an idea of what he expects.
I don't see a problem if the progress messages are mixed with other INFO
messages as long as there is only one INFO
per long running subtask. So for example,
INFO long running subtask started
INFO Configuration file OK
then 30s nothing and then
INFO long running subtask completed
is problematic because the dashboard would show Configuration file OK
most of the time.
Additional progress messages for subtasks could also be prefixed with the task name:
INFO SUBTASK started
INFO SUBTASK 10%
INFO SUBTASK 20%
INFO SUBTASK 30%
...
INFO SUBTASK doing this
INFO SUBTASK doing that
INFO SUBTASK 100%
INFO SUBTASK completed
What I want is not the log / hilbert cli output. The dashboard user's concerns are different... I can't tell them Input file is a correct YAML!
because the dashboard user doesn't know what input file that is, or what YAML means, or why this is relevant.
Also, I don't want 100 messages cycling at breakneck speed. Just one message before any "significant" operation that might block. The browser's status bar is a good example because it's clearly not the browser's debugging output.
The hilbert cli output would be the mechanism to send these milestones to the dashboard, but the information and needs are different.
Now, this issue is nice but there are more pressing matters... if we have to discuss it for 3-5 days then I'd rather just close it. We can't take this long for every single thing and I still have bug / problem reports that haven't been answered.
I just do not want to introduce another set of analogous messages just for Dashboard users. It is clear that the content of messages can be easily adjusted to be also suitable for the Dashboard.
I am going to reserve INFO output level essentially for rough progress indication and probably move down to DEBUG-level those of existing INFO-messages, which would make no sense for Dashboard users...
How long does the execution of the cli commands take? @malex984 told me that long blocking operations should not happen during normal operation. Only after a config change happened and deployment is performed again on the stations. So my question is: Is this really an issue of the cli or of the monitoring system not providing the infos @elondaits needs?
I also figured out that it is possible to push check results to OMD (via livestatus from the stations to the server). I suppose this can be very useful to set certain "flags" in OMD ("currently switching app" etc). This can be discussed separately.
I don't know because I don't know how representative the "test environment" is of the actual production environment with the actual servers and actual apps over an actual network. Changing an app requires first killing the one running and starting another container, so it probably is far from immediate.
Killing and starting docker containers should be quite fast (milliseconds up to very few seconds). We would need @malex984 to confirm that, but I think the station script just starts the new container in the background and then exits. So this should be very fast after all.
I checked the code. Looks like docker-compose
starts everything in the background via -d
(detached mode). It is not the same as starting the whole docker-compose
process in background (with &
). I suppose it is waiting until all containers are spun up. But it is probably still quite fast, because it does not wait until the actual application is ready ... something docker-compose
can't know.
@porst17 @elondaits Starting a new container is quite fast (with previously pulled image and prepared cached plain docker-compose file, which all happens only during Hilbert stack initialization). Unfortunately the stopping of container is not immediate and involve short delays... there is a system variable hibert_station_process_kill_timeout
(default: 8s), which controls those delays. Please check https://github.com/hilbert/hilbert-cli/blob/devel/tools/hilbert-station#L1581 for details.
My feeling that a single container should be stopped/killed/removed in approx. 2x(hibert_station_process_kill_timeout+2) seconds.
@porst17 Should i add a timer to measure low-level operations precisely?
Regarding the slow shutdown of containers see hilbert/hilbert-docker-images#61
Goals:
Possible improvements:
Hilbert could indicate via CheckMK if any script is still running in the station (e.g. docker is still pulling an image, the stop script is still running, etc.). Let's have a documented format for this output which indicates:
Hilbert CLI could output "markers" with a predefined format that the UI can use to give feedback to the user in long running operations. E.g.
||info||Opening ssh connection
The
||
would be a control sequence that at the start of an output line indicates to the UI that it's a marker, then a notification type (info, warning, error), and finally a message that can be shown to the user. That way if the UI is stuck in the "Opening ssh connection" the user could figure out there's a network issue, a DNS issue, etc. even approximately.