Provide more state information to the dashboard

elondaits commented 6 years ago

Goals:

The dashboard might start an operation (e.g. start a station, change the top app) which can fail (the station does not have power, docker crashes halfway through, etc.). The dashboard will lock the station while the operation is running but at some point it should timeout the operation to give the chance to retry. It'd be unfortunate if the dashboard times out but the operation is still being performed correctly in the station.
Some operations take a long time and the user might wander if they're running correctly. The system should be more responsive by giving more feedback about long processes.

Possible improvements:

Hilbert could indicate via CheckMK if any script is still running in the station (e.g. docker is still pulling an image, the stop script is still running, etc.). Let's have a documented format for this output which indicates:
- Which app is running: Can be none (see issue https://github.com/hilbert/hilbert-docker-images/issues/22)
- If any hilbert script is running (e.g. if the station is "locked"). Ideally if it could indicate if the station is starting, stopping or changing apps... although I know the first two would be only possible in certain moments of that process.
- Or an error, if the information above could not be retrieved (also if a script failed?)
Hilbert CLI could output "markers" with a predefined format that the UI can use to give feedback to the user in long running operations. E.g.

||info||Opening ssh connection

The || would be a control sequence that at the start of an output line indicates to the UI that it's a marker, then a notification type (info, warning, error), and finally a message that can be shown to the user. That way if the UI is stuck in the "Opening ssh connection" the user could figure out there's a network issue, a DNS issue, etc. even approximately.

malex984 commented 6 years ago

I imagine it would be possible for OMD to check system logs for messages from hilbert-station. For that OMD would require a custom server-side check, which would SSH to specified station (will need access to ~/.ssh/ like Dashboard) and inspect the system logs... i will try to enable log-monitoring via the current omd_agent and see how that may work. Note that Dashboard will still need to force OMD to perform status refresh on specific station!

elondaits commented 6 years ago

Note that I list two different "possible improvements". One is related to the information that OMD gives me and the other does not involve OMD but instead the output from the hilbert cli process, which I already receive and could parse to show more info to the user.

malex984 commented 6 years ago

yes, it is already possible to monitor the progress of app_change and VERY briefly of poweroff calls. the logging output from hilbert-station to console looks as follows (but colored according to message type):

DEBUG   [hilbert-station:???] DEBUG_MESSAGE
INFO    [hilbert-station:cmd_???] INFO_MESSAGE
WARNING [hilbert-station:cmd_???] WARNING_MESSAGE
ERROR   [hilbert-station:cmd_???] ERROR_MESSAGE

The log-level is increased by adding-v switches to hilbert-station. Current default: only ERROR and WARNING are visible, -v adds INFO and -vv outputs all log messages.

There is a way to control the log-level of remotely called hilbert-station from server-side hilbert tool: same verbosity switches will be forwarded to hilbert-station. Note that logging output over SSH has no colors but special prefixes instead:

=== DEBUG   [hilbert-station:???] DEBUG_MESSAGE
    INFO    [hilbert-station:cmd_???] INFO_MESSAGE
!!! WARNING [hilbert-station:cmd_???] WARNING_MESSAGE
??? ERROR   [hilbert-station:cmd_???] ERROR_MESSAGE

@elondaits If existing logging is not exactly what you wanted - it is possible either to adjust the current logging or to add analogous logging specifically for auto-tracking the execution progress by the Dashboard.

malex984 commented 6 years ago

@porst17 BTW what about configuring all stations to immediately auto-send the logged messages to the server (if network if present)? That way we will easily be able to track the progress of startup / shutdown procedures...

maybe it is something to consider for the future.

malex984 commented 6 years ago

@porst17 @elondaits please note that i changed/extended my comment above (https://github.com/hilbert/hilbert-cli/issues/83#issuecomment-373185546)

elondaits commented 6 years ago

Regular logging messages are for system admins, who still have the log if they need it. What we need is markers at specific points (any points before a significant delay or after a big error) so we can give feedback to the user. Just like the status bar of a browser that tells you "Resolving host X..." and then "Connecting to X...", "Waiting for X...", etc. so you have an approximate idea of what is taking time and whether the operation is progressing or got stuck at some point.

malex984 commented 6 years ago

hilbert-station already sends custom messages to system logs. Thus it make any observer (e.g. OMD) able to track what exactly is it doing almost instantly.

Looks like you also want some output from hilbert server-side tool regarding its remote execution actions. That should also be visible if log-level is verbose enough.

elondaits commented 6 years ago

No.

I already receive output from the hilbert cli. It's the one I show when you click on the "monitor" icon in the dashboard.

I can't parse that, because it's output from the scripts which might change with updates and it's a lot.

What I want is specific outputs every time the script will perform some blocking / long action that might take a significant amount of time. A very short text that indicates which step is about to be performed so I can show that in the main screen of the dashboard. E.g.

||info||Opening ssh connection

||info||Waking up station

||info||Launching app

||info||Downloading new image

etc.

porst17 commented 6 years ago

Can someone please post the current output of one of the commands, e.g. for changing an app? This can serve as a base for discussion. Because right now, @malex984 says proper progress reporting is already in place and @elondaits says it is not sufficient/inappropriate.

malex984 commented 6 years ago

@porst17 actually i have suggested to extend the existing progress reporting with more data for Dashboard to show. See the gist for console outputs.

malex984 commented 6 years ago

i also suppose that Dashboard should be able to filter out the specially formatted INFO/DEBUG/WARNING/ERROR messages and show them according to the level of verbosity required by user

porst17 commented 6 years ago

I suppose, @malex984 and @elondaits should collect a list of useful progress messages for each hilbert command here. @malex984 knows best what happens in the background and @elondaits has an idea of what he expects.

I don't see a problem if the progress messages are mixed with other INFO messages as long as there is only one INFO per long running subtask. So for example,

INFO  long running subtask started
INFO  Configuration file OK

then 30s nothing and then

INFO  long running subtask completed

is problematic because the dashboard would show Configuration file OK most of the time.

Additional progress messages for subtasks could also be prefixed with the task name:

INFO  SUBTASK started
INFO  SUBTASK 10%
INFO  SUBTASK 20%
INFO  SUBTASK 30%
...
INFO  SUBTASK doing this
INFO  SUBTASK doing that
INFO SUBTASK 100%
INFO  SUBTASK completed

elondaits commented 6 years ago

What I want is not the log / hilbert cli output. The dashboard user's concerns are different... I can't tell them Input file is a correct YAML! because the dashboard user doesn't know what input file that is, or what YAML means, or why this is relevant.

Also, I don't want 100 messages cycling at breakneck speed. Just one message before any "significant" operation that might block. The browser's status bar is a good example because it's clearly not the browser's debugging output.

The hilbert cli output would be the mechanism to send these milestones to the dashboard, but the information and needs are different.

Now, this issue is nice but there are more pressing matters... if we have to discuss it for 3-5 days then I'd rather just close it. We can't take this long for every single thing and I still have bug / problem reports that haven't been answered.

malex984 commented 6 years ago

I just do not want to introduce another set of analogous messages just for Dashboard users. It is clear that the content of messages can be easily adjusted to be also suitable for the Dashboard.

I am going to reserve INFO output level essentially for rough progress indication and probably move down to DEBUG-level those of existing INFO-messages, which would make no sense for Dashboard users...

porst17 commented 6 years ago

How long does the execution of the cli commands take? @malex984 told me that long blocking operations should not happen during normal operation. Only after a config change happened and deployment is performed again on the stations. So my question is: Is this really an issue of the cli or of the monitoring system not providing the infos @elondaits needs?

I also figured out that it is possible to push check results to OMD (via livestatus from the stations to the server). I suppose this can be very useful to set certain "flags" in OMD ("currently switching app" etc). This can be discussed separately.

elondaits commented 6 years ago

I don't know because I don't know how representative the "test environment" is of the actual production environment with the actual servers and actual apps over an actual network. Changing an app requires first killing the one running and starting another container, so it probably is far from immediate.

porst17 commented 6 years ago

Killing and starting docker containers should be quite fast (milliseconds up to very few seconds). We would need @malex984 to confirm that, but I think the station script just starts the new container in the background and then exits. So this should be very fast after all.

porst17 commented 6 years ago

I checked the code. Looks like docker-compose starts everything in the background via -d (detached mode). It is not the same as starting the whole docker-compose process in background (with &). I suppose it is waiting until all containers are spun up. But it is probably still quite fast, because it does not wait until the actual application is ready ... something docker-compose can't know.

malex984 commented 6 years ago

@porst17 @elondaits Starting a new container is quite fast (with previously pulled image and prepared cached plain docker-compose file, which all happens only during Hilbert stack initialization). Unfortunately the stopping of container is not immediate and involve short delays... there is a system variable hibert_station_process_kill_timeout (default: 8s), which controls those delays. Please check https://github.com/hilbert/hilbert-cli/blob/devel/tools/hilbert-station#L1581 for details. My feeling that a single container should be stopped/killed/removed in approx. 2x(hibert_station_process_kill_timeout+2) seconds.

@porst17 Should i add a timer to measure low-level operations precisely?

porst17 commented 6 years ago

Regarding the slow shutdown of containers see hilbert/hilbert-docker-images#61

hilbert / hilbert-cli

Provide more state information to the dashboard #83