Alert logging is too verbose

mercedes-benz / sechub

SecHub provides a central API to test software with different security tools.

MIT License

Situation

The current monitoring alert logging is too verbose. Example: when the system logs "CPU OVERLOAD", the message is repeated every second for a minute, which results in a bloated log that is difficult to read.

Wanted

Implement a mechanism to ensure that logs are generated with a time delay, reducing the frequency of log entries. Ensure that the first log entry is always generated immediately, but subsequent entries should be spaced out over time.

Solution

This change aims to make the logs more manageable and less overwhelming, while still providing necessary information.

Action Items:

Implement time-delayed logging for monitoring alerts.
Ensure the first log entry is generated immediately.
Test the new logging mechanism to confirm it reduces verbosity without losing critical information.
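
The action items above could be sketched roughly as follows. This is only a minimal illustration, not existing SecHub code: the class name `ThrottledAlertLogger`, the `alert`/`reset` methods, and the injected clock and sink are all assumptions made for the example. The first alert for a key is logged immediately, later ones are suppressed until the interval has elapsed.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of SecHub): first alert passes, then at most one per interval. */
class ThrottledAlertLogger {
    private final long intervalNanos;
    private final LongSupplier nanoClock;   // injectable clock, e.g. System::nanoTime
    private final Consumer<String> sink;    // where accepted messages go, e.g. a logger's warn method
    private final Map<String, Long> lastLogged = new ConcurrentHashMap<>();

    ThrottledAlertLogger(Duration interval, LongSupplier nanoClock, Consumer<String> sink) {
        this.intervalNanos = interval.toNanos();
        this.nanoClock = nanoClock;
        this.sink = sink;
    }

    /** Logs the message if the key was never logged or the interval has elapsed; returns true if logged. */
    boolean alert(String key, String message) {
        long now = nanoClock.getAsLong();
        boolean[] logged = {false};
        // compute() runs atomically per key, so concurrent callers cannot both pass the check
        lastLogged.compute(key, (k, last) -> {
            if (last == null || now - last >= intervalNanos) {
                logged[0] = true;
                return now;
            }
            return last;
        });
        if (logged[0]) {
            sink.accept(message);
        }
        return logged[0];
    }

    /** Forgets the key, so the next alert for it is logged immediately again. */
    void reset(String key) {
        lastLogged.remove(key);
    }
}
```

With `System::nanoTime` as the clock and an interval of, say, ten seconds, a "CPU OVERLOAD" alert raised every second would appear about six times per minute instead of sixty. The interval itself is a placeholder and would need to be chosen by the maintainers.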

This is imo a perfect use case for a watchdog implementation.

A watchdog generally refers to a mechanism that monitors the health and performance of components, services or systems (like the CPU) and takes action if certain thresholds are exceeded.

Implement a watchdog that periodically monitors the system's CPU.
Depending on the CPU usage, set an atomic boolean to either true or false (something like isSystemHealthy or isJobProcessable).
Alert on every state change of the system inside the watchdog. Something like: "System is under heavy load", "90% CPU usage", "Job processing halted", "System is healthy again", "System ready to process jobs".
While the overload persists, log "System CPU is still overloaded" every X seconds. X should be a meaningful delay. Ideally @haerter-tss or @sven-dmlr decide the value.
In any job scheduler implementation, evaluate the atomic boolean. Skip the job processing if it is false and print a debug log like "Job processing is skipped".

This will also further decouple the actual job processing logic from the CPU monitoring, making everything more modular and easier to test.
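
A rough sketch of the watchdog idea described above, under several assumptions: the class name `CpuWatchdog`, the `check` method, the threshold, and the exact log messages are illustrative and not taken from SecHub. The CPU reading is injected as a supplier returning a value between 0.0 and 1.0, and the reminder delay X is expressed as a number of check cycles.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

/** Hypothetical watchdog sketch (not SecHub code): logs state changes and periodic reminders only. */
class CpuWatchdog {
    private final DoubleSupplier cpuLoad;     // assumed to return the load in the range 0.0 to 1.0
    private final double maxLoad;             // threshold above which the system counts as overloaded
    private final int remindEveryChecks;      // the delay "X", expressed in check cycles
    private final Consumer<String> log;
    private final AtomicBoolean jobProcessable = new AtomicBoolean(true);
    private int overloadedChecks;             // only touched by the single watchdog thread

    CpuWatchdog(DoubleSupplier cpuLoad, double maxLoad, int remindEveryChecks, Consumer<String> log) {
        this.cpuLoad = cpuLoad;
        this.maxLoad = maxLoad;
        this.remindEveryChecks = remindEveryChecks;
        this.log = log;
    }

    /** One monitoring cycle; intended to be called periodically from a single scheduler thread. */
    void check() {
        double load = cpuLoad.getAsDouble();
        boolean healthy = load <= maxLoad;
        boolean wasHealthy = jobProcessable.getAndSet(healthy);
        if (healthy) {
            overloadedChecks = 0;
            if (!wasHealthy) {
                log.accept("System is healthy again, ready to process jobs");
            }
        } else if (wasHealthy) {
            // state change: healthy -> overloaded, always logged immediately
            overloadedChecks = 0;
            log.accept("System is under heavy load (" + Math.round(load * 100) + "% CPU), job processing halted");
        } else if (++overloadedChecks % remindEveryChecks == 0) {
            // still overloaded: remind only every remindEveryChecks cycles
            log.accept("System CPU is still overloaded");
        }
    }

    /** Read by job schedulers; they skip processing while this returns false. */
    boolean isJobProcessable() {
        return jobProcessable.get();
    }
}
```

In such a setup, `check` could be driven by a `ScheduledExecutorService` via `scheduleAtFixedRate`, and a job scheduler would only call `isJobProcessable()` and emit its "Job processing is skipped" debug log when the result is false. The scheduler then needs no knowledge of how the CPU is measured.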

Further Questions:

Is there any mechanism to handle a system that stays under load for a very long time (e.g. because of infinite loops, busy waiting, etc.)?
