When we hit a failsafe on the number of hosts to modify, we usually have to manually inspect the changes ZAC wants to make* in order to approve the changes. Afterwards, we have to temporarily change the config to allow for a greater number of hosts than usual to be modified, restart the application, let it run, then change the config back and restart it again.
This is cumbersome and far from ideal. We should have some way to signal to the application that we approve the changes. After the application has made the changes, the approval (however we give it) should be voided, so that we have to re-approve the next time the failsafe is hit.
How to approve
There are many ways to implement one-time approval. Here are some suggestions:
Intrinsic Solutions
The following changes don't require changes to the architecture or associated services of the application. They are purely source code-driven changes.
File-based
The simplest solution: write the approval to a file that the application reads.
The file is deleted after the approval has been parsed.
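A minimal sketch of how this could look, assuming that the mere presence of the file counts as approval. The function names and the idea of passing the failsafe limit as an argument are hypothetical and not taken from the ZAC codebase.

```python
from pathlib import Path


def consume_approval(approval_file: Path) -> bool:
    """Return True exactly once per approval file, deleting the file as it is consumed."""
    try:
        # unlink() raises FileNotFoundError when no approval has been given.
        approval_file.unlink()
    except FileNotFoundError:
        return False
    return True


def may_apply_changes(n_changes: int, failsafe: int, approval_file: Path) -> bool:
    """Decide whether the host updater may apply n_changes modifications."""
    if n_changes <= failsafe:
        # Under the limit: no approval needed, and an existing file is left untouched.
        return True
    # Over the limit: proceed only if an operator has left an approval file.
    return consume_approval(approval_file)
```

Note that this sketch deletes the file before the changes are made rather than after. If the approval should survive a failed run, the deletion would have to move to after the update has completed.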
Signals
Send a signal to the application (SIGUSR1, SIGUSR2, SIGHUP, etc.)
Non-trivial because we need to either find the PID of the specific ZabbixHostUpdater process or send the signal to the main process, after which the main process has to communicate the approval to the host updater process.
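A minimal sketch of the first variant, where the signal is sent directly to the host updater process. The class name `SignalApproval` and the choice of SIGUSR1 are assumptions made for illustration. Finding the right PID is still left to the operator.

```python
import signal


class SignalApproval:
    """One-time approval granted by sending a signal to this process."""

    def __init__(self, signum: int = signal.SIGUSR1) -> None:
        self._approved = False
        # signal.signal() must be called from the main thread of the
        # process that is going to receive the signal.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Keep the handler trivial: only record that approval was given.
        self._approved = True

    def consume(self) -> bool:
        """Return True once per received approval, then reset the flag."""
        approved, self._approved = self._approved, False
        return approved


# From a shell (the PID is that of the host updater process):
#   kill -USR1 <pid>
```

If the signal is sent to the main process instead, the flag would have to be forwarded to the host updater process, for example through a `multiprocessing.Event` passed to it at startup.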
Environment variable
Very easy to implement the approval, but harder to implement automatic retraction of the approval.
Can't rely on modifying environment variables at runtime.
Would need some sort of datetime in the envvar value
Format: YYYYMMDD-HHMM
Where ZAC_FAILSAFE_APPROVE=20240110-1450 would be valid from 14:50 to 14:55 on 2024-01-10
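A minimal sketch of how such a value could be validated. The 5-minute window is inferred from the 14:50 to 14:55 example. Treating the end of the window as exclusive and comparing against naive local time are assumptions, and the function names are hypothetical.

```python
import os
from datetime import datetime, timedelta
from typing import Optional

ENV_VAR = "ZAC_FAILSAFE_APPROVE"
TIMESTAMP_FORMAT = "%Y%m%d-%H%M"  # YYYYMMDD-HHMM
VALIDITY = timedelta(minutes=5)  # assumed from the 14:50 to 14:55 example


def approval_is_valid(value: Optional[str], now: datetime) -> bool:
    """Return True if value is a well-formed timestamp whose window contains now."""
    if not value:
        return False
    try:
        start = datetime.strptime(value, TIMESTAMP_FORMAT)
    except ValueError:
        # Malformed values never count as approval.
        return False
    # Half-open window: valid from start up to, but not including, start + 5 min.
    return start <= now < start + VALIDITY


def approved_by_env(now: Optional[datetime] = None) -> bool:
    """Check the environment variable against the current (local) time."""
    return approval_is_valid(os.environ.get(ENV_VAR), now or datetime.now())
```

Because the value expires on its own, a stale variable left in the environment cannot approve a later failsafe hit, which sidesteps the need to unset it at runtime.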
Extrinsic (add-on) solutions
These changes require larger changes to the codebase itself via new dependencies or via new services attached to it.
Could be good for similar issues in the future, but overkill for this issue alone.
REST API (FastAPI, Flask, etc.)
Implement endpoint for approving the current failsafe limit
Could also implement other introspection endpoints.
Get list of hosts waiting to be approved.
Filter by add, modify, remove.
And other resources...
Requires new dependencies.
Requires some way for the main process to communicate the approval to the host updater process.
Message Queue (RabbitMQ, Kafka, etc.)
Main process can consume messages from queue.
* Inspecting the changes it wants to make is not easy, and often requires either print debugging or attaching a debugger. Lack of introspection is a huge issue.