unioslo / zabbix-auto-config

Expand and refactor state tracking #70

Closed pederhan closed 10 months ago

pederhan commented 10 months ago

This PR expands the state tracking functionality of the application by adding a State class. Instances are shared between processes through a multiprocessing SyncManager, which is obtained from the utility function get_manager() in the new state module.

The State class keeps the ok attribute of the existing state dict, but adds new error-related attributes that can be used to gather more granular metrics and make diagnostics easier:

https://github.com/unioslo/zabbix-auto-config/blob/4e5987f80d529bdd405c686dc1dd12fae784aae4/zabbix_auto_config/state.py#L9-L43

The set_error() method processes the error and sets the appropriate state attributes based on the contents of the error. Each error increments a new error counter (error_count).

The set_ok() method sets the current state to OK and clears any error messages from the state, while keeping the error count intact. This maintains the total number of errors per process in the health file even when the process has returned to an OK state.
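The behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the implementation linked above: it uses a standard-library dataclass instead of the Pydantic-backed one, the field names are taken from the health file keys, and it assumes that set_ok() also clears error_type and error_time.

```python
import dataclasses
import time
from typing import Any, Dict, Optional


@dataclasses.dataclass
class State:
    """Illustrative stand-in for the State class described in this PR."""

    ok: bool = True
    error: Optional[str] = None
    error_type: Optional[str] = None
    error_count: int = 0
    error_time: Optional[float] = None

    def asdict(self) -> Dict[str, Any]:
        """Return all fields as a dict; every key is always present."""
        return dataclasses.asdict(self)

    def set_ok(self) -> None:
        """Mark the process as OK and clear the error details."""
        self.ok = True
        self.error = None
        # Assumption: the type and timestamp are cleared together with the message.
        self.error_type = None
        self.error_time = None
        # error_count is deliberately left untouched.

    def set_error(self, exc: Exception) -> None:
        """Record an error and increment the error counter."""
        self.ok = False
        self.error = str(exc)
        self.error_type = type(exc).__name__
        self.error_time = time.time()
        self.error_count += 1
```

With this sketch, calling set_error() and then set_ok() leaves ok as True and error as None while error_count stays at 1, which matches the behavior described for the health file.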

Expanded health file process entries

Each process in the health file has received new keys corresponding to the fields of the new State class.

A healthy process looks like the following:

{
  "name": "zabbix-template-updater",
  "pid": 3368,
  "alive": true,
  "ok": true,
  "error": null,
  "error_type": null,
  "error_count": 0,
  "error_time": null
}

An unhealthy process looks like this:

{
  "name": "faultysource",
  "pid": 3355,
  "alive": true,
  "ok": false,
  "error": "Failed to collect from source 'faultysource': Source collector error!!",
  "error_type": "ZACException",
  "error_count": 1,
  "error_time": 1700580801.918863
}

Motivation

This PR enables more granular health information for each process in the health file. We moved from a dict to a concrete class so that we could add the utility methods set_ok() and set_error(). Adding the same functionality to a dict would have required helper functions that manipulate dict keys and make numerous dict.setdefault() and isinstance() calls to come even close to guaranteeing any sort of safety with regard to types and arbitrary key access. It was simpler to rewrite the state as a Pydantic-backed dataclass.

With this new class, we can simply call State.asdict() to get a dict representation of the State instance with all attributes guaranteed to be present as dict keys when we dump the state to the health file. This makes parsing the file easier, as we don't have to check for the existence of keys before we access them.
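As a rough illustration of what this means for a consumer of the health file, the sketch below reads the two process entries shown earlier and accesses their keys directly. The top-level layout (a bare JSON list of process entries) and the helper names unhealthy_processes() and total_errors() are assumptions made for this example; the real health file may nest the entries differently.

```python
import json
from typing import Any, Dict, List

# The two process entries from this PR, wrapped in a list.
# Assumption: the list wrapper is hypothetical, not the actual health file layout.
HEALTH_JSON = """
[
  {"name": "zabbix-template-updater", "pid": 3368, "alive": true, "ok": true,
   "error": null, "error_type": null, "error_count": 0, "error_time": null},
  {"name": "faultysource", "pid": 3355, "alive": true, "ok": false,
   "error": "Failed to collect from source 'faultysource': Source collector error!!",
   "error_type": "ZACException", "error_count": 1, "error_time": 1700580801.918863}
]
"""


def unhealthy_processes(processes: List[Dict[str, Any]]) -> List[str]:
    """Return the names of processes whose state is not OK."""
    # No .get() or key checks needed: every state key is always present.
    return [p["name"] for p in processes if not p["ok"]]


def total_errors(processes: List[Dict[str, Any]]) -> int:
    """Sum the error counters of all processes."""
    return sum(p["error_count"] for p in processes)


processes = json.loads(HEALTH_JSON)
```

Because every entry carries the same keys, both helpers index the dicts directly instead of guarding each lookup.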

Testing

Since the State class is synced between processes via the manager, we can inspect each process's state directly and therefore test subprocesses more accurately. Information such as the number of errors and the exception types is very difficult to test with the existing state tracking, but the new state attributes in this PR make it easy to access.