Open paalbra opened 1 year ago
We don't have the manpower to babysit bad source collectors and host modifiers, so failing straight up would not be ideal for us. If we are provided with a bad source collector, it shouldn't interfere with the good ones.
We discussed this issue today, and want to implement a system wherein source collectors (and modifiers?) are disabled after failing X times in Y seconds. This allows us to integrate potentially unstable source collectors without jeopardizing the overall stability of ZAC.
Some source collectors might be more unstable than others; would it make sense for an overall failure tolerance setting, which can be optionally overridden by individual source collectors? Something like this for global configuration:
[zac.source_collectors]
failure_tolerance = 5
failure_interval = 300
And this for individual source collector configuration:
[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
failure_tolerance = 10
failure_interval = 180
Alternatively:
[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
[source_collectors.mysource.failure]
tolerance = 10
interval = 180
These are the names suggested by ChatGPT (I struggled to come up with names myself):
Alternative names for X (threshold for number of tolerated failures):
Alternative names for Y (duration in which failures are measured):
Finally, to support your proposed behavior there could be a configuration option that bypasses this error tolerance strategy entirely (struggling to come up with a good name for this):
[zac.source_collectors]
exit_on_failure = true # immediately exit upon encountering a failing source collector
Now we're not talking about the same issue. Missing collector != failing collector (or failing modifier).
A missing collector means that all hosts collected by the collector should be deleted from zabbix (might not really be true as long as this issue persists #11 ). This is quite severe. I think ZAC should crash rather than continue, as stated in this issue (even though the failsafe probably will trigger. There is no need to rely on the safety net when you know you're failing).
This isn't a problem? There is nothing to fix?
It will raise an exception:
Log and go into a bad state (which can be monitored):
This means that the current collect run is discarded and ZAC will retry with the normal interval. If the next run is ok the state will also become ok.
(I'm pretty sure this happens from now and then at UiO, with e.g. Nivlheim being unavailable, if I remember correct?)
Then a host is just not modified:
This could lead to a weird state/unknown problems (very dependent on the modifier, obviously). You can tell I was wondering what to do from the comment. I now think that the current host that is being modified should probably be discarded (kind of like a failing collector), but this is really another issue than what I'm mentioning here.
Ok, my bad, I can create a separate issue. But don't you think there should consistent behavior both in terms of loading and running external modules?
Loading a modifier:
Loading a collector:
In the case of collectors, we at least check if the import was successful or not. Even if we're fairly sure a module exists when loading it (in the case of host modifiers), it might contain invalid syntax or otherwise be broken. So some sort of unified interface for loading modules would be nice:
import importlib
from types import ModuleType
def load_module(name: str) -> ModuleType:
try:
return importlib.import_module(name)
except Exception as e:
if isinstance(e, ModuleNotFoundError):
msg = f"Unable to find module named {name!r}"
else:
msg = f"Unable to import module named {name!r}: {str(e)}"
logger.error(msg)
raise ZACException(msg) from e
And then based on the config, callers of this function could decide whether or not a ZACException
should be fatal.
I already created #58 .
I think collectors and modifiers currently behave differently and probably can't really be looked at like some general external module.
One is explicitly stated in the config. The other is "just there".
I'm happy to hear suggestions, but I still think this issue is valid: zac shouldn't start if collectors (explicitly stated in the config) are missing.
Isn't that last code bit just kind of wrapping the exception? Does it really add anything?
Isn't that last code bit just kind of wrapping the exception? Does it really add anything?
Not by itself no, but callers of this function would know that they only need to handle ZACException, which is more predictable than all the possible error states from importlib.import_module
. Furthermore, it would deduplicate the import code used for collectors and modified.
So instead of
we would do:
def get_source_collectors(config: models.Settings) -> List[SourceCollectorDict]:
source_collector_dir = config.zac.source_collector_dir
sys.path.append(source_collector_dir)
source_collectors = [] # type: List[SourceCollectorDict]
for (source_collector_name, source_collector_values) in config.source_collectors.items():
try:
load_module(source_collector_values.module_name)
except ZACException as e:
if config.zac.source_collectors.ignore_missing:
logging.warning(
"Source collector '%s' import failed: '%s'",
source_collector_values.module_name,
str(e),
)
continue
raise
# exceptions not handled by `load_module` will always fail
# ...
So yes, it "only" wraps the error, and we could do a bare try...except Exception:
, but this way we always log when an import fails in load_module
, but the caller decides how to handle the missing import. The same load_module
function should of course be re-used when importing host modifiers.
Anyway, I don't feel too strongly about this, but I think fundamentally importing modifiers and collectors is the same thing; we are doing dynamic imports on runtime, and there should be predictable behavior shared between the two.
I have no strong opinion when it comes to wrapping in a load_module
. It could be useful.
My only strong opinion in this issue is that zac should fail rather than continue when a collector is missing. It's a severe error.
There is a big difference here between collectors and modifiers since only collectors can be missing. Zac doesn't know what modifiers to expect. Maybe this should change in the future. Maybe zac should have configured a list of enabled modifiers. I think that's not a bad idea.
If you have a configured collector, but the collector is missing, zac will just continue:
https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/__init__.py#L29-L33
This might lead to unexpected behavior and bad results. It's probably better to raise a
ZACException
and crash? Thoughts?