unioslo / zabbix-auto-config

MIT License
3 stars 6 forks source link

Should crash if configured collectors are missing #57

Open paalbra opened 1 year ago

paalbra commented 1 year ago

If you have a configured collector, but the collector is missing, zac will just continue:

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/__init__.py#L29-L33

This might lead to unexpected behavior and bad results. It's probably better to raise a ZACException and crash? Thoughts?

pederhan commented 1 year ago

We don't have the manpower to babysit bad source collectors and host modifiers, so failing straight up would not be ideal for us. If we are provided with a bad source collector, it shouldn't interfere with the good ones.

We discussed this issue today, and want to implement a system wherein source collectors (and modifiers?) are disabled after failing X times in Y seconds. This allows us to integrate potentially unstable source collectors without jeopardizing the overall stability of ZAC.

Some source collectors might be more unstable than others; would it make sense for an overall failure tolerance setting, which can be optionally overridden by individual source collectors? Something like this for global configuration:

[zac.source_collectors]
failure_tolerance = 5
failure_interval = 300

And this for individual source collector configuration:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
failure_tolerance = 10
failure_interval = 180

Alternatively:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60

[source_collectors.mysource.failure]
tolerance = 10
interval = 180

These are the names suggested by ChatGPT (I struggled to come up with names myself):

  1. Alternative names for X (threshold for number of tolerated failures):

    • failure_tolerance
    • failure_threshold
    • failure_limit
    • failure_count
    • failure_maximum
  2. Alternative names for Y (duration in which failures are measured):

    • failure_duration
    • failure_window
    • failure_measurement_period
    • failure_timeframe
    • failure_interval

Finally, to support your proposed behavior there could be a configuration option that bypasses this error tolerance strategy entirely (struggling to come up with a good name for this):

[zac.source_collectors]
exit_on_failure = true # immediately exit upon encountering a failing source collector
paalbra commented 1 year ago

Now we're not talking about the same issue. Missing collector != failing collector (or failing modifier).

Missing collector, this issue

A missing collector means that all hosts collected by the collector should be deleted from zabbix (might not really be true as long as this issue persists #11 ). This is quite severe. I think ZAC should crash rather than continue, as stated in this issue (even though the failsafe probably will trigger. There is no need to rely on the safety net when you know you're failing).

Failing collector

This isn't a problem? There is nothing to fix?

It will raise an exception:

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/processing.py#L106

Log and go into a bad state (which can be monitored):

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/processing.py#L54-L59

This means that the current collect run is discarded and ZAC will retry with the normal interval. If the next run is ok the state will also become ok.

(I'm pretty sure this happens from now and then at UiO, with e.g. Nivlheim being unavailable, if I remember correct?)

Failing modifier

Then a host is just not modified:

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/processing.py#L309-L311

This could lead to a weird state/unknown problems (very dependent on the modifier, obviously). You can tell I was wondering what to do from the comment. I now think that the current host that is being modified should probably be discarded (kind of like a failing collector), but this is really another issue than what I'm mentioning here.

pederhan commented 1 year ago

Ok, my bad, I can create a separate issue. But don't you think there should consistent behavior both in terms of loading and running external modules?

Loading a modifier:

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/processing.py#L236-L243

Loading a collector:

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/__init__.py#L28-L33

In the case of collectors, we at least check if the import was successful or not. Even if we're fairly sure a module exists when loading it (in the case of host modifiers), it might contain invalid syntax or otherwise be broken. So some sort of unified interface for loading modules would be nice:

import importlib
from types import ModuleType

def load_module(name: str) -> ModuleType:
    try:
        return importlib.import_module(name)
    except Exception as e:
        if isinstance(e, ModuleNotFoundError):
            msg = f"Unable to find module named {name!r}"
        else:
            msg = f"Unable to import module named {name!r}: {str(e)}"
        logger.error(msg)
        raise ZACException(msg) from e

And then based on the config, callers of this function could decide whether or not a ZACException should be fatal.

paalbra commented 1 year ago

I already created #58 .

I think collectors and modifiers currently behave differently and probably can't really be looked at like some general external module.

One is explicitly stated in the config. The other is "just there".

I'm happy to hear suggestions, but I still think this issue is valid: zac shouldn't start if collectors (explicitly stated in the config) are missing.

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

pederhan commented 1 year ago

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

Not by itself no, but callers of this function would know that they only need to handle ZACException, which is more predictable than all the possible error states from importlib.import_module. Furthermore, it would deduplicate the import code used for collectors and modified.

So instead of

https://github.com/unioslo/zabbix-auto-config/blob/2b45f1cb7da0d46b8b218005ebbf751cb17f8793/zabbix_auto_config/__init__.py#L23-L33

we would do:

def get_source_collectors(config: models.Settings) -> List[SourceCollectorDict]:
    source_collector_dir = config.zac.source_collector_dir
    sys.path.append(source_collector_dir)

    source_collectors = []  # type: List[SourceCollectorDict]
    for (source_collector_name, source_collector_values) in config.source_collectors.items():
        try:
            load_module(source_collector_values.module_name)
        except ZACException as e:
            if config.zac.source_collectors.ignore_missing:
                logging.warning(
                    "Source collector '%s' import failed: '%s'",
                    source_collector_values.module_name,
                    str(e),
                )
                continue
            raise
       # exceptions not handled by `load_module` will always fail
    # ...

So yes, it "only" wraps the error, and we could do a bare try...except Exception:, but this way we always log when an import fails in load_module, but the caller decides how to handle the missing import. The same load_module function should of course be re-used when importing host modifiers.

Anyway, I don't feel too strongly about this, but I think fundamentally importing modifiers and collectors is the same thing; we are doing dynamic imports on runtime, and there should be predictable behavior shared between the two.

paalbra commented 1 year ago

I have no strong opinion when it comes to wrapping in a load_module. It could be useful.

My only strong opinion in this issue is that zac should fail rather than continue when a collector is missing. It's a severe error.

There is a big difference here between collectors and modifiers since only collectors can be missing. Zac doesn't know what modifiers to expect. Maybe this should change in the future. Maybe zac should have configured a list of enabled modifiers. I think that's not a bad idea.