Open emoscardini opened 1 month ago
Interesting. As a user, I would rather it panic and not start, so that I would be aware of the situation before it ran for "however long" and then I discovered a problem.
I'm curious why you'd rather have it operating in a mode where it's degraded. Should there be events emitted indicating a problem?
I also tend to think I'd rather it panic, rather than not do what I configured it to do.
Not sure how you're going to emit an event if the event handler can't output...
I'm suggesting we fix the panic and log that the process doesn't have access to write to the specified event handler file. It can log an ERROR every time it tries to write.
IMHO, the process shouldn't ever panic. Even switching to exiting the process gracefully after it logs that it can't write would be better.
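A minimal Go sketch of the "log and continue" behavior proposed above. `openEventSink` and the file path are hypothetical names used for illustration, not the controller's actual code. If the file can't be opened, the helper returns a no-op writer together with the error, so the caller can log an ERROR and carry on with startup.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// openEventSink is a hypothetical helper, not the controller's real API.
// It tries to open path for appending. On failure it returns io.Discard
// together with the error, so the caller can log it and keep running.
func openEventSink(path string) (io.Writer, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return io.Discard, fmt.Errorf("event file %q is not writable: %w", path, err)
	}
	return f, nil
}

func main() {
	// The path is made up; the missing directory forces the error branch.
	sink, err := openEventSink("/nonexistent-dir/events.json")
	if err != nil {
		log.Printf("ERROR: %v; events for this handler will be dropped", err)
	}
	// Writes to io.Discard always succeed, so startup proceeds either way.
	fmt.Fprintln(sink, `{"event":"example"}`)
}
```

The trade-off raised elsewhere in this thread still applies: events are silently dropped for as long as the file stays unwritable, so the ERROR log is the only signal.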
Seems to me that it would be more important to have a network stay up than to panic because it can't write to an event log. As Edward mentioned, we log that we can't access the file in journald.
It only panics at startup, not during normal execution. It's panicking because you've specified an invalid configuration... you're asking for the controller to consider some parts of the configuration as "optional" or "best effort"... but for a lot of configurations, not having functional events for any period of time (zrok) is a broken state, even if the rest of the network is up.
Maybe a compromise might be to mark that event handler as optional: true or abort-on-error: startup, abort-on-error: anytime... or some other way to mark that event handler as non-critical.
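To make the compromise concrete, here is a rough Go sketch of how a per-handler abort-on-error setting could be interpreted. Everything in it is an assumption for illustration: the type and function names are hypothetical, the "never" value is invented to stand in for optional: true, and treating an unset value as "startup" is only a guess at preserving the current panic-at-startup behavior.

```go
package main

import "fmt"

// abortPolicy mirrors the abort-on-error values floated in this thread.
type abortPolicy string

const (
	abortNever   abortPolicy = "never"   // assumption: stands in for optional: true
	abortStartup abortPolicy = "startup" // abort only if the handler fails at startup
	abortAnytime abortPolicy = "anytime" // abort whenever the handler fails
)

// parseAbortPolicy maps a config string to a policy. An empty value falls
// back to "startup", which is an assumption meant to keep today's behavior.
func parseAbortPolicy(s string) (abortPolicy, error) {
	switch abortPolicy(s) {
	case "":
		return abortStartup, nil
	case abortNever, abortStartup, abortAnytime:
		return abortPolicy(s), nil
	default:
		return "", fmt.Errorf("unknown abort-on-error value %q", s)
	}
}

// shouldAbort reports whether a handler error should stop the controller.
func shouldAbort(p abortPolicy, duringStartup bool) bool {
	switch p {
	case abortAnytime:
		return true
	case abortStartup:
		return duringStartup
	default:
		return false
	}
}

func main() {
	for _, raw := range []string{"", "never", "startup", "anytime"} {
		p, err := parseAbortPolicy(raw)
		if err != nil {
			fmt.Println("config error:", err)
			continue
		}
		fmt.Printf("%q: abort at startup=%v, abort at runtime=%v\n",
			raw, shouldAbort(p, true), shouldAbort(p, false))
	}
}
```

With something like this, a deployment that treats missing events as a broken state keeps the strict setting, while one that cares more about the network staying up can opt out per handler.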
If the controller crashes, it could possibly not close the file properly, which could result in the file becoming "stale locked", so when it attempts to open the file at next start it can't. Since this is an automated restart, no one is going to be there to immediately see that the file is locked, and it will just continue to cycle.
If the controller doesn't have permissions to write to a file configured in the event handler, the following panic happens:
The process should log that it's unable to access the file & continue.