openziti / ziti

The parent project for OpenZiti. Here you will find the executables for a fully zero trust, application embedded, programmable network @OpenZiti
https://openziti.io
Apache License 2.0
2.67k stars 153 forks source link

Panic if unable to write to event handler files #2314

Open emoscardini opened 1 month ago

emoscardini commented 1 month ago

If the controller doesn't have permissions to write to a file configured in the event handler, the follow panic happens:

panic: cannot write to log file path: /var/log/ziti/utilization-events.log
goroutine 1 [running]:
github.com/openziti/ziti/controller.(*Controller).Run(0xc000c5dc20)
github.com/openziti/ziti/controller/controller.go:381 +0xcce
{0xc000c4a130, 0x1, 0x3ad5116?})
github.com/openziti/ziti/ziti/controller/run.go:95 +0xabf
{0xc000c4a100, 0x1, 0x1})
github.com/spf13/cobra@v1.8.1/command.go:989 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0x5b33880)
github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/openziti/ziti/ziti/cmd.Execute()
github.com/openziti/ziti/ziti/cmd/cmd.go:81 +0x1a
main.main()
github.com/openziti/ziti/ziti/main.go:51 +0xf

The process should log that it's unable to access the file & continue.

dovholuknf commented 1 month ago

Interesting. As a user, I would rather it panic and not start so that I would be aware of the situation before it ran for "however long" and then i discovered a problem.

I'm curious why you'd rather have it operating in a mode where it's degraded. Should there be events emitted indicating a problem?

michaelquigley commented 1 month ago

I also tend to think I'd rather it panic, rather than not do what I configured it to do.

emoscardini commented 1 month ago

Not sure how you're going to emit an event if the event handler can't output...

I'm suggesting we fix the panic & log the process doesn't have access to write to the events handler file specified. It can log an ERROR for every time it tries to write.

IMHO, the process shouldn't ever panic. Even switching to exiting the process gracefully after it logs that it can't write would be better.

r-caamano commented 1 month ago

Seems to me that it would be more important to have a network stay up than to panic that it cant write to an event log. As Edward mentioned we log that we can't access the file in journald.

michaelquigley commented 1 month ago

It only panics at startup, not during normal execution. It's panic-ing because you've specified an invalid configuration... you're asking for the controller to consider some parts of the configuration as "optional" or "best effort"... but for a lot of configurations, not having functional events for any period of time (zrok) is a broken state, even if the rest of the network is up.

michaelquigley commented 1 month ago

Maybe a compromise might be to mark that event handler as optional: true or abort-on-error: startup, abort-on-error: anytime... or some other way to mark that event handler as non-critical.

r-caamano commented 1 month ago

if the controller crashes it could possibly not close the file properly which could result in the file becoming "stale locked" so when it attempts to open the file at next start it can't. Since this is an automated restart no one is going to be there to immediately see that the file is locked an it will just continue to cycle.