vmware / kube-fluentd-operator

Auto-configuration of Fluentd daemon-set based on Kubernetes metadata

Improve reloader loop pre-startup fluentd #276

Open Cryptophobia opened 3 years ago

Cryptophobia commented 3 years ago

Reported by @alex-vmw:

The reloader needs different logic for its first start than for the standard control loop once it is already running. Logically, you could set a firstStart=true variable and then set it to false once the first startup loop has finished.

a) time="2021-10-14T05:42:05Z" level=info msg="Sleeping for 30 seconds in order for fluentd to be ready." - Reloader should NOT wait for 30 seconds (it shouldn't wait at all) for fluentd to come up. During first boot, fluentd can NOT come up until the reloader has provided it with the fluentd.conf file, so the reloader should get to work immediately and NOT wait.

b) time="2021-10-14T05:46:41Z" level=error msg="cannot notify fluentd: Post \"http://127.0.0.1:24444/api/config.gracefulReload\": dial tcp 127.0.0.1:24444: connect: connection refused" - Reloader control loop should NOT try to issue a fluentd reload during first startup because it is NOT required and will always fail.

Update on 1b: While replacing the reloader image just now, I realized that if we remove the http://127.0.0.1:24444/api/config.gracefulReload call during the first boot, there will be a problem whenever the reloader itself is restarted (OOM, etc.). In that case it does in fact need to reload fluentd, because configs may have changed while the reloader was down.

alex-vmw commented 3 years ago

b) time="2021-10-14T05:46:41Z" level=error msg="cannot notify fluentd: Post \"http://127.0.0.1:24444/api/config.gracefulReload\": dial tcp 127.0.0.1:24444: connect: connection refused" - Reloader control loop should NOT try to issue a fluentd reload during first startup because it is NOT required and will always fail.

I now think that if we remove the http://127.0.0.1:24444/api/config.gracefulReload call during the first boot, there will be a problem whenever the reloader itself is restarted (OOM, etc.). In that case it does in fact need to reload fluentd, because configs may have changed while the reloader was down. It appears that we had better keep that call even if it fails 99% of the time. The other 1% of the time it would actually avoid an issue with updating the config.