Closed RaSerge closed 1 year ago
up
If I understand your suggestion properly — and correct me if I'm wrong —, you'd like the operator to ignore flows and outputs that have errors and use only the rest to generate the Fluentd config. We do not do this by design, as this would result in some missing logs until the erroneous resource configs are fixed, while if the Fluentd config is not generated at all, Fluentbits will buffer logs until the config is generated after fixing the errors.
Ok.Do not ignore flows without outputs looks reasonable. But it's first first part of a problem. Why this case (flow without output) broke reconcile loop and operator starts to ignore another (applied after) cr's? It's looks like a fifo pipe, but imo operator should continue reconcile correct cr's and still generate fluend configs. Our case: we have many teams that creates and deletes ns with your operators cr's. When one of a teams create wrong output or just a flow without output that broke logs collecting for rest of them.
@RaSerge can you please elaborate more on what happens exactly?
So are you applying the resources from scratch or on top of an existing setup? Does the operator able to create a new configcheck pod? If you fix the error does the operator recover from it?
Unfortunately if an error exists in the system the operator cannot decide what would be the correct action to take. If we leave out a defective flow/output from the config fluentd won't buffer messages for that.
My best idea here is that we may introduce configuration flags so that you can decide what kind of errors you may wan't to tolerate.
Anyways I would recommend setting up a staging area where you can first try out changes.
@pepov
So are you applying the resources from scratch or on top of an existing setup? We had initial logging-operator deployment controlled by ops team. Developer teams only deploys flows and outputs cr in their namespaces to configure logs collecting.
Does the operator able to create a new configcheck pod? operator don't create new configcheck pod by design - if they found failed configcheck pod or running configcheck pod it don't start another checks.
If you fix the error does the operator recover from it? yes. but we need to manually found this bad resources and delete or fix them. sometimes (i cannot reproduce this with 100% probability) operator just stop do reconcile loop until we restart operator pod.
Idea with configuration flags looks reasonable. Can i help with testing?
I'd love to see this solution implemented for it to work in a multi-tenant environment where users can make mistakes and we want these not to impact other users configurations. Can I help somehow?
we are very interested in this implementation for a multitenant environment, where people can do mistake and impact other users. We can help with testing the solution.
We had a discussion about this issue. There are 2 things that we can do in short term.
logging_resource_state{name="s3",namespace="logging",status="active", kind="ClusterFlow"} = 1
logging_resource_state{name="test",namespace="default",status="inactive", kind="Flow"} = 1
SkipInvalidResoruce
so orphan resources won't interfere with other resources. This won't solve everything but a step forward.Later it is possible to do some "smart" configuration checking by eliminating flows from the configuration but that raises a lot of questions and complexities. I hope that the previous 2 items will help to make the user experience better.
I am currently having the exact same problem raised in this issue and the suggestion by @tarokkk seems to be a sufficient solution 👍
Especially, the option like SkipInvalidResource
is totally what I need since we want to prevent other users from failing to apply new resources due to the fault of one user.
Although skipInvalidResources
set true, only controller skips invalid resources in reconcile logic, but fluentd tries to update the config using invalid resources. It causes config-check error and no update for fluentd config.
I don't know it is expected behavior, but I think invalid resources will also exclude when making an updated config.
I see the same behavior @seongpyoHong. I guess RegisterFlow has to be skipped if FlowForFlow returns an error (in the case SkipInvalidResources is enabled) here: https://github.com/banzaicloud/logging-operator/blob/4fb3eec0623c9a9780d451d7d01f29004a699e8c/pkg/resources/model/system.go#L82 Wdyt @ahma @siliconbrain
Describe the bug: When we create Output CR before Flow CR, or create Output with wrong secret mapping (example in log bellow) then create Flow with OutputRefs on that Output. Operator reconcile stuck on that "wrong" Flow-noOutput resources and not reconcile following resources created after that.
Expected behaviour: Operator ignore Flow CR without existing Output (ClusterOutput) CR
Steps to reproduce the bug: just apply
Additional context: Add any other context about the problem here.
Environment details:
/kind bug