kube-logging / logging-operator

Logging operator for Kubernetes
https://kube-logging.dev
Apache License 2.0

Problem with CR Flow without Output #615

Closed: RaSerge closed this issue 1 year ago

RaSerge commented 3 years ago

Describe the bug: When we create a Flow CR before its Output CR exists, or create an Output with a wrong secret mapping (example in the log below) and then create a Flow with outputRefs pointing to that Output, the operator's reconciliation gets stuck on that "wrong" Flow-without-Output resource and stops reconciling resources created after it.

Expected behaviour: The operator should ignore a Flow CR that references a non-existent Output (or ClusterOutput) CR.

Steps to reproduce the bug: just apply
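
A minimal reproduction might look like the following. This is a sketch, not taken from the original report: the names (`broken-flow`, `team-a`, `missing-output`) and label selector are illustrative, and older operator versions use `outputRefs` instead of `localOutputRefs`.

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: broken-flow        # illustrative name
  namespace: team-a        # illustrative namespace
spec:
  match:
    - select:
        labels:
          app: myapp       # illustrative selector
  localOutputRefs:
    - missing-output       # no Output named "missing-output" exists in this namespace
```

Applying a Flow like this, with no matching Output, is enough to trigger the described behavior.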


RaSerge commented 3 years ago

up

siliconbrain commented 3 years ago

If I understand your suggestion properly (and correct me if I'm wrong), you'd like the operator to ignore flows and outputs that have errors and use only the rest to generate the Fluentd config. We do not do this by design, as it would result in some missing logs until the erroneous resource configs are fixed, whereas if the Fluentd config is not generated at all, Fluent Bit will buffer logs until the config is generated once the errors are fixed.

RaSerge commented 3 years ago

Ok. Not ignoring flows without outputs sounds reasonable, but that's only the first part of the problem. Why does this case (a flow without an output) break the reconcile loop so that the operator starts to ignore other CRs applied afterwards? It looks like a FIFO pipe, but IMO the operator should continue reconciling correct CRs and still generate Fluentd configs. Our case: we have many teams that create and delete namespaces containing your operator's CRs. When one of the teams creates a wrong output, or just a flow without an output, it breaks log collection for the rest of them.

pepov commented 3 years ago

@RaSerge can you please elaborate more on what happens exactly?

So are you applying the resources from scratch or on top of an existing setup? Is the operator able to create a new configcheck pod? If you fix the error, does the operator recover from it?

Unfortunately, if an error exists in the system, the operator cannot decide what the correct action would be. If we leave a defective flow/output out of the config, Fluentd won't buffer messages for it.

My best idea here is that we could introduce configuration flags so that you can decide which kinds of errors you want to tolerate.

Anyway, I would recommend setting up a staging area where you can try out changes first.

RaSerge commented 3 years ago

@pepov

So are you applying the resources from scratch or on top of an existing setup? We had an initial logging-operator deployment controlled by the ops team. Developer teams only deploy Flow and Output CRs in their namespaces to configure log collection.

Is the operator able to create a new configcheck pod? The operator doesn't create a new configcheck pod by design: if it finds a failed or still-running configcheck pod, it doesn't start another check.

If you fix the error, does the operator recover from it? Yes, but we need to manually find these bad resources and delete or fix them. Sometimes (I cannot reproduce this with 100% probability) the operator just stops running its reconcile loop until we restart the operator pod.

The idea with configuration flags looks reasonable. Can I help with testing?

aslafy-z commented 3 years ago

I'd love to see this solution implemented so that it works in a multi-tenant environment where users can make mistakes, and we want those mistakes not to impact other users' configurations. Can I help somehow?

tboussar commented 2 years ago

We are very interested in this implementation for a multi-tenant environment, where people can make mistakes that impact other users. We can help with testing the solution.

tarokkk commented 2 years ago

We had a discussion about this issue. There are two things we can do in the short term:

  1. Export metrics about resource state. Based on these, an administrator can get an overview of the state of the logging subsystem, and it is easy to create alerts when there are faulty resources. For example:
    logging_resource_state{name="s3",namespace="logging",status="active", kind="ClusterFlow"} = 1
    logging_resource_state{name="test",namespace="default",status="inactive", kind="Flow"} = 1
  2. The configuration flag that @pepov mentioned. Unfortunately, it will be limited: if you have a syntactically valid but faulty configuration (the configcheck fails), it is hard to determine which resource introduced the problem. But we can add a property like SkipInvalidResources so that orphan resources won't interfere with other resources. This won't solve everything, but it is a step forward.

Later, it may be possible to do some "smart" configuration checking by eliminating flows from the configuration, but that raises a lot of questions and complexity. I hope the previous two items will help improve the user experience.
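
Assuming the metric shape sketched in item 1, an alerting rule could look like the following. This is an illustrative sketch only: the rule group name, alert name, and `for` duration are made up, and the metric itself is still a proposal at this point in the thread.

```yaml
groups:
  - name: logging-operator          # illustrative group name
    rules:
      - alert: LoggingResourceInactive   # illustrative alert name
        # fires for any Flow/ClusterFlow/etc. the operator marks inactive
        expr: logging_resource_state{status="inactive"} == 1
        for: 10m                    # illustrative grace period
        labels:
          severity: warning
        annotations:
          summary: 'Logging resource {{ $labels.kind }}/{{ $labels.namespace }}/{{ $labels.name }} is inactive'
```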

heojay commented 2 years ago

I am currently having the exact same problem raised in this issue, and the suggestion by @tarokkk seems to be a sufficient solution 👍

In particular, an option like SkipInvalidResource is exactly what I need, since we want to prevent one user's fault from blocking other users from applying new resources.
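
For reference, later comments in this thread refer to the flag as a field on the Logging resource. A hedged sketch of enabling it, assuming the field name `skipInvalidResources` (the resource name and namespace are illustrative):

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: default-logging       # illustrative name
spec:
  controlNamespace: logging   # illustrative namespace
  skipInvalidResources: true  # assumed field name; check your operator version's CRD
  fluentd: {}
  fluentbit: {}
```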

seongpyoHong commented 2 years ago

Although skipInvalidResources is set to true, only the controller skips invalid resources in its reconcile logic; Fluentd still tries to update the config using the invalid resources. This causes a configcheck error and no update to the Fluentd config.

I don't know whether this is expected behavior, but I think invalid resources should also be excluded when building the updated config.

aslafy-z commented 2 years ago

I see the same behavior, @seongpyoHong. I guess RegisterFlow has to be skipped if FlowForFlow returns an error (when SkipInvalidResources is enabled) here: https://github.com/banzaicloud/logging-operator/blob/4fb3eec0623c9a9780d451d7d01f29004a699e8c/pkg/resources/model/system.go#L82 WDYT @ahma @siliconbrain
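
The proposed fix can be sketched in isolation. This is a simplified model, not the operator's real code: the `flow` type, `buildConfig`, and `validate` are stand-ins for the types in `pkg/resources/model/system.go`, and only illustrate skipping a flow with a dangling output reference instead of failing the whole config build.

```go
package main

import (
	"errors"
	"fmt"
)

// flow models only the parts of a Flow CR relevant to this sketch
// (hypothetical type; the real one lives in the operator's model package).
type flow struct {
	Name       string
	OutputRefs []string
}

// validate returns an error if the flow references an output that does not exist.
func validate(f flow, outputs map[string]bool) error {
	for _, ref := range f.OutputRefs {
		if !outputs[ref] {
			return errors.New("dangling output reference: " + ref)
		}
	}
	return nil
}

// buildConfig collects the flows that go into the generated config.
// With skipInvalidResources enabled, an invalid flow is dropped and the
// rest are still processed; otherwise the first error aborts the build.
func buildConfig(flows []flow, outputs map[string]bool, skipInvalidResources bool) ([]string, error) {
	var active []string
	for _, f := range flows {
		if err := validate(f, outputs); err != nil {
			if skipInvalidResources {
				continue // skip this flow, keep the rest of the config
			}
			return nil, err
		}
		active = append(active, f.Name)
	}
	return active, nil
}

func main() {
	outputs := map[string]bool{"s3": true}
	flows := []flow{
		{Name: "good", OutputRefs: []string{"s3"}},
		{Name: "broken", OutputRefs: []string{"missing"}},
	}
	active, err := buildConfig(flows, outputs, true)
	fmt.Println(active, err) // [good] <nil>
}
```

With the flag disabled, the same input would instead return an error and no config at all, which matches the behavior this thread describes.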