open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

Auto-instrumentation lost on resumption of cluster from hibernation #1329

Open santoshkashyap opened 1 year ago

santoshkashyap commented 1 year ago

We have our K8s development clusters set to hibernate every day at the end of regular work hours; the cluster becomes active again the next day. We have set up the opentelemetry-operator on our cluster and configured the OpenTelemetry Collector as a DaemonSet, with the corresponding annotations on the pods (Java/NodeJS).

For Java:

```
# format <namespace/otel instrumentation CR>
instrumentation.opentelemetry.io/inject-java=dev-opentelemetry/opentelemetry-instrumentation
```

For NodeJS:

```
# format <namespace/otel instrumentation CR>
instrumentation.opentelemetry.io/inject-nodejs=dev-opentelemetry/opentelemetry-instrumentation
```
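
Applied to a workload, the annotation goes on the pod template (not the Deployment itself), so the operator's webhook sees it when the pod is created. A minimal sketch, with hypothetical Deployment name, namespace, and image:

```yaml
# Sketch: the inject annotation sits on the pod template, so the operator's
# webhook sees it at pod creation time. Deployment name, namespace, and
# image are hypothetical; the Instrumentation CR reference is from above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app          # hypothetical
  namespace: dev-apps       # hypothetical
spec:
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "dev-opentelemetry/opentelemetry-instrumentation"
    spec:
      containers:
        - name: app
          image: sample-app:1.0.0   # hypothetical
```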

With this setup, everything works fine. For example, for Java apps the Java agent is volume-mounted automatically; the agent instruments the application and ships traces to an OpenTelemetry Collector pod (created by the operator from the Collector CR), and the Collector pod ships the traces to our observability backend service. However, when the workloads resume the next day after hibernation, the whole setup seems to be lost (see screenshot below). Not sure why this happens? There is not much information in the application logs, the OpenTelemetry daemon pod logs, or even the opentelemetry-operator-controller-manager pod in the opentelemetry-operator-system namespace.
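
For reference, the Collector DaemonSet mentioned above comes from an OpenTelemetryCollector CR roughly along these lines (the name, namespace, and exporter endpoint here are placeholders):

```yaml
# Sketch of the Collector CR the operator renders into a DaemonSet.
# Name, namespace, and the exporter endpoint are placeholders.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: dev-opentelemetry
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      otlphttp:
        endpoint: https://observability-backend.example.com  # placeholder
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
```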

Container spec before hibernation: [screenshot]

After resumption from hibernation, the OpenTelemetry setup is lost: [screenshot]

Thanks in advance!!!

pavolloffay commented 1 year ago

Is the hibernation shutting down all pods? If that is the case, I would say that the OTEL operator starts after the application pods.

(The OTEL operator uses a mutating admission webhook to install the auto-instrumentation.)

Is there a way you could control the starting order of the pods, e.g. give the infra/OTEL operator pods a higher priority?

santoshkashyap commented 1 year ago

Thanks for the pointer. I will verify this and update you again tomorrow after another hibernation.

santoshkashyap commented 1 year ago

> Is the hibernation shutting down all pods? If that is the case, I would say that the OTEL operator starts after the application pods.

Yes, on our cluster this seems to be the case: the OTEL operator starts after the application pods. To mitigate this, I created a PriorityClass and assigned it to the opentelemetry-operator-controller-manager pod. I will update again tomorrow, after cluster hibernation, on whether this approach works.
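
For reference, the PriorityClass is roughly the following sketch (name and value are placeholders), plus `priorityClassName: otel-operator-priority` on the operator Deployment's pod spec:

```yaml
# Sketch of the PriorityClass; name and value are placeholders. Anything
# above 0 schedules ahead of the application pods, which default to 0.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: otel-operator-priority   # placeholder
value: 1000000
globalDefault: false
description: "Schedule the OpenTelemetry operator before application pods."
```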

pavolloffay commented 1 year ago

@santoshkashyap any news on this ticket?

Can we close it?

As a fix maybe we could set the priority class by default?

santoshkashyap commented 1 year ago

Unfortunately, this still doesn't seem to work.

I have assigned a higher priority to the operator: [screenshot]

The application pods still have no priority class assigned, hence default to 0.

With this setup, I still see application pods in the Running state while the OTEL operator is still in ContainerCreating:

[screenshot]

Also, even though the OTEL operator is scheduled early, it seems to take some time to complete container creation.

A workaround we are discussing is to have some kind of CronJob that runs daily to rollout-restart the application Deployments after resumption. Meanwhile, if there is anything I can try, please let me know.

jaronoff97 commented 10 months ago

@santoshkashyap is this still a problem? We've refactored how reconciliation works, which I think should help with this.

M1lk4fr3553r commented 6 months ago

Hi @jaronoff97, yes, this issue still exists in version 0.96.0. Feel free to ping me if you need any assistance in resolving this issue.

jaronoff97 commented 6 months ago

@M1lk4fr3553r do you have an easy way to reproduce this? I run the operator locally on a kind cluster with auto-instrumentation, and it idles and wakes up fine.

M1lk4fr3553r commented 6 months ago

I have created this chart to show the issue.
Once you deploy the chart, you will notice that the pod created from deployment-to-instrument has not been injected.
This occurs because the pod is created before the operator pod is ready to inject other pods (this behavior is documented here). There should ideally be a way to let the operator restart pods that should be injected but aren't.

jaronoff97 commented 6 months ago

@M1lk4fr3553r this is a limitation of our current webhook configuration. Right now we only get injection events on pod creation (see here), and I'm not sure of the best way to get around that. The Istio operator functions the same way; I wonder if they have a way of solving this issue... I'll ask around and see if there's anything we can do here.
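
For illustration, the pod webhook is registered for CREATE only, roughly like this (a sketch; the generated names vary, and required fields such as clientConfig are omitted):

```yaml
# Sketch of the relevant webhook registration: the rule only matches
# CREATE, so pods that already exist are never re-examined. Generated
# names vary; clientConfig and other required fields are omitted here.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation   # generated name varies
webhooks:
  - name: mpod.kb.io                      # pod-mutation webhook
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]            # no UPDATE: running pods are skipped
        resources: ["pods"]
```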

swiatekm commented 6 months ago

Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice.

If you'd like your Pods to wait until the operator starts and is able to inject instrumentation, you can set the webhook failurePolicy to Fail. The Pod will be rejected by the API Server, and its controller will start retrying until successful.

This is a dangerous setting, as by default it will reject ALL Pods, the operator itself included. If you go down this path, please make sure to also set objectSelector on the webhook to ignore your critical system services.
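
A sketch of what that combination looks like on the webhook configuration (the opt-out label key is a placeholder, and generated names vary):

```yaml
# Sketch: with failurePolicy Fail, pod creation is rejected while the
# webhook is unreachable, and the owning controller keeps retrying.
# The objectSelector exempts anything carrying an opt-out label, so
# critical pods (the operator included) can still start. The label key
# is a placeholder; other required fields are omitted here.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation   # generated name varies
webhooks:
  - name: mpod.kb.io
    failurePolicy: Fail                   # default is Ignore
    objectSelector:
      matchExpressions:
        - key: example.com/skip-otel-webhook   # placeholder opt-out label
          operator: DoesNotExist
```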

M1lk4fr3553r commented 6 months ago

> Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice.

I would not simply delete the pods; I was thinking of rollout-restarting the Deployment. That way a new pod can spin up before the old one is terminated, and there should be no danger of downtime.

Also, in any case, this should be an option that is off by default, since I doubt that anyone is shutting down their production cluster every day. For development and integration clusters, it does not seem uncommon to shut them down during non-working hours to save money.

swiatekm commented 6 months ago

I would suggest trying out the webhook settings first, since that seems like a more idiomatic solution to your problem. If you want a rolling restart of your Deployments/StatefulSets/DaemonSets, you can always create a Job with a simple Go program (or even a bash script) that waits until the operator is ready, and then takes care of the restarts. You then have control over what exactly happens to your workloads and in which order.
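
A minimal sketch of such a Job built around kubectl (names, namespaces, the label selector, and the ServiceAccount are placeholders; the ServiceAccount needs RBAC to read the operator Deployment and patch the application Deployments):

```yaml
# Sketch of the suggested Job: block until the operator (and thus its
# webhook) is available, then rolling-restart the opted-in workloads.
apiVersion: batch/v1
kind: Job
metadata:
  name: restart-after-resume
  namespace: dev-apps                     # placeholder
spec:
  template:
    spec:
      serviceAccountName: workload-restarter  # placeholder
      restartPolicy: OnFailure
      containers:
        - name: restarter
          image: bitnami/kubectl:latest   # any image with kubectl works
          command:
            - /bin/sh
            - -c
            - |
              # Wait until the operator rollout is complete.
              kubectl rollout status deployment/opentelemetry-operator-controller-manager \
                -n opentelemetry-operator-system --timeout=10m
              # Re-create the application pods so the webhook can inject them.
              kubectl rollout restart deployment -n dev-apps \
                -l restart-after-resume=true   # placeholder label
```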

KarstenWintermann commented 5 months ago

To my knowledge, https://github.com/Azure/AKS/issues/4002 currently prevents setting the objectSelector correctly on AKS through the Helm chart, which means that right now there is no reliable way of using auto-instrumentation and sidecar injection with the operator Helm chart on AKS. Also, I think the default settings in the Helm chart need to prevent this issue, since it initially isn't obvious.

The way this is done in Dapr (periodically checking for, and deleting, pods with missing injected sidecars) may not be perfect, but it works for me.
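
For anyone who wants to replicate that approach, a rough sketch as a CronJob; the opt-in label, names, and the injected init container name are assumptions (the name varies by operator version), and the ServiceAccount needs RBAC to list and delete pods:

```yaml
# Rough sketch of the Dapr-style watchdog: periodically find pods that
# opted in to instrumentation but are missing the injected init
# container, and delete them so their controllers recreate them through
# the webhook.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reinject-watchdog
  namespace: dev-apps                     # placeholder
spec:
  schedule: "*/10 * * * *"                # placeholder interval
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-watchdog   # placeholder
          restartPolicy: OnFailure
          containers:
            - name: watchdog
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Opt-in label and init container name are assumptions.
                  for p in $(kubectl get pods -n dev-apps \
                      -l needs-instrumentation=true -o name); do
                    injected=$(kubectl get "$p" -n dev-apps -o \
                      jsonpath='{.spec.initContainers[?(@.name=="opentelemetry-auto-instrumentation")].name}')
                    if [ -z "$injected" ]; then
                      kubectl delete "$p" -n dev-apps
                    fi
                  done
```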