jan-kantert opened 6 months ago
We found this by monitoring the Prometheus metric `certmanager_cainjector_missing_cert` from cert-manager, which instantly highlights the issue. Otherwise it's quite hard to see unless you check the kube-apiserver logs.
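For anyone who wants to catch this the same way, here is a minimal alerting-rule sketch built around that metric. The metric name is taken from the comment above; the group name, alert name, threshold and duration are placeholders, not from any chart:

```yaml
groups:
  - name: cert-manager-cainjector          # placeholder group name
    rules:
      - alert: CAInjectorMissingCert       # placeholder alert name
        # fires when the cainjector reports a missing Certificate for an
        # inject-ca-from annotation (metric name as reported above)
        expr: certmanager_cainjector_missing_cert > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "cert-manager cainjector cannot find a Certificate referenced for CA injection"
```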
The `cert-manager.io/inject-ca-from` annotation is meant to point to the name of a Certificate (as opposed to `cert-manager.io/inject-ca-from-secret`, which is meant to point to a Secret). The webhooks use `cert-manager.io/inject-ca-from`, so the value must be a Certificate name, not a Secret name. In this case, the value they're using (`opentelemetry/opentelemetry-operator-serving-cert`) appears to be correct.
/cc @swiatekm-sumo
/cc @Allex1
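For illustration, this is roughly how the two annotations differ. The object names below just mirror the value quoted above and are not taken from the rendered chart manifests:

```yaml
# Points at a Certificate, written as <namespace>/<certificate-name>.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-validating-webhook-configuration   # illustrative name
  annotations:
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert
---
# The -secret variant points at a Secret instead (not what is used here).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutating-webhook-configuration     # illustrative name
  annotations:
    cert-manager.io/inject-ca-from-secret: opentelemetry/some-tls-secret   # hypothetical Secret
```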
Well, Kubernetes disagrees. It does not work that way, and cert-manager complains about this setup. The upgrade webhooks do not work. We manually upgraded the CRDs and it works now, but the upgrade failed completely on both kOps-on-AWS and Azure AKS clusters.
I'd like to understand what the problem is exactly. Is it just the conversion webhook misbehaving, or all of them? The latter would be very confusing, as the changes in #1175 and #1176 haven't actually changed anything about the cert-manager setup for `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration`. The CRDs have changed, and I can see the conversion webhook causing problems, but everything else should be fine.

Can everyone experiencing problems post more details of their environment? The following would help:

- `values.yaml`
@swiatekm-sumo I'm not the OP, but I am experiencing a cert-related issue with the conversion webhook specifically. After upgrading to `operator-0.58.2` / `operator-chart-0.99.0` and following the upgrade instructions here, we're noticing this specific dry-run failure:

```
failed to prune fields: failed add back owned items: failed to convert pruned object at version opentelemetry.io/v1beta1: conversion webhook for opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector returned invalid metadata: invalid metadata of type <nil> in input object
```
When I check the operator pod's logs, I see many TLS handshake errors:

```
...
2024/05/29 16:52:13 http: TLS handshake error from 127.0.0.6:47149: EOF
2024/05/29 16:52:13 http: TLS handshake error from 127.0.0.6:60753: EOF
2024/05/29 16:59:28 http: TLS handshake error from 127.0.0.6:43895: EOF
2024/05/29 17:06:25 http: TLS handshake error from 127.0.0.6:43385: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:40295: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:56065: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:59265: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:45403: EOF
...
```
Your requested info:
Operator Helm Chart values.yaml
```
➜ helm get values otel-operator
USER-SUPPLIED VALUES:
kubeRBACProxy:
  enabled: false
manager:
  collectorImage:
    repository: otel/opentelemetry-collector-contrib
nodeSelector:
  node_pool: observability
replicaCount: 2
topologySpreadConstraints:
- labelSelector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: otel-operator
      app.kubernetes.io/name: opentelemetry-operator
  maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
- labelSelector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: otel-operator
      app.kubernetes.io/name: opentelemetry-operator
  maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
```
Previous Operator Helm Chart version
0.57.0 -> 0.58.2
cert-manager version
quay.io/jetstack/cert-manager-controller:v1.14.5
K8s Platform and version
GKE / 1.28.9
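Since both the dry-run failure and the TLS handshake errors above involve the conversion webhook, one useful thing to compare against is the conversion stanza on the OpenTelemetryCollector CRD. A rough sketch of what it looks like when CA injection has worked; the webhook service name and namespace here are assumptions, not values taken from the chart:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: opentelemetrycollectors.opentelemetry.io
  annotations:
    # must reference the chart's Certificate as <namespace>/<name>
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions:
        - v1
      clientConfig:
        service:
          name: opentelemetry-operator-webhook   # assumed service name
          namespace: opentelemetry
          path: /convert
        # populated by the cainjector; if it is missing or stale, the
        # apiserver cannot trust the webhook and conversions fail
        caBundle: <base64-encoded CA from the Certificate's secret>
```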
@parkedwards just to be sure I understand correctly, you upgrade the operator chart to `0.58.2`, and then you get the error when trying to apply `v1alpha1.OpenTelemetryCollector` resources? Can you also post your `v1alpha1.OpenTelemetryCollector` resource that you get the error for?
Exactly. The webhook TLS configuration is not right. The kube-apiserver complains that it cannot load the cert. cert-manager complains as well and reports it as not ready.
@swiatekm-sumo - small correction - we performed the operator/chart upgrade to `0.58.2` AND the `v1beta1.OpenTelemetryCollector` version update in the manifests (we use GitOps to manage them), as described in these instructions:
- Update any manifests you have stored outside the cluster, for example in your infrastructure git repository.
- Apply them, so they're all stored as v1beta1.
- Update the OpenTelemetryCollector CRD to only store v1beta1 (roughly the change shown in the sketch below)
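For reference, a rough sketch of what that last step amounts to on the CRD itself; the field names are standard `apiextensions.k8s.io/v1`, but the exact version list shipped by the chart may differ:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: opentelemetrycollectors.opentelemetry.io
spec:
  versions:
    - name: v1alpha1
      served: true      # still served so old clients keep working
      storage: false    # no longer the storage version
      # schema, subresources, etc. omitted
    - name: v1beta1
      served: true
      storage: true     # the only stored version after the migration
      # schema, subresources, etc. omitted
# note: the CRD's status.storedVersions also has to end up listing only
# v1beta1 once existing objects have been rewritten at the new version
```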
I can share one of my v1beta1 CR manifests here:
```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  creationTimestamp: "2023-09-26T14:44:25Z"
  generation: 76
  labels:
    app: otelcol-loadbalancer
    app.kubernetes.io/managed-by: opentelemetry-operator
    kustomize.toolkit.fluxcd.io/name: opentelemetry-collector
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: otelcol-loadbalancer
  namespace: opentelemetry
  resourceVersion: "543736253"
  uid: f4886fef-31a5-49cc-9f5a-3418913cdc5a
spec:
  config:
    exporters:
      loadbalancing:
        protocol:
          otlp:
            sending_queue:
              num_consumers: 100
              queue_size: 300
            timeout: 30s
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otelcol-gateway-collector.opentelemetry.svc.cluster.local
        routing_key: traceID
      prometheus:
        add_metric_suffixes: false
        endpoint: 0.0.0.0:9090
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      prometheus/internal:
        config:
          scrape_configs:
            - job_name: otelcol-loadbalancer
              scrape_interval: 15s
              static_configs:
                - targets:
                    - 0.0.0.0:8888
    service:
      pipelines:
        metrics/prometheus:
          exporters:
            - prometheus
          processors: []
          receivers:
            - prometheus/internal
        traces:
          exporters:
            - loadbalancing
          processors: []
          receivers:
            - otlp
      telemetry:
        logs:
          level: panic
          output_paths:
            - stdout
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  managementState: managed
  mode: deployment
  nodeSelector:
    node_pool: application
  observability:
    metrics: {}
  podAnnotations:
    sidecar.istio.io/proxyCPU: 850m
    sidecar.istio.io/proxyCPULimit: 1000m
    sidecar.istio.io/proxyMemory: 400Mi
    sidecar.istio.io/proxyMemoryLimit: 700Mi
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  replicas: 10
  resources:
    limits:
      cpu: 2500m
      memory: 1Gi
    requests:
      cpu: 1500m
      memory: 600Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: otelcol-loadbalancer
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      podMonitorSelector: {}
      scrapeInterval: 30s
      serviceMonitorSelector: {}
    resources: {}
  upgradeStrategy: automatic
status:
  image: otel/opentelemetry-collector-contrib:0.99.0
  scale:
    replicas: 10
    selector: app=otelcol-loadbalancer,app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=opentelemetry.otelcol-loadbalancer,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=otelcol-loadbalancer-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=latest,kustomize.toolkit.fluxcd.io/name=opentelemetry-collector,kustomize.toolkit.fluxcd.io/namespace=flux-system
    statusReplicas: 10/10
  version: 0.99.0
```
and then, for context, this is the v1alpha1 version (you can ignore the template functions; they're interpolated properly at apply time):
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otelcol-loadbalancer
  namespace: opentelemetry
  labels:
    app: otelcol-loadbalancer
spec:
  mode: deployment
  nodeSelector:
    node_pool: application
  podAnnotations:
    sidecar.istio.io/proxyCPU: 850m
    sidecar.istio.io/proxyCPULimit: 1000m
    sidecar.istio.io/proxyMemory: 400Mi
    sidecar.istio.io/proxyMemoryLimit: 700Mi
  replicas: ${otelcol_loadbalancer_replica_count:=1}
  serviceAccount: otelcol-loadbalancer
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      prometheus/internal:
        config:
          scrape_configs:
            - job_name: otelcol-loadbalancer
              scrape_interval: 15s
              static_configs:
                - targets: [0.0.0.0:8888]
    exporters:
      prometheus:
        endpoint: 0.0.0.0:9090
        add_metric_suffixes: false
      loadbalancing:
        # ensures that spans are consistently routed to
        # the same gateway replica based on traceID
        routing_key: traceID
        protocol:
          # all configuration options from the OTLP exporter
          # apply here (except the endpoint)
          otlp:
            timeout: 30s
            sending_queue:
              queue_size: ${otelcol_loadbalancer_exporter_queue_size:=500}
              num_consumers: ${otelcol_loadbalancer_exporter_num_consumers:=10}
            tls:
              insecure: true
        # uses the DNS resolver to find the gateway collector pods
        resolver:
          dns:
            hostname: otelcol-gateway-collector.opentelemetry.svc.cluster.local
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [loadbalancing]
        metrics/prometheus:
          receivers: [prometheus/internal]
          processors: []
          exporters: [prometheus]
      telemetry:
        logs:
          # https://pkg.go.dev/go.uber.org/zap/zapcore#Level
          level: ${otelcol_log_level:=panic}
          output_paths: [stdout]
  resources:
    requests:
      cpu: ${otelcol_loadbalancer_cpu_req}
      memory: ${otelcol_loadbalancer_mem_req}
    limits:
      cpu: ${otelcol_loadbalancer_cpu_limit}
      memory: ${otelcol_loadbalancer_mem_limit}
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
```
It would really help me if someone were able to reproduce this problem in a KiND cluster and post their reproduction steps. I've tried several combinations, including upgrading `0.56.0` -> `0.57.0` -> `0.58.0`, with cert-manager `1.6.1` and `1.14.1`, and always ended up in a valid state with the conversion webhook working correctly.
> Exactly. The webhook TLS configuration is not right. The kube-apiserver complains that it cannot load the cert. cert-manager complains as well and reports it as not ready.
Thing is, this configuration hasn't changed, and as far as I can see it has always been this way. It'd have to be a bug in CA injection specifically for CRDs (our current use of the cert-manager annotation is correct per their documentation), likely affecting only specific cert-manager versions, and possibly only a specific platform.

This is why I'm asking for as much information as possible about your environments. I'm hoping to see some commonality that would let me get to the bottom of things. I'd also really like to see the error logs you and @jan-kantert mentioned.

And on a related note, I see @parkedwards is using Istio. Anyone else in this thread also in that boat?
We use linkerd. I have to check the cert-manager version. We run Kubernetes 1.26 and the latest compatible cert-manager version.
The update breaks certificates for the `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration`. The `secretName` in the `Certificate` does not match the `cert-manager.io/inject-ca-from` annotation in the `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration`. This is the reason why other people complain that their migrations fail.

I am not sure if it is correct that those webhooks only handle `v1alpha1` and not `v1beta1`. Are they just for the migration, or do we need to worry that they do not work at all?
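To make the failure mode described above concrete, here is a hypothetical sketch of how the pieces are supposed to line up. None of these resource names were posted in this thread, so treat them purely as an illustration:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opentelemetry-operator-serving-cert          # hypothetical name
  namespace: opentelemetry
spec:
  # the Secret the webhook server must serve its TLS cert from
  secretName: opentelemetry-operator-webhook-cert    # hypothetical name
  dnsNames:
    - opentelemetry-operator-webhook.opentelemetry.svc
  issuerRef:
    kind: Issuer
    name: opentelemetry-operator-selfsigned-issuer   # hypothetical name
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-validating-webhook-configuration   # hypothetical name
  annotations:
    # must reference the Certificate above as <namespace>/<name>;
    # if the chart renames the Certificate or its Secret without updating
    # this annotation (or the webhook Deployment's mounted Secret), the
    # cainjector never fills in caBundle and the webhooks fail TLS as
    # described in this thread
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert
```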