Zurvarian opened this issue 2 months ago
Please try v0.108.0
Will do.
Trying with version v0.108.0 of contrib fixed the issue, thanks!
UPDATE: Sorry, something was misconfigured in my OTEL Collector and that's why it looked like it worked.
It is still failing
@Zurvarian your configuration works with v0.108.0 when I try it. Please retry your test, ensuring that the version being used is v0.108.0. If you are building a custom collector binary using the OpenTelemetry Collector Builder (ocb), make sure you are using the 0.108.0 version of that as well.
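For reference, an ocb manifest pins both the builder output and each component's module version; a minimal sketch (the dist name, output path, and component list here are illustrative) looks like:
dist:
  name: otelcol-custom
  output_path: ./otelcol-custom
  otelcol_version: 0.108.0
receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/simpleprometheusreceiver v0.108.0
processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/metricstransformprocessor v0.108.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.108.0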
Hello,
I've verified the versions and checked everything and I get the same results. To add more context:
This is the sidecar container spec, where you can see the Collector distribution and version I'm using (otel/opentelemetry-collector-contrib:0.108.0):
- args:
  - --feature-gates=-component.UseLocalHostAsDefaultHost
  - --config=env:OTEL_CONFIG
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: OTEL_CONFIG
    value: |
      receivers:
        prometheus_simple:
          collection_interval: 10s
          endpoint: localhost:9090
          use_service_account: true
      exporters:
        otlp:
          endpoint: opentelemetry-collector.monitoring:4317
          tls:
            insecure: true
      processors:
        metricstransform:
          transforms:
          - action: update
            include: (.*)$$
            match_type: regexp
            new_name: knative.dev_eventing.$${1}
      service:
        pipelines:
          metrics:
            exporters:
            - otlp
            processors:
            - metricstransform
            receivers:
            - prometheus_simple
  - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: OTEL_RESOURCE_ATTRIBUTES_POD_UID
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.uid
  - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: k8s.deployment.name=kafka-broker-dispatcher,k8s.deployment.uid=f5ef223e-4ca2-4264-bfe6-26d712629b9e,k8s.namespace.name=knative-eventing,k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=$(OTEL_RESOURCE_ATTRIBUTES_POD_UID),k8s.replicaset.name=kafka-broker-dispatcher-796dfff45d,k8s.replicaset.uid=960361fe-ae94-4edf-9dbd-c0e7eff29b72
  image: otel/opentelemetry-collector-contrib:0.108.0
  imagePullPolicy: IfNotPresent
  name: otc-container
  ports:
  - containerPort: 8888
    name: metrics
    protocol: TCP
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
  - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-9x4sj
    readOnly: true
This is the log I'm getting back:
And this is the Collector configuration that was applied:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"opentelemetry.io/v1beta1","kind":"OpenTelemetryCollector","metadata":{"annotations":{},"name":"knative-eventing-otel-collector","namespace":"knative-eventing"},"spec":{"config":{"exporters":{"otlp":{"endpoint":"opentelemetry-collector.monitoring:4317","tls":{"insecure":true}}},"processors":{"metricstransform":{"transforms":[{"action":"update","include":"(.*)$$","match_type":"regexp","new_name":"knative.dev_eventing.$${1}"}]}},"receivers":{"prometheus_simple":{"collection_interval":"10s","endpoint":"localhost:9090","use_service_account":true}},"service":{"pipelines":{"metrics":{"exporters":["otlp"],"processors":["metricstransform"],"receivers":["prometheus_simple"]}}}},"mode":"sidecar"}}
  creationTimestamp: "2024-08-28T12:37:35Z"
  finalizers:
  - opentelemetrycollector.opentelemetry.io/finalizer
  generation: 2
  labels:
    app.kubernetes.io/managed-by: opentelemetry-operator
  name: knative-eventing-otel-collector
  namespace: knative-eventing
  resourceVersion: "686767636"
  uid: 24f3f157-4696-4dc3-857f-1a1d2b53ff09
spec:
  args:
    feature-gates: -component.UseLocalHostAsDefaultHost
  config:
    exporters:
      otlp:
        endpoint: opentelemetry-collector.monitoring:4317
        tls:
          insecure: true
    processors:
      metricstransform:
        transforms:
        - action: update
          include: (.*)$$
          match_type: regexp
          new_name: knative.dev_eventing.$${1}
    receivers:
      prometheus_simple:
        collection_interval: 10s
        endpoint: localhost:9090
        use_service_account: true
    service:
      pipelines:
        metrics:
          exporters:
          - otlp
          processors:
          - metricstransform
          receivers:
          - prometheus_simple
  configVersions: 3
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: sidecar
  observability:
    metrics: {}
  podDnsConfig: {}
  replicas: 1
  resources: {}
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      scrapeInterval: 30s
    resources: {}
  upgradeStrategy: automatic
status:
  version: 0.107.0
Note that status.version here says 0.107.0. That's the OpenTelemetry Operator version, the latest as of yesterday. However, I'm forcing version 0.108.0 of the Collector in the Helm installation of the Operator.
Thanks in advance.
Something important to note is that I'm using the Contrib release, as opposed to the default one, because I need some of the extra components, like prometheus_simple above.
Could it be that the Contrib version lacks the fix for some reason? Should I open a ticket in the Contrib repository instead to investigate further?
More testing here:
I forced v0.104.0 with the feature gate -confmap.unifyEnvVarExpansion, as recommended in the old ticket, and the issue was still there.
Then I tried version v0.100.0 and it works.
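(For reference, feature gates are toggled on the collector command line; a leading minus disables a gate, which is only possible while the gate is still alpha or beta. Assuming a contrib binary:)
otelcol-contrib --config=config.yaml --feature-gates=-confmap.unifyEnvVarExpansion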
Adding even more context:
Something is off with the transformer I'm using: with version v0.100.0 it does not fail, but looking at the debug logs of the collector I can see that the captured value is not appended. This means that either my config is wrong somehow or the transformer is unable to capture groups using the regexp :/
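(To spell out the expectation: with match_type: regexp, the include pattern's capture groups should be usable in new_name. After all escaping is resolved, the processor should see a config like the following; the metric name event_count is an illustrative example:)
metricstransform:
  transforms:
  - include: (.*)$
    match_type: regexp
    action: update
    new_name: knative.dev_eventing.${1}
# expected rename: event_count -> knative.dev_eventing.event_count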
I can see the same issue:
Error: failed to get config: cannot resolve the configuration: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/09/02 10:38:25 collector server run finished with error: failed to get config: cannot resolve the configuration: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
metricstransform/test:
  transforms:
  - include: test_([^_]+)_(.*)$$
    match_type: regexp
    action: update
    new_name: test.$${1}_$${2}
Replacing $${1} or $${2} with any normal string works.
Using otel contrib v0.108.0, I tried to disable the feature gate, but at this version it is already stable, so it cannot be disabled.
@Zurvarian @bart-braidwell What collector distribution are you using? If you are building your own, please confirm your version of ocb and share your manifest file.
@TylerHelmuth I'm using the default contrib: otel/opentelemetry-collector-contrib@v0.108.0
@Zurvarian Are you sure the collector version you tried with is 0.108.0? The operator hasn't been updated yet so you either need to set the image via the OpenTelemetryCollector resource or via the operator startup flags.
@TylerHelmuth let me share the values.yaml that I'm using to install. Originally it used version 0.107.0, but when you asked me to try 0.108.0 I changed the installation to override the version that came with the template:
opentelemetry-operator:
  # Fullname override is required as otherwise the Operator Deployment will be named after the Helm release.
  fullnameOverride: cloudprogs-opentelemetry-operator
  manager:
    collectorImage:
      # Contrib collector includes all the community extensions; it is bigger, but we need some of the non-default ones.
      repository: otel/opentelemetry-collector-contrib
      # Default version for our current release is v0.107.0 but that one has a bug: https://github.com/open-telemetry/opentelemetry-operator/issues/3262
      tag: 0.108.0
    createRbacPermissions: true
And in the OpenTelemetryCollector sidecar container I can see image: otel/opentelemetry-collector-contrib:0.108.0.
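(One way to double-check which image a running sidecar actually uses, assuming the container is named otc-container as in the spec above; the pod name is a placeholder:)
kubectl get pod <pod-name> -n knative-eventing \
  -o jsonpath='{.spec.containers[?(@.name=="otc-container")].image}'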
@bart-braidwell are you using the OpenTelemetry Operator also?
I was able to reproduce the bug with the Operator's collector, but not with a standard collector. I need to investigate more to understand the root cause.
I was able to install a collector using the collector helm chart with the following config:
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  metricstransform:
    transforms:
    - include: (.*)$$
      match_type: regexp
      action: update
      new_name: knative.dev_eventing.$${1}
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [metricstransform]
      exporters: [debug]
The same config and collector image used via an OpenTelemetryCollector resource from the operator failed.
The operator is using an env var to supply the config, but I tested that delivery method locally and had no issues.
The Operator is definitely doing something here, but I haven't found what it is yet. The operator has a release this week to bump to 0.108.0; I will keep investigating after that.
I've been able to reproduce in the collector helm chart using this simplified (core only components) values.yaml:
mode: deployment
resources:
  limits:
    cpu: 200m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi
image:
  repository: otel/opentelemetry-collector-core
  tag: "0.108.0"
replicaCount: 1
configMap:
  create: false
command:
  extraArgs:
  - "--config=env:OTEL_CONFIG"
extraEnvs:
- name: OTEL_CONFIG
  value: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: "api.honeycomb.io:443"
        headers:
          "X-Honeycomb-Team": "$${1}"
    service:
      telemetry:
        logs:
          level: debug
      pipelines:
        traces:
          receivers:
          - otlp
          exporters:
          - otlp
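(For anyone reproducing: assuming the standard chart repo, these values can be installed with something like the following; the release name is arbitrary:)
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install repro open-telemetry/opentelemetry-collector -f values.yaml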
I still cannot reproduce the issue locally with an env var. The following works:
OTEL_CONFIG=$(cat ./bin/config.yaml) ./bin/otelcorecol_darwin_arm64 --config=env:OTEL_CONFIG
Where bin/config.yaml is:
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      "X-Honeycomb-Team": "$${1}"
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    traces:
      receivers:
      - otlp
      exporters:
      - otlp
> I've been able to reproduce in the collector helm chart using this simplified (core only components) values.yaml:
Should this be transferred to the helm-charts/operator repo then, @TylerHelmuth?
Not yet.
I can see why the problem is happening only in k8s: the input string has its $$ escaped to $. K8s is escaping the $$ to $, so the string that is passed to the collector via the OTEL_CONFIG env var only has ${1} instead of $${1}.
I was able to reproduce this behavior outside the collector with a basic app:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  selector:
    matchLabels:
      app: my-pod
  replicas: 1
  template:
    metadata:
      labels:
        app: my-pod
    spec:
      containers:
      - name: myapp
        image: alpine:latest
        args:
        - "env"
        env:
        - name: OTEL_CONFIG
          value: "$$"
When installed, it will produce logs similar to:
OTEL_CONFIG=$
The bottom of this section of the Kubernetes documentation specifies that $, which is a key symbol in k8s env vars that allows dependent env vars such as value: $(OTHER_ENV_VAR), can be escaped with $$.
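(A compact illustration of that escaping rule; the variable names are invented for the example:)
env:
- name: GREETING
  value: "hello"
- name: DEPENDENT
  value: "$(GREETING) world"  # expands to "hello world"
- name: ESCAPED
  value: "$$(GREETING) world" # $$ collapses to $, leaving the literal "$(GREETING) world"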
So this is not an issue with confmap, but an issue with how our string interacts with Kubernetes env vars. I believe the fix is for the Operator to do some massaging of the input string when it creates the env var.
@Zurvarian @bart-braidwell @cloudchristina the workaround for now is to do extra escaping in your OpenTelemetryCollector resource configs to account for both Kubernetes' escaping and the Collector's escaping. For example:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"opentelemetry.io/v1beta1","kind":"OpenTelemetryCollector","metadata":{"annotations":{},"name":"knative-eventing-otel-collector","namespace":"knative-eventing"},"spec":{"config":{"exporters":{"otlp":{"endpoint":"opentelemetry-collector.monitoring:4317","tls":{"insecure":true}}},"processors":{"metricstransform":{"transforms":[{"action":"update","include":"(.*)$$","match_type":"regexp","new_name":"knative.dev_eventing.$${1}"}]}},"receivers":{"prometheus_simple":{"collection_interval":"10s","endpoint":"localhost:9090","use_service_account":true}},"service":{"pipelines":{"metrics":{"exporters":["otlp"],"processors":["metricstransform"],"receivers":["prometheus_simple"]}}}},"mode":"sidecar"}}
  creationTimestamp: "2024-08-28T12:37:35Z"
  finalizers:
  - opentelemetrycollector.opentelemetry.io/finalizer
  generation: 2
  labels:
    app.kubernetes.io/managed-by: opentelemetry-operator
  name: knative-eventing-otel-collector
  namespace: knative-eventing
  resourceVersion: "686767636"
  uid: 24f3f157-4696-4dc3-857f-1a1d2b53ff09
spec:
  args:
    feature-gates: -component.UseLocalHostAsDefaultHost
  config:
    exporters:
      otlp:
        endpoint: opentelemetry-collector.monitoring:4317
        tls:
          insecure: true
    processors:
      metricstransform:
        transforms:
        - action: update
          include: (.*)$$
          match_type: regexp
          new_name: knative.dev_eventing.$$${1}
    receivers:
      prometheus_simple:
        collection_interval: 10s
        endpoint: localhost:9090
        use_service_account: true
    service:
      pipelines:
        metrics:
          exporters:
          - otlp
          processors:
          - metricstransform
          receivers:
          - prometheus_simple
  configVersions: 3
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: sidecar
  observability:
    metrics: {}
  podDnsConfig: {}
  replicas: 1
  resources: {}
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      scrapeInterval: 30s
    resources: {}
  upgradeStrategy: automatic
status:
  version: 0.107.0
@open-telemetry/operator-approvers we could add logic to https://github.com/open-telemetry/opentelemetry-operator/blob/ba2eba68196349a390d56904ff6afbfa79b38af4/internal/manifests/collector/config_replace.go#L87 to replace all instances of $$ with $$$. This would break users who have already figured out this double escaping hell and are already doing $$$.
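(A rough sketch of what that logic could look like, as an illustration of the idea rather than the actual operator change; the function name is invented:)
package collector

import "strings"

// escapeConfigDollarSigns is a hypothetical helper: it rewrites the rendered
// collector config before the operator places it where Kubernetes will apply
// env var expansion. Kubernetes collapses "$$" to "$", which destroys the
// collector's own "$$" escape, so each "$$" is turned into "$$$":
//
//	user writes "$${1}" -> operator emits "$$${1}"
//	k8s expansion yields "$${1}" -> collector resolves it to a literal "${1}"
func escapeConfigDollarSigns(config string) string {
	// Caveat from the discussion above: users already working around the bug
	// by writing "$$$" themselves would end up double-escaped by this.
	return strings.ReplaceAll(config, "$$", "$$$")
}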
Amazing, thanks! @TylerHelmuth
Describe the bug
Same as https://github.com/open-telemetry/opentelemetry-collector/issues/10585, but I'm reproducing the issue with version v0.107.0, which supposedly has a fix for it.
Something to note is that I'm running the Collector as a sidecar created with the Operator; I'm not sure if that affects it somehow.
Steps to reproduce
Create a sidecar with the Operator using a config that has a regexp capture-group replacement, as in the minimal example below.
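A minimal resource along those lines, assuming a scrape target on localhost:9090 (names here are illustrative):
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: repro-collector
spec:
  mode: sidecar
  config:
    receivers:
      prometheus_simple:
        endpoint: localhost:9090
    processors:
      metricstransform:
        transforms:
        - include: (.*)$$
          match_type: regexp
          action: update
          new_name: prefixed.$${1}
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus_simple]
          processors: [metricstransform]
          exporters: [debug]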
What did you expect to see?
The regexp capture group is replaced correctly.
What did you see instead?
An error like
cannot resolve the configuration: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
What version did you use?
v0.107.0
What config did you use?
Environment
Linux in GCP.
Additional context