kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

NGINX: Bump OpenTelemetry. #12371

Open matthias-haase opened 1 week ago

matthias-haase commented 1 week ago

I did:

# Set the new versions and rewrite the matching assignments in build.sh.
OPENTELEMETRY_CPP_VERSION="v1.17.0"
perl -pi -e "s/(OPENTELEMETRY_CPP_VERSION=).*/\${1}\"$OPENTELEMETRY_CPP_VERSION\"/" images/nginx/rootfs/build.sh
OPENTELEMETRY_PROTO_VERSION="v1.3.2"
perl -pi -e "s/(OPENTELEMETRY_PROTO_VERSION=).*/\${1}\"$OPENTELEMETRY_PROTO_VERSION\"/" images/nginx/rootfs/build.sh
OPENTELEMETRY_CONTRIB_COMMIT=f6d29426ee9b4d6b476c09ca3cb9bed3cf23906f
perl -pi -e "s/(OPENTELEMETRY_CONTRIB_COMMIT=).*/\${1}$OPENTELEMETRY_CONTRIB_COMMIT/" images/nginx/rootfs/build.sh
# Add abseil-cpp-crc-cpu-detect on a new continuation line after libprotobuf in the Dockerfile.
perl -pi -e 's/(libprotobuf.*)/$1\n abseil-cpp-crc-cpu-detect \\/' images/nginx/rootfs/Dockerfile

Ingress-NGINX 1.10.0 has dropped support for OpenTracing and Zipkin, favoring OpenTelemetry instead.

The OpenTelemetry module used by Ingress-NGINX is based on an old commit and has received updates since then.

With the old module, the span status is not set with "span->SetStatus(trace::StatusCode::kError);" when it should be.

By default the status is incorrectly set with "span->SetStatus(trace::StatusCode::kOk);" even when the trace contains an error (HTTP status code >= 500).

(In Datadog this is the metric trace.nginx.server.errors.)
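The expected behavior described above can be written down as a tiny rule. This is a sketch of the expectation stated in this description, not the module's actual code; the function name `expected_span_status` is hypothetical, and returning kOk for everything below 500 is an assumption made only to keep the example small.

```shell
#!/bin/sh
# expected_span_status HTTP_CODE
# Prints the span status this PR description expects for a response:
# kError for HTTP status codes >= 500, kOk otherwise (assumption).
expected_span_status() {
    if [ "$1" -ge 500 ]; then
        echo kError
    else
        echo kOk
    fi
}
```

For example, `expected_span_status 503` prints `kError`, while `expected_span_status 200` prints `kOk`.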

Applying these changes to Ingress-NGINX 1.11.2 in my branch fixed the trace error status: https://github.com/tsimonitoring/ingress-nginx/tree/release-1.11.3-patch-opentelemetry-cpp-and-contrib-and-proto

I tested this on my side with Datadog as an example.

The branch sets updated values for OPENTELEMETRY_CPP_VERSION, OPENTELEMETRY_PROTO_VERSION and OPENTELEMETRY_CONTRIB_COMMIT in build.sh and adds the apk package abseil-cpp-crc-cpu-detect to the NGINX Dockerfile.

Before (https://i.imgur.com/LpvotMx.png), the OpenTelemetry module shipped no error status metric.

After (https://i.imgur.com/xvz6b05.png), the error metric is shipped and is also visible in the trace view; see also this example (https://i.imgur.com/xEEY2Ep.png).

## What this PR does / why we need it:

Types of changes

Which issue/s this PR fixes

fixes # The correct value is not set according "span->SetStatus(trace::StatusCode::kError);".

How Has This Been Tested?

In Azure Kubernetes, with a test metric in Datadog.

Checklist:

k8s-ci-robot commented 1 week ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthias-haase

Once this PR has been reviewed and has the lgtm label, please assign cpanato for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubernetes/ingress-nginx/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 1 week ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
k8s-ci-robot commented 1 week ago

Hi @matthias-haase. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

netlify[bot] commented 1 week ago

Deploy Preview for kubernetes-ingress-nginx canceled.

Latest commit: dadb6027bb49be2615e834fe914da85fc02dca1e
Latest deploy log: https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/673746d71497ca0008f83be5

strongjz commented 1 week ago

/ok-to-test

strongjz commented 1 week ago

/kind feature
/priority backlog

strongjz commented 1 week ago

Why is abseil-cpp-crc-cpu-detect needed?

matthias-haase commented 1 week ago

@strongjz abseil-cpp-crc-cpu-detect is needed in the Dockerfile because a shared object depends on it:

otel_ngx_module.so -> libopentelemetry_exporter_otlp_grpc.so -> libabsl_crc_cpu_detect.so.2308.0.0

The error message makes this clear. Without libabsl_crc_cpu_detect.so.2308.0.0 you get:

-------------------------------------------------------------------------------
  Warning  RELOAD  14s (x16 over 64s)  nginx-ingress-controller  (combined from similar events): Error reloading NGINX:
-------------------------------------------------------------------------------
Error: exit status 1
2024/10/17 13:38:37 [emerg] 49#49: dlopen() "/etc/nginx/modules/otel_ngx_module.so" failed (Error loading shared library libabsl_crc_cpu_detect.so.2308.0.0: No such file or directory (needed by /usr/local/lib/libopentelemetry_exporter_otlp_grpc.so)) in /tmp/nginx/nginx-cfg3367704967:7
nginx: [emerg] dlopen() "/etc/nginx/modules/otel_ngx_module.so" failed (Error loading shared library libabsl_crc_cpu_detect.so.2308.0.0: No such file or directory (needed by /usr/local/lib/libopentelemetry_exporter_otlp_grpc.so)) in /tmp/nginx/nginx-cfg3367704967:7
nginx: configuration file /tmp/nginx/nginx-cfg3367704967 test failed

-------------------------------------------------------------------------------
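
When a reload fails like this, the name of the missing library can be pulled out of the controller output mechanically. The following is a minimal sketch, assuming the loader message has the form "Error loading shared library NAME:" as in the log above; the function name `missing_libs` is hypothetical.

```shell
#!/bin/sh
# missing_libs: reads NGINX error output on stdin and prints each distinct
# shared library named in an "Error loading shared library NAME:" message.
missing_libs() {
    # Capture everything between the fixed phrase and the next colon,
    # then drop duplicates (the same error is usually logged twice).
    sed -n 's/.*Error loading shared library \([^:]*\):.*/\1/p' | sort -u
}
```

Feeding the log above into `missing_libs` would print the single name `libabsl_crc_cpu_detect.so.2308.0.0`, which can then be matched to the package that provides it.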

Here are the repositories that work together:

OPENTELEMETRY_CPP_VERSION -> https://github.com/open-telemetry/opentelemetry-cpp/releases

+export OPENTELEMETRY_CPP_VERSION="v1.17.0"

OPENTELEMETRY_PROTO_VERSION -> https://github.com/open-telemetry/opentelemetry-proto/releases

+export OPENTELEMETRY_PROTO_VERSION="v1.3.2"

OPENTELEMETRY_CONTRIB_COMMIT -> https://github.com/open-telemetry/opentelemetry-cpp-contrib/releases

+export OPENTELEMETRY_CONTRIB_COMMIT=f6d29426ee9b4d6b476c09ca3cb9bed3cf23906f

OPENTELEMETRY_CONTRIB_COMMIT is the newest commit; no new version tag has been published there. :(

Answer: the build at OPENTELEMETRY_CONTRIB_COMMIT produces libopentelemetry_exporter_otlp_grpc.so, which needs abseil-cpp-crc-cpu-detect. The abseil-cpp-crc-cpu-detect package installs the required libabsl_crc_cpu_detect.so.2308.0.0.

Proof:

src|opentelemetry-cpp-contrib.git $ find . -type f|xargs grep libopentelemetry_exporter_otlp_grpc
./opentelemetry-cpp-contrib.git/instrumentation/otel-webserver-module/build.gradle:    from("${modDepDir}/opentelemetry/${cppSDKVersion}/lib/libopentelemetry_exporter_otlp_grpc.so") { it.into "sdk_lib/lib" }
./opentelemetry-cpp-contrib.git/instrumentation/otel-webserver-module/opentelemetry_module.conf:LoadFile /opt/opentelemetry-webserver-sdk/sdk_lib/lib/libopentelemetry_exporter_otlp_grpc.so

The other repos do not contain an entry like "libopentelemetry_exporter_otlp_grpc".
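
The same check can be repeated across several checkouts at once. This is a minimal sketch, assuming the repositories are cloned into local directories; the function name `repos_mentioning` and any directory names passed to it are hypothetical.

```shell
#!/bin/sh
# repos_mentioning PATTERN DIR...
# Prints each given directory that contains at least one file
# mentioning PATTERN as a fixed string.
repos_mentioning() {
    pattern=$1
    shift
    for dir in "$@"; do
        # -r: recurse, -q: exit status only, -F: treat PATTERN literally
        if grep -rqF -- "$pattern" "$dir" 2>/dev/null; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Called with the library name and one directory per repository, it lists only the checkouts that reference that library.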

@strongjz I hope this helps to get the change merged faster, because:

With Azure Kubernetes version 1.31 there is pressure to use the newest NGINX.

Problem: the newest NGINX uses OpenTelemetry instead of OpenTracing, but the trace error status is not shipped correctly.

Because we need correct monitoring with tracing, this blocks the move to newer NGINX versions.

And that in turn blocks the move to Kubernetes 1.31, which is where the pressure from Azure comes from.

That's why I created the pull request: https://github.com/kubernetes/ingress-nginx/pull/12371

How can a change of 3 lines in build.sh and 1 line in the Dockerfile in images/nginx/rootfs/ be integrated with minimal delay? Can you help? Thanks a lot to everyone for the help!

Gacko commented 6 days ago

Also we are currently in the process of releasing v1.12. This change won't make it in there and will be included in v1.13 at the earliest. Additionally we are currently working on bumping NGINX to OpenResty v1.27, so let's just postpone this until we have bumped NGINX itself and then try to integrate it based on that.

matthias-haase commented 6 days ago

> Can you please come up with a more descriptive PR title? This goes into the commit message on main branch and the changelog on release. Also we prefer tagged releases over just picking latest.

Hello, I'm a newbie. This is my first PR, but not my last. I'm an open source fan and will contribute more PRs if I can help. Can you suggest a better title and tell me what I have to do to change the PR title (a bot command or something else)? I don't know what I can do. :( Thanks a lot! I suggest another title like this: 'This fix resolves shipping correct value in traces according "span->SetStatus(trace::StatusCode::kError);".' Is that OK, or do you have any hint?

matthias-haase commented 6 days ago

> Also we are currently in the process of releasing v1.12. This change won't make it in there and will be included in v1.13 at the earliest. Additionally we are currently working on bumping NGINX to OpenResty v1.27, so let's just postpone this until we have bumped NGINX itself and then try to integrate it based on that.

Is there a way to also update releases v1.10, v1.11 and v1.12? Many deployments use a "buggy" OpenTelemetry module that does not ship the trace error status correctly. Can I create new PRs against v1.10, v1.11 and v1.12, like I did on the main branch?

Background:

This would help to test the fix with current Kubernetes deployments and to move to new versions without risk in e2e tests. It also means that if a rollback is needed, you can use the patch with older versions.

Hint: With Azure Kubernetes version 1.31 there is pressure to use the newest NGINX. Problem: the newest NGINX uses OpenTelemetry instead of OpenTracing, but the trace error status is not shipped correctly. Because we need correct monitoring with tracing, this blocks the move to newer NGINX versions. That's why I created the pull request: https://github.com/kubernetes/ingress-nginx/pull/12371

How can a change of 3 lines in build.sh and 1 line in the Dockerfile in images/nginx/rootfs/ be integrated with minimal delay, including in v1.10, v1.11 and v1.12?

Thanks for any hint regarding older and current versions. You are the best!

Gacko commented 6 days ago

Please do not file separate PRs on different branches. Back-porting changes is up to the maintainers of this project.

Also I'd like to note that at least I'm feeling a little pushed by you. I understand this might be urgent to you (or your employer), but still we are all doing this in our free time and are responsible for maintaining changes brought to us by contributors. So I'd like to ask you for patience while we are reviewing your proposal thoroughly.

Gacko commented 6 days ago

One additional note, also to other maintainers: The compilation of the NGINX base image is still broken at the moment. I'd highly appreciate not merging any changes to it as long as it hasn't been fixed. This just makes it more complicated.

matthias-haase commented 6 days ago

> Please do not file separate PRs on different branches. Back-porting changes is up to the maintainers of this project.
>
> Also I'd like to note that at least I'm feeling a little pushed by you. I understand this might be urgent to you (or your employer), but still we are all doing this in our free time and are responsible for maintaining changes brought to us by contributors. So I'd like to ask you for patience while we are reviewing your proposal thoroughly.

Yes, of course. You're right. Thank you for reviewing the pull request.

tsimonitoring commented 3 days ago

@Gacko

This fix resolves shipping correct value in traces according "span->SetStatus(trace::StatusCode::kError);"

Fix renamed. OK?

Gacko commented 3 days ago

Yes, sure. As already mentioned we cannot merge this for now as we still have some other changes on the NGINX base image in the pipeline. Thank you for your contribution so far!

Gacko commented 3 days ago

I updated the title as this is what you're actually doing. The exact intention can be found in the details, but the effect to users is literally just the bump.

matthias-haase commented 3 days ago

@Gacko

> I updated the title as this is what you're actually doing. The exact intention can be found in the details, but the effect to users is literally just the bump.

Thank you. Do I have to do anything now? Or is everything fine and the commit will be merged without any action from my side? Sorry, I'm a newbie.

I ask because I see the line: "1 change requested: Gacko requested changes".

Gacko commented 3 days ago

We will take care from here. As stated earlier we first need to fix the NGINX compilation anyway.

matthias-haase commented 3 days ago

@Gacko

> We will take care from here. As stated earlier we first need to fix the NGINX compilation anyway.

Is there a link to a log? Can I help with the problem?

Gacko commented 3 days ago

There's a log, sure, but it's basically just an architecture / cross-compiling issue we need to figure out. Already had this in the past.

Anyway, just for clarification and to not disappoint any expectations: Our current roadmap and priorities look as follows:

  1. Fix NGINX build.
  2. Get v1.12 released.
  3. Merge other PRs related to the NGINX base image itself (like bumping it to v1.27.1).
  4. Rebase & merge this PR on it.
  5. Sometime in the future: Release a v1.13 which also includes this.

So there is currently no hurry to get this PR merged, as we have other priorities for now and probably won't have it in v1.12.