open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Python autoinstrumentation for musl libc based application containers #2264

Open ilyamochalov opened 10 months ago

ilyamochalov commented 10 months ago

Component(s)

instrumentation

Is your feature request related to a problem? Please describe.

Python autoinstrumentation for musl libc based application containers fails with the following error:

#16 2.190 ImportError: Error relocating /autoinstrumentation/psutil/_psutil_linux.abi3.so: __sched_cpufree: symbol not found
#16 2.191 Failed to auto initialize opentelemetry
#16 2.191 Traceback (most recent call last):
#16 2.191   File "/autoinstrumentation/opentelemetry/instrumentation/auto_instrumentation/sitecustomize.py", line 39, in initialize
#16 2.191     _load_instrumentors(distro)
#16 2.191   File "/autoinstrumentation/opentelemetry/instrumentation/auto_instrumentation/_load.py", line 91, in _load_instrumentors
#16 2.191     raise exc
#16 2.191   File "/autoinstrumentation/opentelemetry/instrumentation/auto_instrumentation/_load.py", line 87, in _load_instrumentors
#16 2.191     distro.load_instrumentor(entry_point, skip_dep_check=True)
#16 2.191   File "/autoinstrumentation/opentelemetry/instrumentation/distro.py", line 62, in load_instrumentor
#16 2.191     instrumentor: BaseInstrumentor = entry_point.load()
#16 2.191   File "/autoinstrumentation/pkg_resources/__init__.py", line 2518, in load
#16 2.191     return self.resolve()
#16 2.191   File "/autoinstrumentation/pkg_resources/__init__.py", line 2524, in resolve
#16 2.191     module = __import__(self.module_name, fromlist=['__name__'], level=0)
#16 2.191   File "/autoinstrumentation/opentelemetry/instrumentation/system_metrics/__init__.py", line 79, in <module>
#16 2.191     import psutil
#16 2.191   File "/autoinstrumentation/psutil/__init__.py", line 102, in <module>
#16 2.191     from . import _pslinux as _psplatform
#16 2.191   File "/autoinstrumentation/psutil/_pslinux.py", line 25, in <module>
#16 2.191     from . import _psutil_linux as cext
#16 2.191 ImportError: Error relocating /autoinstrumentation/psutil/_psutil_linux.abi3.so: __sched_cpufree: symbol not found

Root cause: the current autoinstrumentation image is built against glibc, so its compiled extensions (such as psutil) cannot be loaded on musl libc.

Describe the solution you'd like

  1. Add an extra build stage that uses an Alpine base image at https://github.com/open-telemetry/opentelemetry-operator/blob/v0.87.0/autoinstrumentation/python/Dockerfile#L12
  2. Copy the musl-built instrumentation library into the final image under a separate path: https://github.com/open-telemetry/opentelemetry-operator/blob/v0.87.0/autoinstrumentation/python/Dockerfile#L22
  3. Add an extra annotation: instrumentation.opentelemetry.io/otel-python-auto-runtime: "linux-musl-x64"
  4. Update https://github.com/open-telemetry/opentelemetry-operator/blob/main/pkg/instrumentation/python.go with the changes needed to copy and load the correct dependencies
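
To make the proposal more concrete, here is a rough sketch of the selection logic from steps 3 and 4. It is written in Python for brevity, although the operator itself is written in Go. The annotation key is the one proposed in step 3; the "linux-x64" default value and both image paths are hypothetical placeholders, not names taken from the operator.

```python
# Annotation key as proposed in step 3 of this issue.
RUNTIME_ANNOTATION = "instrumentation.opentelemetry.io/otel-python-auto-runtime"

# Hypothetical runtime identifiers and image paths, for illustration only.
PACKAGE_PATHS = {
    "linux-x64": "/autoinstrumentation",            # glibc build (assumed default)
    "linux-musl-x64": "/autoinstrumentation-musl",  # musl build (assumed path)
}
DEFAULT_RUNTIME = "linux-x64"


def select_package_path(pod_annotations: dict) -> str:
    """Pick which copy of the instrumentation packages to inject into the pod."""
    runtime = pod_annotations.get(RUNTIME_ANNOTATION, DEFAULT_RUNTIME)
    if runtime not in PACKAGE_PATHS:
        # Fail loudly instead of silently injecting an incompatible build.
        raise ValueError(f"unsupported Python auto-instrumentation runtime: {runtime!r}")
    return PACKAGE_PATHS[runtime]
```

With this shape, pods without the annotation keep today's behaviour, and only pods that opt in get the musl build.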

Describe alternatives you've considered

No response

Additional context

A similar change was made for .NET.

TylerHelmuth commented 10 months ago

Unlike dotnet, I believe this is a fault of the docker image we supply, not the instrumentation itself.

@open-telemetry/operator-approvers I think we need to make a concrete decision on which auto-instrumentation images we supply. For all appropriate languages, will we supply both musl- and glibc-based images? Or is dotnet a one-off case because of how the dotnet agent is supplied?

ilyamochalov commented 10 months ago

@TylerHelmuth thank you for checking this issue.

Error messages such as "_psutil_linux.abi3.so: __sched_cpufree: symbol not found" indicate that the psutil package (a dependency of the Python OTel packages) was built against a different C library implementation than the one in the application container (glibc vs. musl). When pip installs psutil, it installs a C extension that is built against a specific C library. Pip dependencies built against glibc won't work on musl systems.
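
To check which C library a given container uses, something like the following can be run inside it. This is only a minimal sketch: it assumes that musl's dynamic loader is installed at /lib/ld-musl-<arch>.so.1 (as on Alpine) and that platform.libc_ver() reports "glibc" on glibc systems. The function names are my own.

```python
import glob
import platform


def classify_libc(libc_name: str, musl_loaders: list) -> str:
    """Return 'glibc', 'musl', or 'unknown' from already-collected facts."""
    if libc_name == "glibc":
        return "glibc"
    if musl_loaders:
        return "musl"
    return "unknown"


def detect_libc() -> str:
    # platform.libc_ver() returns ("glibc", "<version>") on glibc systems;
    # on musl-based images it typically returns empty strings.
    libc_name, _version = platform.libc_ver()
    # Assumption: the musl loader lives at /lib/ld-musl-<arch>.so.1.
    return classify_libc(libc_name, glob.glob("/lib/ld-musl-*.so.1"))


if __name__ == "__main__":
    print(detect_libc())
```

If this prints "musl" in the application container while the init container was built on a glibc image, the psutil import error above is expected.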

The final autoinstrumentation images for .NET, Python, and other languages are simply one way to distribute language-specific auto-instrumentation libraries. I think that for languages whose runtime depends on the system C library, we need to build the auto-instrumentation libraries against both glibc and musl and ship both sets of artifacts. The OTel Kubernetes operator should then decide which artifact to inject into the app container.

TylerHelmuth commented 10 months ago

We discussed this issue during the SIG call today. We'd like to have a clean solution that auto-detects which libs to use and handles everything for the user, but we think finding a solution like that is unlikely.

Most likely we have to implement a dotnet-like solution where the user can specify the libs they need.

@srikanthccv do you or any other Python maintainers have any advice on how to handle this?

srikanthccv commented 10 months ago

I took a brief look at the dotnet solution. I think the same should work for Python as well. I will take some time to review the instrumentation side and see if there are any cases that require special handling.

ilyamochalov commented 10 months ago

@srikanthccv thank you for taking a look. I will proceed with my PR proposing changes to the operator and the instrumentation Docker image (please review the Dockerfile in the PR linked above).

ilyamochalov commented 10 months ago

@open-telemetry/operator-approvers the PR is ready. Could someone please review it? https://github.com/open-telemetry/opentelemetry-operator/pull/2266

pmcollins commented 2 months ago

I bumped into the psutil stack trace issue while exploring Python autoinstrumentation as defined by the files in the e2e-instrumentation/instrumentation-python directory.

It looks like the Dockerfiles for the default init container and for the test app (published at ghcr.io/open-telemetry/opentelemetry-operator/e2e-test-app-python:main) use binary-incompatible base images: one uses python3.11 (glibc) and the other alpine3.18 (musl).

pmcollins commented 2 months ago

Also, the collector configs defined in the instrumentation directories (e.g. tests/e2e-instrumentation/instrumentation-python/00-install-collector.yaml) don't specify a metrics receiver, but Python auto-instrumentation sends metrics, so the failed metrics exports show up as 404s in the logs. Adding a metrics receiver to the collector pipeline solves the problem.