open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

[receiver/hostmetrics] permission denied on Linux #20435

Open flenoir opened 1 year ago

flenoir commented 1 year ago

Describe the bug

I want to collect process metrics from a Linux machine, so I'm running a collector as an agent with the hostmetrics receiver. When the service starts, I get errors from the "process" scraper: a permission denied error is returned for all PIDs.

Steps to reproduce

1. Be root on the Ubuntu system.
2. Download v0.74.0 of the contrib collector deb file (otel-contrib-collector_0.74.0_amd64.deb).
3. Install the contrib collector: dpkg --install otel-contrib-collector_0.74.0_amd64.deb
4. Configure it to collect host metrics (specifically, process data) via the hostmetrics receiver and process scraper.

What did you expect to see? No errors

What did you see instead? Every minute, an error message is generated complaining about error reading process name ... permission denied for seemingly every PID on the machine:

error reading process name for pid 1165232: readlink /proc/1165232/exe: permission denied; error reading process name for pid 1165265: readlink /proc/1165265/exe: permission denied; error reading process name for pid 1166088: readlink /proc/1166088/exe: permission denied; error reading process name for pid 1166634: readlink /proc/1166634/exe: permission denied; error reading process name for pid 1166826: readlink /proc/1166826/exe: permission denied; error reading process name for pid 1166827: readlink /proc/1166827/exe: permission denied; error reading process name for pid 1166874: readlink /proc/1166874/exe: permission denied; error reading process name for pid 1168213: readlink /proc/1168213/exe: permission denied; error reading process name for pid 1168214: readlink /proc/1168214/exe: permission denied; error reading process name for pid 1168221: readlink /proc/1168221/exe: permission denied; error reading process name for pid 1168222: readlink /proc/1168222/exe: permission denied", "scraper": "process"}

What version did you use? v0.74.0 of the contrib collector (https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.74.0/otelcol-contrib_0.74.0_linux_amd64.deb)

What config did you use? config.yaml

extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
      http:

  opencensus:
  hostmetrics:
    collection_interval: 30s
    root_path: /
    scrapers:
      cpu:
      memory:
      load:
      filesystem:
      network:
      paging:
      process:
      processes:

  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
      - job_name: 'otel-collector-toto'
        scrape_interval: 10s
        static_configs:
        - targets: ['0.0.0.0:8888']

  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:

  zipkin:

processors:
  batch:
  resource: 
    attributes:
    - key: service.name
      value: machine_toto
      action: upsert
    - key: service.namespace
      value: fl001
      action: upsert
    - key: namespace
      value: test_pc_toto0001
      action: upsert
    - key: cluster
      value: linux_ubuntu
      action: upsert
  resourcedetection:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]

exporters:
  logging:
    verbosity: detailed
  otlp/tempo:
    endpoint: [myendpoint-hidden]:80
    tls:
      insecure: true

service:

  pipelines:
    traces:
      receivers: [otlp, opencensus, jaeger, zipkin]
      processors: [batch]
      exporters: [logging, otlp/tempo]

    metrics:
      receivers: [otlp, opencensus, prometheus, hostmetrics]
      processors: [resource, resourcedetection, batch]
      exporters: [logging, otlp/tempo]

  extensions: [health_check, pprof, zpages]

service file

[Unit]
Description=OpenTelemetry Collector Contrib
After=network.target

[Service]
EnvironmentFile=/etc/otelcol-contrib/otelcol-contrib.conf
ExecStart=/usr/bin/otelcol-contrib $OTELCOL_OPTIONS
KillMode=mixed
Restart=on-failure
Type=simple
User=otelcol-contrib
Group=otelcol-contrib

[Install]
WantedBy=multi-user.target

If I add "sudo" to ExecStart, or if I change User to "root", the error changes to:

1163933: readlink /proc/1163933/exe: no such file or directory; error reading process name for pid 1163935: readlink /proc/1163935/exe: no such file or directory; error reading username for process \"gjs\" (pid 1163938): user: unknown userid 1472934163; error reading process name for pid 1164151: readlink /proc/1164151/exe: no such file or directory; error reading process name for pid 1164366: readlink /proc/1164366/exe: no such file or directory; error reading username for process \"brave\" (pid 1165232): user: unknown userid 1472934163; error reading process name for pid 1165263: readlink /proc/1165263/exe: no such file or directory; error reading process name for pid 1165265: readlink /proc/1165265/exe: no such file or directory; error reading username for process \"sudo\" (pid 1166027): user: unknown userid 1472934163; error reading username for process \"grep\" (pid 1166028): user: unknown userid 1472934163; error reading username for process \"sudo\" (pid 1166035): user: unknown userid 1472934163; error reading process name for pid 1166088: readlink /proc/1166088/exe: no such file or directory", "scraper": "process"}

Environment OS: Ubuntu 22.04

Additional context N/A

I should also mention that I found a similar closed issue, but it didn't help me resolve the problem.

github-actions[bot] commented 1 year ago

Pinging code owners for receiver/hostmetrics: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

mx-psi commented 1 year ago

Relates to/duplicates #18923 #18232

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jskiba commented 1 year ago

There are two additional options for the process scraper in v0.75.0:

mute_process_exe_error: <true|false>
mute_process_io_error: <true|false>

You can use them to mute these errors. This version also allows scraping all processes, without dropping the processes whose exe it could not read.

So I think it can be closed @dmitryax
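
For illustration, here is roughly where those two options would sit in the config from the original report. This is a minimal sketch that assumes collector v0.75.0 or later and only shows the relevant part of the hostmetrics receiver; the other scrapers and pipelines stay as they were.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    root_path: /
    scrapers:
      process:
        # Assumes v0.75.0+, per the comment above.
        # Suppresses "readlink /proc/<pid>/exe: permission denied" errors.
        mute_process_exe_error: true
        # Suppresses errors from reading /proc/<pid>/io.
        mute_process_io_error: true
```

Note that these flags only silence the log messages; they do not grant the collector any additional access to /proc.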

OmprakashPaliwal commented 1 year ago

@jskiba These two options are not working for me. I am using go.opentelemetry.io/collector/receiver@v0.81.0/.

Please let me know if you need any inputs from my end. I see lots of errors like below

Error scraping metrics        {"kind": "receiver", "name": "hostmetrics/linux/localhost", "data_type": "metrics", "error": "error reading open file descriptor count for process \"systemd\" (pid 1): open /proc/1/fd: permission denied; error reading pending signals for process \"systemd\" (pid 1): open /proc/1/fd: permission denied; error reading open file descriptor count for process \"kthreadd\" (pid 2): open /proc/2/fd: permission denied; error reading pending signals for process \"kthreadd\" (pid 2): open /proc/2/fd: permission denied; error reading open file descriptor count for process \"kworker/0:0H\" (pid 4): open /proc/4/fd: permission denied; error reading pending signals for process \"kworker/0:0H\" (pid 4): open /proc/4/fd: permission denied; error reading open file descriptor count for process \"ksoftirqd/0\" (pid 6): open /proc/6/fd: permission denied; error reading pending signals for process \"ksoftirqd/0\"

andrzej-stencel commented 1 year ago

@OmprakashPaliwal is correct: when you run the collector as non-root and enable one of the optional metrics process.open_file_descriptors or process.signals_pending, you get a permission error because the collector process tries to read the /proc/[pid]/fd files of processes that are not owned by the user running the collector. As a result, those two metrics are only generated for the processes that are owned by the user running the collector.
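
As a sketch of the configuration that triggers this, the two optional metrics would be switched on under the process scraper roughly as follows. The `metrics.<name>.enabled` layout is the usual per-metric toggle for collector scrapers and is an assumption here, since the commenter's actual config is not shown in the thread.

```yaml
receivers:
  hostmetrics:
    scrapers:
      process:
        metrics:
          # Both metrics are optional. Enabling them makes the scraper
          # read /proc/<pid>/fd for every process on each collection.
          process.open_file_descriptors:
            enabled: true
          process.signals_pending:
            enabled: true
```

Leaving both at `enabled: false` should avoid the /proc/[pid]/fd reads described above, at the cost of not reporting those metrics at all.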

The solution is to give the collector process read access to files in /proc/[pid]/fd directories. Unfortunately, regular Linux file permission settings don't seem to work on files in the /proc directory.

The only way I was able to fix it (other than running the collector as root, which also fixes this issue) is to add the CAP_DAC_READ_SEARCH Linux capability on the collector binary with:

sudo setcap 'cap_dac_read_search=ep' /path/to/the/collector/binary

⚠️ Warning: This capability gives the collector binary the ability to read any file on the filesystem. See here for examples of how this can be exploited: https://book.hacktricks.xyz/linux-hardening/privilege-escalation/linux-capabilities#cap_dac_read_search

andrzej-stencel commented 1 year ago

Thanks @mx-psi, I closed this accidentally by merging that PR.

To close this issue, I believe we need to make it possible to mute the errors that occur when scraping the process.open_file_descriptors or process.signals_pending metric.

One way to do this is to add another mute_... configuration property to the scraper. There are already three available, and I'm not sure if adding a fourth is a good idea. Also, I'm not sure how it should be named. Should we have separate options for each metric name - mute_open_file_descriptors_error and mute_signals_pending_error?

ringerc commented 8 months ago

On Kubernetes, you'd add the DAC_READ_SEARCH capability under the container's securityContext.capabilities.add. I haven't verified that it resolves the issue, though, as the workload permissions in my env don't permit that capability (for good reasons).

I've tried CAP_DAC_OVERRIDE and it doesn't seem sufficient.
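
A minimal sketch of what adding that capability could look like in a Pod spec follows. The pod name, container name, and image tag are hypothetical placeholders, and as noted above this has not been verified to resolve the errors; whether an added capability takes effect for a non-root container user may also depend on the runtime.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: otel-collector-agent        # hypothetical name
spec:
  containers:
    - name: otelcol                 # hypothetical name
      image: otel/opentelemetry-collector-contrib:0.81.0   # assumed image and tag
      securityContext:
        capabilities:
          # Kubernetes capability names omit the CAP_ prefix.
          add: ["DAC_READ_SEARCH"]
```

The same security warning given for setcap applies: this capability lets the container read any file it can reach.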

But this is also something the collector should tolerate gracefully. There's no point flooding the log with repeated, predictable errors. The current mute error flags are insufficient, if it's to be simply muted.

mustafa0x commented 5 months ago

"There's no point flooding the log with repeated, predictable errors."

This is a problem. journalctl -f -u otelcol-contrib is flooded with these errors. Every 5 seconds 182 lines are written, all of the format error reading disk usage for process "<process-name>" (pid <id>): open /proc/<id>/io: permission denied;