open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.53k stars 1.48k forks source link

Hostmetrics Receiver throws 'error reading process name' error for process scraper #3004

Closed pranavmarla closed 2 years ago

pranavmarla commented 3 years ago

Describe the bug I'm using the OTEL contrib collector to collect host metrics from an Ubuntu machine. However, when I add the process scraper to the hostmetrics receiver config, it throws an error reading process name ... permission denied error for seemingly every PID in my system.

I haven't specifically reproduced this with the core collector but, since the issue is with the core hostmetrics receiver, I assume the bug is present in the core collector as well.

Steps to reproduce

What did you expect to see? No errors

What did you see instead? Every minute, an error message is generated complaining about error reading process name ... permission denied for seemingly every PID on the machine:

Apr 23 15:34:37 ip-10-249-29-79 otelcontribcol[...]: 2021-04-23T15:34:37.264-0400 error scraperhelper/scrapercontroller.go:206 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading process name for pid 2: readlink /proc/2/exe: permission denied; error reading process name for pid 3: readlink /proc/3/exe: permission denied; ...]"}

What version did you use? v0.24.0 of the contrib collector (otel-contrib-collector_0.24.0_amd64.deb)

What config did you use?

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers: 
      process:
...

Environment OS: Ubuntu 18.04

Additional context N/A

Jammicus commented 3 years ago

By default I was not able to duplicate this, however with some manual hacking I was.

To replicate this I removed the other users read permission from process 1. Eg sudo chmod u-r /proc/1/exe This resulted in the following:

Apr 25 14:19:23 vagrant otelcontribcol[1912]: 2021-04-25T14:19:23.322Z        error        scraperhelper/scrapercontroller.go:206        Error scraping metrics        {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading process name for pid 2: readlink /proc/2/exe: permission denied; error reading process name for pid 4: readlink /proc/4/exe: .........

Removing that permission seemed to prevent all subsequent processes from being read. A few more debugging steps would be useful if you have time @pranavmarla:

  1. Whats the permissions on process 1 for you? Can get this with sudo ls -la /proc/1/exe
  2. What is the process name? Can get this withsudo cat /proc/1/cmdline, I suspect its /sbin/init
  3. Is this system locked down/have any additional security added to it?
pranavmarla commented 3 years ago

Thanks @Jammicus ! I've added the info you requested below:

  1. > sudo ls -la /proc/1/exe
    lrwxrwxrwx 1 root root 0 Apr 26 17:32 /proc/1/exe -> /lib/systemd/systemd
  2. > sudo cat /proc/1/cmdline
    /sbin/init
  3. This system is an AWS EC2 machine. I'm afraid I don't know if there has been any additional security added to it -- I personally haven't added any.

    > uname -a
    Linux ip-... 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Hope that helps -- let me know if you need any more info!

Jammicus commented 3 years ago

Thanks, managed to reproduce.

Otel user is trying to get information on processes that have been started by other users. If otel had any other processes started it would be the only one you see in the logs.

Switching to the root user instead of otel to run the service (Don't recommend doing this) kind of resolved this, though a new error message decided to show its face:

error reading process name for pid 2: readlink /proc/2/exe: no such file or directory;

Can also replicate by changing the otel user to have a login shell and trying to readlink the exe, which will return a non 0 exit code.

I suspect the original error is ultimately due to permissions, but I did notice that the library in the collector used for this receiver will just return an error straight away compared to some which just continue what I suspect would be a permission error. This makes me wonder if other metrics are throwing permission errors but are being hidden.

Ideally you would not want to run the collector as the root user so I'm not sure what the best steps for this should be. It should also be noted the EXE process in this library has not implemented for OSX so might also cause problems for users on different machines: https://github.com/shirou/gopsutil/tree/539c0d52c022dd05770517674d4f0cd3eaa311da#process-class

ajsaclayan commented 3 years ago

Hi, is there a workaround for this? I'm experiencing this in a Windows environment as well. I'm using version 0.26.0 of the collector.

norbertwnuk commented 3 years ago

I experience similar issue in K8S, the problem is related to inability to read /proc data for APP process(es) by OTEL binary (in other containers on the POD). All the attempts to make it working with shareProcessNamespace, SYS_PTRACE, apparmor unconfined failing, even SYS_ADMIN does not work on GKE. The same time I can access /proc data from console using the same account...

Edit: More careful analysis showing that all per process metrics are gathered (with only shareProcessNamespace & SYS_PTRACE), it seems to be an error in logs only...

image

apiVersion: apps/v1
kind: Deployment
spec:
  ...
  template:
    metadata:
      annotations:
        container.apparmor.security.beta.kubernetes.io/otel-collector: unconfined
    spec:
      shareProcessNamespace: true
      securityContext:
        runAsUser: 0
      containers:
        - name: application
          ...
        - name: otel-collector
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE
"2021-07-14T17:28:18.721Z   ERROR   scraperhelper/scrapercontroller.go:206  Error scraping metrics  {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading username for process \"java\" (pid 7): user: unknown userid 185; error reading username for process \"/bin/bash\" (pid 192): user: unknown userid 185; error reading username for process \"otel-collector\" (pid 228): user: unknown userid 185]"}"
2021-07-14T17:33:51.858Z    ERROR   scraperhelper/scrapercontroller.go:206  Error scraping metrics  {"kind": "receiver", "name": "hostmetrics", "error": "error reading process name for pid 1: readlink /proc/1/exe: permission denied"}
[root@XXX]# readlink /proc/8/exe

/usr/lib/jvm/java-11-openjdk-11.0.10.0.9-4.el8_3.x86_64/bin/java
yuryu commented 3 years ago

Just adding a data point, this issue reproduces on Ubuntu 20.04 as well.

gregbuehler commented 3 years ago

Maybe AppArmor is preventing access? Can you check dmesg for apparmor="DENIED" messages? If you run aa-status do you see AppArmor enabled and if so is the collector in enforce mode?

yuryu commented 3 years ago

I don't see any apparmor="DENIED" in the kernel log. I see this on Debian 11 with the following configurations:

$ sudo aa-status
apparmor module is loaded.
apparmor filesystem is not mounted.
$ sudo systemctl status apparmor
● apparmor.service - Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/

And

[    0.100011] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-9-cloud-amd64 root=UUID=<uuid> ro console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 elevator=none apparmor=0 mitigations=off
danielcompton commented 3 years ago

Note that in Google Cloud, this is an expected/non-harmful error (though I think it should still be fixed):

The following logs are examples of non-harmful log spam that you can safely ignore.

codeconsole commented 3 years ago

expected/non-harmful error? The fact that it is generating ~7.5 MB of spam in both my daemon log and syslog per day sure seems harmful to me (not to mention my syslog is now unreadable). That's 450 MB of spam per month per google cloud instance I am running. Not sure how this made it to GA as is and replaced the Legacy agent on google cloud :(

pratheekrebala commented 3 years ago

Hey @codeconsole, FWIW: I dropped the following configuration in rsyslog.d to make my syslog readable (and avoid storing these messages).

/etc/rsyslog.d/99-exclude-otel.conf:

if ($programname contains "otelopscol") {                   
   action(type="omfile" file="/var/log/dropped-msgs.log")
   stop
}

systemctl restart rsyslog.d

This doesn't address the larger issue here but it would make rsyslog.d more usable in the mean time...

yuryu commented 3 years ago

I looked around and it looks like the errors I'm getting are caused by the receiver trying to read exe file names of kernel threads that don't have file names. It was not a permission error. I don't think these cases should be treated as errors.

Whether a certain process is a kernel thread can be checked using the 9th field of the stat file (if 0x00200000 is set) on Linux, however this check is not supported by gopsutil. I think the options include

What do y'all think?

quentinmit commented 2 years ago

I think silently discarding all process scrape errors is the best option. There might be other cases (SELinux, AppArmor, containers, network filesystems, Windows kernel protections, etc.) that cause parts of a process's status to be unreadable. While a user might conceivably want to debug that, it happens on ~all systems for one reason or other so it shouldn't be part of anything that gets logged by default.

qingling128 commented 2 years ago

+1 for mute the errors here (potentially with a flag guard like mute_invalid_process_scraping_error: true if needed): https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/bb386a3b9e59691aa619a1fadb5db4ed56b7955d/receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go#L147.

Braydon from our side is planning to send a PR. Will see what the community thinks.

pranavmarla commented 1 year ago

Note that in Google Cloud, this is an expected/non-harmful error (though I think it should still be fixed):

The following logs are examples of non-harmful log spam that you can safely ignore.

FYI, the above link no longer contains the quoted info. Here's the correct updated link I found: https://cloud.google.com/logging/docs/agent/ops-agent/troubleshoot-run-ingest#log-spam

lalithkumar1196 commented 1 year ago

how did you solve this error?

  1. If i run as a otel user mean im getting, error reading process name for pid 1: readlink /proc/1/exe: permission denied;
  2. If i run as a root user mean im getting, error reading process name for pid 1: readlink /proc/1/exe: no file or directory;
quisse commented 1 year ago

@lalithkumar1196 I'm experiencing the same thing

lalithkumar1196 commented 1 year ago

@quisse i got a solution for that issue

ifalex commented 11 months ago

@lalithkumar1196 which?