Closed pranavmarla closed 2 years ago
By default I was not able to duplicate this, however with some manual hacking I was.
To replicate this I removed the other users read permission from process 1. Eg sudo chmod u-r /proc/1/exe
This resulted in the following:
Apr 25 14:19:23 vagrant otelcontribcol[1912]: 2021-04-25T14:19:23.322Z error scraperhelper/scrapercontroller.go:206 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading process name for pid 2: readlink /proc/2/exe: permission denied; error reading process name for pid 4: readlink /proc/4/exe: .........
Removing that permission seemed to prevent all subsequent processes from being read. A few more debugging steps would be useful if you have time @pranavmarla:
sudo ls -la /proc/1/exe
sudo cat /proc/1/cmdline
, I suspect its /sbin/initThanks @Jammicus ! I've added the info you requested below:
> sudo ls -la /proc/1/exe
lrwxrwxrwx 1 root root 0 Apr 26 17:32 /proc/1/exe -> /lib/systemd/systemd
> sudo cat /proc/1/cmdline
/sbin/init
This system is an AWS EC2 machine. I'm afraid I don't know if there has been any additional security added to it -- I personally haven't added any.
> uname -a
Linux ip-... 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Hope that helps -- let me know if you need any more info!
Thanks, managed to reproduce.
Otel user is trying to get information on processes that have been started by other users. If otel had any other processes started it would be the only one you see in the logs.
Switching to the root user instead of otel to run the service (Don't recommend doing this) kind of resolved this, though a new error message decided to show its face:
error reading process name for pid 2: readlink /proc/2/exe: no such file or directory;
Can also replicate by changing the otel user to have a login shell and trying to readlink the exe, which will return a non 0 exit code.
I suspect the original error is ultimately due to permissions, but I did notice that the library in the collector used for this receiver will just return an error straight away compared to some which just continue what I suspect would be a permission error. This makes me wonder if other metrics are throwing permission errors but are being hidden.
Ideally you would not want to run the collector as the root user so I'm not sure what the best steps for this should be. It should also be noted the EXE process in this library has not implemented for OSX so might also cause problems for users on different machines: https://github.com/shirou/gopsutil/tree/539c0d52c022dd05770517674d4f0cd3eaa311da#process-class
Hi, is there a workaround for this? I'm experiencing this in a Windows environment as well. I'm using version 0.26.0 of the collector.
I experience similar issue in K8S, the problem is related to inability to read /proc data for APP process(es) by OTEL binary (in other containers on the POD). All the attempts to make it working with shareProcessNamespace
, SYS_PTRACE
, apparmor unconfined
failing, even SYS_ADMIN
does not work on GKE. The same time I can access /proc data from console using the same account...
Edit: More careful analysis showing that all per process metrics are gathered (with only shareProcessNamespace & SYS_PTRACE), it seems to be an error in logs only...
apiVersion: apps/v1
kind: Deployment
spec:
...
template:
metadata:
annotations:
container.apparmor.security.beta.kubernetes.io/otel-collector: unconfined
spec:
shareProcessNamespace: true
securityContext:
runAsUser: 0
containers:
- name: application
...
- name: otel-collector
securityContext:
capabilities:
add:
- SYS_PTRACE
"2021-07-14T17:28:18.721Z ERROR scraperhelper/scrapercontroller.go:206 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading username for process \"java\" (pid 7): user: unknown userid 185; error reading username for process \"/bin/bash\" (pid 192): user: unknown userid 185; error reading username for process \"otel-collector\" (pid 228): user: unknown userid 185]"}"
2021-07-14T17:33:51.858Z ERROR scraperhelper/scrapercontroller.go:206 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "error": "error reading process name for pid 1: readlink /proc/1/exe: permission denied"}
[root@XXX]# readlink /proc/8/exe
/usr/lib/jvm/java-11-openjdk-11.0.10.0.9-4.el8_3.x86_64/bin/java
Just adding a data point, this issue reproduces on Ubuntu 20.04 as well.
Maybe AppArmor is preventing access? Can you check dmesg
for apparmor="DENIED"
messages? If you run aa-status
do you see AppArmor enabled and if so is the collector in enforce mode?
I don't see any apparmor="DENIED" in the kernel log. I see this on Debian 11 with the following configurations:
$ sudo aa-status
apparmor module is loaded.
apparmor filesystem is not mounted.
$ sudo systemctl status apparmor
● apparmor.service - Load AppArmor profiles
Loaded: loaded (/lib/systemd/system/apparmor.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:apparmor(7)
https://gitlab.com/apparmor/apparmor/wikis/home/
And
[ 0.100011] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-9-cloud-amd64 root=UUID=<uuid> ro console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 elevator=none apparmor=0 mitigations=off
Note that in Google Cloud, this is an expected/non-harmful error (though I think it should still be fixed):
The following logs are examples of non-harmful log spam that you can safely ignore.
- Errors scraping metrics from pseudo-processes or restricted processes -- https://cloud.google.com/logging/docs/agent/ops-agent/troubleshooting#log-spam
expected/non-harmful error? The fact that it is generating ~7.5 MB of spam in both my daemon log and syslog per day sure seems harmful to me (not to mention my syslog is now unreadable). That's 450 MB of spam per month per google cloud instance I am running. Not sure how this made it to GA as is and replaced the Legacy agent on google cloud :(
Hey @codeconsole, FWIW: I dropped the following configuration in rsyslog.d to make my syslog readable (and avoid storing these messages).
/etc/rsyslog.d/99-exclude-otel.conf
:
if ($programname contains "otelopscol") {
action(type="omfile" file="/var/log/dropped-msgs.log")
stop
}
systemctl restart rsyslog.d
This doesn't address the larger issue here but it would make rsyslog.d more usable in the mean time...
I looked around and it looks like the errors I'm getting are caused by the receiver trying to read exe file names of kernel threads that don't have file names. It was not a permission error. I don't think these cases should be treated as errors.
Whether a certain process is a kernel thread can be checked using the 9th field of the stat
file (if 0x00200000 is set) on Linux, however this check is not supported by gopsutil
. I think the options include
gopsutil
to support this checkgopsutil
to ignore readlink errorsWhat do y'all think?
I think silently discarding all process scrape errors is the best option. There might be other cases (SELinux, AppArmor, containers, network filesystems, Windows kernel protections, etc.) that cause parts of a process's status to be unreadable. While a user might conceivably want to debug that, it happens on ~all systems for one reason or other so it shouldn't be part of anything that gets logged by default.
+1 for mute the errors here (potentially with a flag guard like mute_invalid_process_scraping_error: true
if needed): https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/bb386a3b9e59691aa619a1fadb5db4ed56b7955d/receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go#L147.
Braydon from our side is planning to send a PR. Will see what the community thinks.
Note that in Google Cloud, this is an expected/non-harmful error (though I think it should still be fixed):
The following logs are examples of non-harmful log spam that you can safely ignore.
- Errors scraping metrics from pseudo-processes or restricted processes -- https://cloud.google.com/logging/docs/agent/ops-agent/troubleshooting#log-spam
FYI, the above link no longer contains the quoted info. Here's the correct updated link I found: https://cloud.google.com/logging/docs/agent/ops-agent/troubleshoot-run-ingest#log-spam
how did you solve this error?
@lalithkumar1196 I'm experiencing the same thing
@quisse i got a solution for that issue
@lalithkumar1196 which?
Describe the bug I'm using the OTEL contrib collector to collect host metrics from an Ubuntu machine. However, when I add the process scraper to the hostmetrics receiver config, it throws an error reading process name ... permission denied error for seemingly every PID in my system.
I haven't specifically reproduced this with the core collector but, since the issue is with the core hostmetrics receiver, I assume the bug is present in the core collector as well.
Steps to reproduce
sudo -s
v0.24.0
of the contrib collector deb file (otel-contrib-collector_0.24.0_amd64.deb)dpkg --install otel-contrib-collector_0.24.0_amd64.deb
What did you expect to see? No errors
What did you see instead? Every minute, an error message is generated complaining about error reading process name ... permission denied for seemingly every PID on the machine:
What version did you use?
v0.24.0
of the contrib collector (otel-contrib-collector_0.24.0_amd64.deb)What config did you use?
Environment OS:
Ubuntu 18.04
Additional context N/A