Closed: vboulineau closed this issue 4 years ago
We are also running into this issue and would like to know if this is us or Rancher...
@vboulineau @heimdull As you mentioned, this is a problem for monitoring solutions like Datadog. I believe the underlying Datadog chart requires this to be set in that particular way.
I'm not sure how much effort it would take to update it in order to get Datadog working.
I think you will need to add /sys/fs/cgroup to the system-volumes service that both the docker and console system services effectively --volumes-from, if I am not mistaken.
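If you want to double-check that wiring on a node before changing anything, something like the following should show it (the service names docker, console, and system-volumes are the defaults on a stock install; they may differ on customised setups):

# Show which container the user Docker and console system services take their volumes from
sudo system-docker inspect --format '{{.HostConfig.VolumesFrom}}' docker console

# List the host paths the system-volumes container bind-mounts
sudo system-docker inspect --format '{{range .Mounts}}{{.Source}}:{{.Destination}}{{"\n"}}{{end}}' system-volumes

If /sys/fs/cgroup is missing from the second list, anything that relies on host cgroup metrics (like the Datadog agent) will not find the hierarchy it expects.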
@dweomer I added this to my cloud-init file:
rancher:
  services:
    system-volumes:
      volumes:
After restarting the nodes it looks like Datadog now has access to what it needs! Thanks
Hi @heimdull,
This does not fix the issue for us. Could you confirm that you were missing the container metrics from the containers live view page in Datadog?
We tried doing as you suggested in many ways including:
sudo ros config set rancher.services.system-volumes.volumes [/sys/fs/cgroup:/sys/fs/cgroup]
But we are still missing the metrics and we can see a lot of the following errors:
2020-03-22 12:40:38 UTC | PROCESS | DEBUG | (pkg/util/containers/metrics/cgroup_metrics.go:34 in Mem) | Missing cgroup file: /host/sys/fs/cgroup/memory/docker/d1647bd70a38c6e1dd8975cde6410cad6c96f2be5c60346aad9d4c55f2291e5e/kubepods/besteffort/podf6f278fb-bba2-491d-b61e-653179149451/70759db11e426a0d4323e1229754312f80fa5a5a99b0f0806ad83f20d340f53f/memory.stat
2020-03-22 12:40:38 UTC | PROCESS | DEBUG | (pkg/util/containers/metrics/cgroup_metrics.go:188 in CPU) | Missing cgroup file: /host/sys/fs/cgroup/cpu,cpuacct/docker/d1647bd70a38c6e1dd8975cde6410cad6c96f2be5c60346aad9d4c55f2291e5e/kubepods/besteffort/podf6f278fb-bba2-491d-b61e-653179149451/70759db11e426a0d4323e1229754312f80fa5a5a99b0f0806ad83f20d340f53f/cpuacct.stat
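For anyone else debugging this, a quick sanity check is to look at what actually exists on the node from inside the agent pod; the pod name below is a placeholder, and /host/sys/fs/cgroup is simply the prefix already visible in the log lines above:

# Placeholder pod name: use the Datadog agent pod scheduled on the affected node
kubectl exec datadog-agent-xxxxx -- ls /host/sys/fs/cgroup/memory/docker
# See what the top level of the memory hierarchy actually contains on that node
kubectl exec datadog-agent-xxxxx -- ls /host/sys/fs/cgroup/memory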
This did resolve the issue for us. We did not see metrics; then I added the cgroup mount to system-volumes and now we have CPU/memory metrics in Datadog. I added the setting through my cloud-init file and rebuilt all the nodes with the new setting.
I successfully validated the workaround. Note that you cannot just overwrite the rancher.services.system-volumes.volumes key, as other volumes are in there too. I did not find any way to have an auto-merge, so you basically need to take the defaults plus /sys/fs/cgroup.
With the latest version, this is what the file looks like:
rancher:
  services:
    system-volumes:
      volumes:
        - /dev:/host/dev
        - /etc/docker:/etc/docker
        - /etc/hosts:/etc/hosts
        - /etc/logrotate.d:/etc/logrotate.d
        - /etc/resolv.conf:/etc/resolv.conf
        - /etc/ssl/certs/ca-certificates.crt:/etc/ssl/certs/ca-certificates.crt.rancher
        - /etc/selinux:/etc/selinux
        - /lib/firmware:/lib/firmware
        - /lib/modules:/lib/modules
        - /run:/run
        - /usr/share/ros:/usr/share/ros
        - /var/lib/boot2docker:/var/lib/boot2docker
        - /var/lib/rancher/cache:/var/lib/rancher/cache
        - /var/lib/rancher/conf:/var/lib/rancher/conf
        - /var/lib/rancher:/var/lib/rancher
        - /var/lib/waagent:/var/lib/waagent
        - /var/log:/var/log
        - /var/run:/var/run
        - /sys/fs/cgroup:/sys/fs/cgroup
@dweomer What about adding this mount in the default configuration?
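For anyone provisioning new nodes instead of patching running ones: as far as I can tell the same block just goes into the node user data with the #cloud-config header on the first line, keeping the full default volume list, e.g.:

#cloud-config
rancher:
  services:
    system-volumes:
      volumes:
        # keep the full default list shown above, then append:
        - /sys/fs/cgroup:/sys/fs/cgroup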
Hi @vboulineau,
Thanks for your input. Could you clarify the following:
When you say "auto-merge", do you refer to updating the existing ros config while the VM is running using ros config export my-config, then manually updating the config and merging it with ros config merge -i my-config, followed by a reboot of the VM?
Secondly, did you validate the workaround by creating a new VM with the cloud-init file you pasted, or did you rebuild/reconfigure an existing VM?
Thank you very much, this would be extremely helpful for us!!
I did not export, but applied it with ros config merge -i my-config, with my-config being the file I pasted in the previous reply, and then rebooted the VM.
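To summarise the workaround for anyone landing here, the sequence on a running node is roughly (my-config being the full file pasted above, i.e. the default volumes plus /sys/fs/cgroup):

# Merge the pasted config (defaults + /sys/fs/cgroup) into the node's ros config
sudo ros config merge -i my-config
# Reboot so system-volumes is recreated with the extra mount
sudo reboot
# After the node is back, confirm the volume list now includes /sys/fs/cgroup
sudo ros config get rancher.services.system-volumes.volumes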
RancherOS Version: (ros os version) 1.5.5
Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.) AWS (Official AMI)
Hello,
Setting up an RKE cluster through the Rancher UI with a RancherOS node pool shows that cgroup paths in /proc/<pid>/cgroup are mostly incorrect for all containers created by Kubernetes, referring to paths that do not exist. The issue does not occur on another node pool running Ubuntu.
For instance, checking pods managed by Rancher itself:
On node vboulineau-rancher-worker1, we can see:
Checking cgroup paths for this process, we'll get:
Only the 11:name=systemd entry has the correct path. All others are prefixed by /docker/8ec3e6062e1825067fddaa64bfb839cb47579b4bfe49c4dd9d486ff81c35a479.
This path, /sys/fs/cgroup/<group>/docker/8ec3e6062e1825067fddaa64bfb839cb47579b4bfe49c4dd9d486ff81c35a479, does not exist. However, this id does not come from nowhere: it is the container id of the console container in the system Docker daemon:
So it probably has something to do with the specific way RancherOS runs containers.
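For anyone who wants to reproduce the check, this is roughly the sequence; the container id is a placeholder and will differ per node:

# Pick any Kubernetes-created container (via docker ps) and get its PID
PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)
# Cgroup paths the kernel reports for that process
cat /proc/$PID/cgroup
# See what actually exists at the top of the memory hierarchy on the host
ls /sys/fs/cgroup/memory
# The /docker/<id> prefix matches the console container in the system Docker daemon
sudo system-docker ps --no-trunc | grep console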
I'm not sure it's a bug, but it's definitely causing issues as several monitoring solutions rely on raw cgroup metrics to provide reliable statistics about container workloads.
If you don't believe it's a bug, maybe you could help us understand the underlying mechanisms that explain this behaviour, and a reliable way for us to resolve cgroup paths for a process.
For reference, on Ubuntu, we get: