google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

cadvisor prevents docker from removing monitored containers? #771

Open cornelius-keller opened 9 years ago

cornelius-keller commented 9 years ago

Hi all, I have a problem using cadvisor on CentOS 7. When cadvisor is running, docker fails to remove other containers, saying that the container's filesystem is busy. After cadvisor is stopped, container removal works again.

I demonstrated that in this gist: https://gist.github.com/cornelius-keller/0fd2d23b68ccd88c9328

I also included os version and docker info in the gist.

rjnagal commented 9 years ago

Thanks for reporting, @cornelius-keller

What cadvisor version are you running? Can you get host:port/validate for cadvisor? Is this a temporary situation, or does the container fs stay busy until you delete cadvisor?

cornelius-keller commented 9 years ago

@rjnagal Cadvisor version is:

[root@583274-app35 ~]# docker images
REPOSITORY                                      TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
docker.io/google/cadvisor                       latest              399ae3c46a0e        47 hours ago        19.89 MB
[root@583274-app35 ~]# 

This is a permanent situation. The container fs stays busy until I delete cadvisor.

What do you mean by getting host:port/validate for cadvisor? Cadvisor was still running and responsive on the web ui if that is what you mean. Unfortunately I can't give you any public host port to validate as cadvisor is only exposed via a vpn.

rjnagal commented 9 years ago

Yeah, I just need the output from the /validate endpoint on the cadvisor UI. You can scrub any data that's private in there. Thanks.

cornelius-keller commented 9 years ago

Sorry, it was a long day; I did not get that this was an endpoint. I added the output to the gist.

gianlucaborello commented 9 years ago

I am facing this same issue. Essentially, running cadvisor with --volume=/:/rootfs:ro causes other containers' devicemapper mounts to be mounted inside the cadvisor container, so they can't be properly destroyed when issuing docker rm on the target container as they will appear in use.

How can this be solved?

hoeghh commented 9 years ago

When I run it on Fedora 21, it works fine. But when I run it on Ubuntu 14.04.2 LTS, I get the same error as described above.

Error response from daemon: Cannot destroy container xxx_jenkinsMaster_1230: Driver aufs failed to remove root filesystem 13b421d0458e740e42e5fa5ac1cb68f32638f0bc723d9ba16718955214d79b7d: rename /var/lib/docker/aufs/mnt/13b421d0458e740e42e5fa5ac1cb68f32638f0bc723d9ba16718955214d79b7d /var/lib/docker/aufs/mnt/13b421d0458e740e42e5fa5ac1cb68f32638f0bc723d9ba16718955214d79b7d-removing: device or resource busy

The main difference is that Ubuntu uses AUFS, whereas Fedora uses devicemapper. Maybe that's the problem.

shredder12 commented 9 years ago

@rjnagal I can confirm that this issue happens on Ubuntu trusty x64 with Docker 1.8.1, cadvisor:latest and devicemapper.

'1cb6051b30a1' being the container ID.

# grep -l 1cb6051b30a1 /proc/*/mountinfo
/proc/1963/mountinfo
# ps aux | grep -i 1963
root      1963  1.9  0.8 588740 71688 ?        Ssl  Aug26  30:08 /usr/bin/cadvisor
root     14767  0.0  0.0  11744   952 pts/0    S+   00:56   0:00 grep --color=auto -i 1963

Please suggest a workaround for this.
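
The check above can be wrapped in a small helper. This is only a sketch of the same grep over /proc/*/mountinfo, not something cAdvisor or Docker ships; the function name find_mount_holders and its optional second argument (an alternative root to scan instead of /proc) are made up for illustration.

```shell
#!/bin/sh
# find_mount_holders ID [PROC_ROOT]
# Print "PID COMMAND" for every process whose mount table mentions ID.
# PROC_ROOT defaults to /proc; it is a hypothetical parameter that only
# exists so the function can be pointed at a copy of the files.
find_mount_holders() {
    id=$1
    proc_root=${2:-/proc}
    for f in "$proc_root"/[0-9]*/mountinfo; do
        # Skip entries that vanished or that we are not allowed to read.
        [ -r "$f" ] || continue
        if grep -q -- "$id" "$f" 2>/dev/null; then
            pid=${f%/mountinfo}   # strip the trailing file name
            pid=${pid##*/}        # keep only the PID directory name
            # comm can be unreadable if the process exits in between.
            comm=$(cat "$proc_root/$pid/comm" 2>/dev/null || echo '?')
            printf '%s %s\n' "$pid" "$comm"
        fi
    done
}
```

Calling find_mount_holders 1cb6051b30a1 as root on the host should list the same PID that the grep found, together with the process name, which saves the separate ps step.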

difro commented 9 years ago

Same here with CentOS + Docker 1.8.1 (devicemapper).

Had to remove both --volume=/:/rootfs:ro and --volume=/var/lib/docker:/var/lib/docker:ro.

vishh commented 9 years ago

@rjnagal: Except for the disk usage calculation, cAdvisor does not poke at any of these directories, right?

hourliert commented 9 years ago

Same problem here with Ubuntu 14.04.3.

@difro's solution works, but cadvisor can't provide docker stats anymore.

Any workaround?

rmetzler commented 9 years ago

The last time I ran into this problem, I dug a little into the cAdvisor source code. I'm not 100% sure, because it was a few weeks ago, but this is essentially the gist:

If you use cAdvisor as shown in README.md, you'll mount /var/lib/docker as a volume into the container. This will create dead containers.

The reason cAdvisor wants you to mount /var/lib/docker is, as far as I could see, only to display certain info that is only interesting for admins and should be known beforehand.

jimmidyson commented 9 years ago

We should be able to get all info from a docker inspect rather than parsing the container config file. Seems like mounting /var/lib/docker is causing more trouble than it's worth.

svenmueller commented 9 years ago

We also encounter the same problem (cadvisor:latest, Ubuntu 14.04).

svenmueller commented 8 years ago

any updates regarding this?

vishh commented 8 years ago

The best we can do for now is to let users optionally disable filesystem usage metrics. We are waiting for some of the new upstream kernel features to simplify disk accounting.

tuxknight commented 8 years ago

Same situation. My Docker version is 1.9.1, cAdvisor version 0.18.0.

And when docker rm fails, the status of that container changes to "dead". Is it possible to umount that specific mountpoint when the container status changes to "exit" or "dead"?

arhea commented 8 years ago

+1

vishh commented 8 years ago

cAdvisor doesn't mount anything. It runs du periodically to collect filesystem stats. Other than that, it does not touch the container's filesystem at all. The easy fix for this would be to retry docker deletion or disable filesystem aggregation in cadvisor.
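
A minimal sketch of the "retry docker deletion" option, assuming the busy state is transient (several reporters in this thread describe it as permanent, in which case retrying will not help). The retry function, its argument order, and the container name in the usage comment are illustrative and not part of Docker or cAdvisor.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage (my_container is a placeholder name):
#   retry 5 2 docker rm my_container
```

The fixed delay keeps the sketch short; a longer or growing delay may be more appropriate if du runs take a while on large filesystems.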

maybetonyfu commented 8 years ago

Running cAdvisor without --volume=/:/rootfs:ro seems to fix it, as pointed out in https://github.com/google/cadvisor/blob/master/docs/running.md. I haven't fully tested it yet, but it works fine so far.

xbglowx commented 8 years ago

I had to remove the following volume mounts:

Setup:

xbglowx commented 8 years ago

Upgraded docker to 1.10.3 and now cAdvisor can only see the docker images, but no containers, if I only use volume mounts:

If I add /:/rootfs:ro, cAdvisor can see the containers, but I get "device or resource busy" when trying to remove any container.

vishh commented 8 years ago

@xbglowx Are you using the latest cadvisor release?

xbglowx commented 8 years ago

Using google/cadvisor:v0.22.0

jordic commented 8 years ago

Any ideas or suggestions on how I can dig into the issue?

vishh commented 8 years ago

cc @timstclair

timstclair commented 8 years ago

I was able to reproduce this locally with docker v1.9.1 and cAdvisor 0.22.0, but only right after starting cAdvisor and only once (removing a second container works). I could not reproduce with docker v1.11.

Is this consistent with everyone else's experience?

jordic commented 8 years ago

With docker 1.11.1 the issue is gone. With the latest fixes on the docker side, it seems to be working now.

ashkop commented 8 years ago

I'm still able to reproduce this with docker 1.11.1 and cAdvisor 0.23.0. Ubuntu 14.04.

vishh commented 8 years ago

@ashkop Can you try running cAdvisor with --disable_metrics="tcp,disk" and see if that resolves the issue? Note that you will not get docker container filesystem metrics by adding this flag.

xbglowx commented 8 years ago

If I try using --disable_metrics="tcp,disk" I get the following:

sudo docker run -ti -v /var/lib/docker/:/var/lib/docker:ro -v /var/run:/var/run:rw -v /sys:/sys:ro -v /:/rootfs:ro google/cadvisor --disable_metrics="tcp,disk"
panic: assignment to entry in nil map

goroutine 1 [running]:
panic(0xb0c8c0, 0xc8201c0440)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
main.(*metricSetValue).Set(0x15ac528, 0x7ffe3cea1f59, 0x8, 0x0, 0x0)
    /go/src/github.com/google/cadvisor/cadvisor.go:85 +0x1da
flag.(*FlagSet).parseOne(0xc82004e060, 0xc82005e901, 0x0, 0x0)
    /usr/local/go/src/flag/flag.go:881 +0xdd9
flag.(*FlagSet).Parse(0xc82004e060, 0xc82000a100, 0x2, 0x2, 0x0, 0x0)
    /usr/local/go/src/flag/flag.go:900 +0x6e
flag.Parse()
    /usr/local/go/src/flag/flag.go:928 +0x6f
main.main()
    /go/src/github.com/google/cadvisor/cadvisor.go:99 +0x68

This is with cAdvisor version 0.23.0 (750f18e). Works fine with 0.22.0.

I still need to see if using --disable_metrics="tcp,disk" fixes the problem.

timstclair commented 8 years ago

Yeah, that was fixed in https://github.com/google/cadvisor/pull/1259, but it's not integrated into any release.

ashkop commented 8 years ago

@vishh Unfortunately the flag didn't help. As @xbglowx mentioned, this option causes 0.23.0 to crash, so I tried 0.22.0 and canary. Both still prevent me from removing containers. Here's the error message I get:

Error response from daemon: Unable to remove filesystem for 9e96817fba0a443f75d1426b6d7a586f4bc84217b06eb021f6d28bae4f341473: remove /var/lib/docker/containers/9e96817fba0a443f75d1426b6d7a586f4bc84217b06eb021f6d28bae4f341473/shm: device or resource busy

infiniteproject commented 8 years ago

Same here on Debian 8, Docker 1.11.1 and latest cAdvisor.

vishh commented 8 years ago

@timstclair Can we make a v0.23.1 release with the fix for --disable_metrics flag?

moortimis commented 8 years ago

I am experiencing the same issue with the following versions

"cAdvisor version: 0.23.0-750f18e" google/cadvisor latest 5cda8139955b 8 days ago 48.92 MB

CentOS Linux release 7.2.1511 (Core)
Docker version 1.11.1, build 5604cbe

The workaround was to remove /var/lib/docker from the shared volumes.

rjnagal commented 8 years ago

@vishh Is this fixed if we just stopped tracking disk metrics for these machines? Are there other dependencies?

vishh commented 8 years ago

@rjnagal Disk metrics should be the only dependency. Disabling that by using --disable_metrics=tcp,disk should fix this issue.

rjnagal commented 8 years ago

Can we do that by default when we detect devicemapper?

vishh commented 8 years ago

@rjnagal AFAIK, it is not limited to devicemapper alone. AUFS is also affected. If we need a default solution, we will have to disable per-container disk metrics by default.

ceecko commented 8 years ago

The issue persists in v0.23.1 on CentOS7, Docker 1.10.1, devicemapper

docker run \
  --rm \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor:v0.23.1 \
  -docker_only \
  --disable_metrics="tcp,disk"

ceecko commented 8 years ago

To add more info - the issue persists on v0.23.1 and v0.23.2 on CentOS7, Docker 1.11.1, devicemapper.

However the issue only occurs when cadvisor is run from docker. Running cadvisor directly on CentOS7 works without issues.

timstclair commented 8 years ago

Could you add more details about your repro steps? How many containers are you running, with what options? It would help if we could reproduce from a clean VM centos image.

ashkop commented 8 years ago

I tried to reproduce it on fresh VM, but failed. I'll try to find the difference that is actually causing the issue. Meanwhile I did lsof inside the cadvisor container of the file that is being blocked. Here's what I got:

1   /usr/bin/cadvisor   pipe:[70918923]
1   /usr/bin/cadvisor   pipe:[70918924]
1   /usr/bin/cadvisor   pipe:[70918925]
1   /usr/bin/cadvisor   socket:[70919220]
1   /usr/bin/cadvisor   anon_inode:[eventpoll]
1   /usr/bin/cadvisor   anon_inode:inotify
1   /usr/bin/cadvisor   socket:[70919240]

ashkop commented 8 years ago

I also noticed that issue occurs only if I start cadvisor after my own containers. If cadvisor is the first one started, then I can restart my containers without any issue.

ceecko commented 8 years ago

@ashkop That's actually correct. I tried to reproduce the error, but couldn't. If the other containers are started first, only then cadvisor blocks removal.

ceecko commented 8 years ago

Here's a script to replicate the error on CentOS 7. You will need a machine with an empty block device (just replace the path to the device in DOCKER_DATA_DISK). The script sets up docker with devicemapper through LVM's thin-pool, runs a container, then starts cadvisor, and then stops and removes the first container.

#!/bin/bash

DOCKER_DATA_DISK=/dev/vdb

set -exo pipefail

setenforce Permissive

yum update -y
yum install -y lvm2

systemctl enable lvm2-lvmetad
systemctl start lvm2-lvmetad

pvcreate $DOCKER_DATA_DISK
vgcreate data $DOCKER_DATA_DISK
lvcreate -l 100%free -T data/docker_thin

curl -sSL https://get.docker.com/ | sh

mkdir -p /etc/systemd/system/docker.service.d
cat <<EOF > /etc/systemd/system/docker.service.d/docker-lvm.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// \
    -s devicemapper \
    --storage-opt dm.thinpooldev=/dev/mapper/data-docker_thin

TimeoutStartSec=3000
EOF

systemctl daemon-reload
systemctl enable docker
systemctl start docker

sleep 3

docker run \
    --name=test \
    -d \
    debian:jessie \
    /bin/sh -c "while true; do foo; sleep 1; done"

docker run \
  -d \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --name=cadvisor \
  google/cadvisor:v0.23.1 \
  -docker_only \
  --disable_metrics="tcp,disk"

docker stop test
docker rm test

The output is:

... some data ...

+ docker stop test
test
+ docker rm test
Error response from daemon: Unable to remove filesystem for 7d7513b0c3310f26e7425728f9c34e219db53a5e4dbb6e0e4259c2e6eb760044: remove /var/lib/docker/containers/7d7513b0c3310f26e7425728f9c34e219db53a5e4dbb6e0e4259c2e6eb760044/shm: device or resource busy

amcrn commented 8 years ago

On Ubuntu 14.04, using --disable_metrics="tcp,disk" still does not fix the problem. I've confirmed @ashkop's observation: if cAdvisor is started after another container, then removing that container fails.

theroys commented 8 years ago

To get around this issue I tried running cadvisor standalone. However, it does not get data on RHEL; cadvisor complains "unable to get fs usage from thin pool for device". It seems it can't get the right information about the storage driver. Using RHEL 7.1, cAdvisor version 0.23.3 (6607e7c), docker 1.9.1.

Has anybody tried something similar?

srstsavage commented 8 years ago

This issue is hitting us often and affecting production container deployments (Debian 8.5 hosts, Docker 1.11.1).

Can anyone spell out what we lose by omitting the /:/rootfs:ro mount? Is it just disk usage metrics?

vishh commented 8 years ago

AFAIK, it should be just the disk usage metrics
