lxc / lxcfs

FUSE filesystem for LXC
https://linuxcontainers.org/lxcfs

"ps" show 0.0 of %cpu for a busy process in container #444

Open borgerli opened 3 years ago

borgerli commented 3 years ago

I'm using lxcfs 4.0.7. I created a container with the lxcfs proc files mounted, then kicked off a busy process: `while true; do echo test > /dev/null; done`. In the container, the top command showed the correct %cpu information, while ps always showed 0.0. However, when not using lxcfs, ps worked fine.

Steps

start lxcfs

/usr/local/bin/lxcfs -l --enable-cfs --enable-pidfd /var/lib/lxc/lxcfs

start docker container

docker run -it -m 128m --cpus=1 --rm \
  -v /var/lib/lxc/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
  -v /var/lib/lxc/lxcfs/proc/diskstats:/proc/diskstats:rw \
  -v /var/lib/lxc/lxcfs/proc/meminfo:/proc/meminfo:rw \
  -v /var/lib/lxc/lxcfs/proc/stat:/proc/stat:rw \
  -v /var/lib/lxc/lxcfs/proc/swaps:/proc/swaps:rw \
  -v /var/lib/lxc/lxcfs/proc/loadavg:/proc/loadavg:rw \
  -v /var/lib/lxc/lxcfs/proc/uptime:/proc/uptime:rw \
  -v /var/lib/lxc/lxcfs/sys/devices/system/cpu/online:/sys/devices/system/cpu/online:rw \
  -v /var/lib/lxc:/var/lib/lxc:rshared \
  centos:7 /bin/bash

test

top shows 100.0, while ps shows 0.0 for process 16

[root@af61796cf0ed /]# while true; do echo test > /dev/null;done &
[1] 16
[root@af61796cf0ed /]# top -b -n 1
top - 03:06:17 up 0 min,  0 users,  load average: 0.00, 0.00, 0.00
Tasks:   3 total,   2 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.0 us,  0.0 sy,  0.0 ni, 88.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   131072 total,   127272 free,     3800 used,        0 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   127272 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   16 root      20   0   11840    396      0 R 100.0  0.3   0:03.40 bash
    1 root      20   0   11840   2984   2588 S   0.0  2.3   0:00.04 bash
   17 root      20   0   56064   3696   3248 R   0.0  2.8   0:00.00 top
[root@af61796cf0ed /]# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  2.2  11840  2984 pts/0    Ss   03:05   0:00 /bin/bash
root        16  0.0  0.3  11840   396 pts/0    R    03:06   0:09 /bin/bash
root        18  0.0  2.6  51744  3416 pts/0    R+   03:06   0:00 ps aux

test without lxcfs

top shows 100.0, and ps shows 102

root@dev:~# docker run -it -m 128m --cpus=1 centos:7 /bin/bash
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[root@0d6cb011e598 /]# while true; do echo test > /dev/null;done &
[1] 16
[root@0d6cb011e598 /]# top -b -n 1
top - 03:09:29 up 7 days, 15:12,  0 users,  load average: 2.07, 1.79, 1.42
Tasks:   3 total,   2 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.4 us, 10.6 sy,  0.0 ni, 73.2 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 16260516 total,  1436644 free,  1152404 used, 13671468 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14788144 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   16 root      20   0   11840    396      0 R 100.0  0.0   0:05.02 bash
    1 root      20   0   11840   2912   2516 S   0.0  0.0   0:00.03 bash
   17 root      20   0   56064   3656   3212 R   0.0  0.0   0:00.00 top
[root@0d6cb011e598 /]# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0  11840  2912 pts/0    Ss   03:09   0:00 /bin/bash
root        16  102  0.0  11840   396 pts/0    R    03:09   0:07 /bin/bash
root        18  0.0  0.0  51744  3392 pts/0    R+   03:09   0:00 ps aux
[root@0d6cb011e598 /]#
brauner commented 3 years ago

LXCFS virtualizes cpu utilization according to the cgroup the target process is in. If it's not using a lot of cpu then you won't see anything. Try creating some load, e.g. by calling stress with the cpu option inside the container, and you should see an increase.
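For context, the per-cgroup cpu utilization being virtualized here can be observed directly: on cgroup v1 (as used in these tests), the cpuacct controller exposes cumulative CPU time in nanoseconds via cpuacct.usage. A rough standalone sketch, not lxcfs code, assuming the controller is mounted at the usual /sys/fs/cgroup/cpuacct path:

```c
#include <stdio.h>
#include <unistd.h>

/* Rough illustration (not lxcfs code): cgroup v1's cpuacct controller
 * reports cumulative CPU time for all tasks in a cgroup, in nanoseconds.
 * Sampling it twice gives the utilization that cgroup-based
 * virtualization works from. */
static unsigned long long read_usage(void)
{
    unsigned long long ns = 0;
    FILE *f = fopen("/sys/fs/cgroup/cpuacct/cpuacct.usage", "r");
    if (f) {
        if (fscanf(f, "%llu", &ns) != 1)
            ns = 0;
        fclose(f);
    }
    return ns;
}

int main(void)
{
    unsigned long long a = read_usage();
    sleep(1);
    unsigned long long b = read_usage();
    /* 1e9 ns of CPU per second of wall time == 100% of one CPU. */
    printf("cgroup CPU utilization: %.1f%%\n", (b - a) / 1e9 * 100.0);
    return 0;
}
```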

borgerli commented 3 years ago

@brauner Thanks for the comment.

Actually, we did run a process that uses lots of cpu (`while true; do echo test > /dev/null; done &`). As you suggested, I also tested with stress and got the same result: top showed ~100 %cpu, but ps still showed 0.0.

  1. start lxcfs: `/usr/local/bin/lxcfs -l --enable-cfs --enable-pidfd /var/lib/lxc/lxcfs`
  2. start stress container
    docker run -it --name stress -m 128m --cpus=1 --rm \
    -v /var/lib/lxc/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
    -v /var/lib/lxc/lxcfs/proc/diskstats:/proc/diskstats:rw \
    -v /var/lib/lxc/lxcfs/proc/meminfo:/proc/meminfo:rw \
    -v /var/lib/lxc/lxcfs/proc/stat:/proc/stat:rw \
    -v /var/lib/lxc/lxcfs/proc/swaps:/proc/swaps:rw \
    -v /var/lib/lxc/lxcfs/proc/loadavg:/proc/loadavg:rw \
    -v /var/lib/lxc/lxcfs/proc/uptime:/proc/uptime:rw \
    -v /var/lib/lxc/lxcfs/sys/devices/system/cpu/online:/sys/devices/system/cpu/online:rw \
    progrium/stress --cpu 1
  3. get into the container and verify %cpu with top (99.6) and ps (0.0)

    
    root@borgerli-devcloud:~# docker exec -it $(docker inspect stress -f "{{.Id}}") /bin/bash
    root@33bc005fa2d5:/# top -b -n 1
    top - 02:34:42 up 4 min,  0 users,  load average: 0.28, 0.07, 0.02
    Tasks:   4 total,   2 running,   2 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 98.5 us,  0.0 sy,  0.0 ni,  1.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:    131072 total,     3660 used,   127412 free,        0 buffers
    KiB Swap:        0 total,        0 used,        0 free.        4 cached Mem
    
    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    7 root      20   0    7316     96      0 R 99.6  0.1   4:03.82 stress
    1 root      20   0    7316    896    812 S  0.0  0.7   0:00.02 stress
    28 root      20   0   18164   3300   2828 S  0.0  2.5   0:00.02 bash
    36 root      20   0   19748   2372   2124 R  0.0  1.8   0:00.00 top
    root@33bc005fa2d5:/# ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         1  0.0  0.6   7316   896 pts/0    Ss+  02:30   0:00 /usr/bin/stress --verbose --
    root         7  0.0  0.0   7316    96 pts/0    R+   02:30   4:10 /usr/bin/stress --verbose --
    root        28  0.0  2.5  18164  3300 pts/1    Ss   02:34   0:00 /bin/bash
    root        37  0.0  1.5  15576  2064 pts/1    R+   02:34   0:00 ps aux

![screenshot](https://raw.githubusercontent.com/borgerli/lxcfs-admission-webhook/master/lxc_ps_top.png)
brauner commented 3 years ago

Odd. What happens if you turn off cpu shares, i.e. skip --enable-cfs?

borgerli commented 3 years ago

@brauner I checked the procps code related to pcpu and found the cause of this issue.

As shown in the procps code below, when the lxcfs uptime is mounted in a container, seconds_since_boot (measured since the container started) will always be less than the process start_time (measured since the host booted). As a result, seconds is always zero, and so pcpu is also zero.

https://gitlab.com/procps-ng/procps/-/blob/master/ps/output.c#L525:

  seconds = cook_etime(pp);
  if(seconds) pcpu = (total_time * 1000ULL / Hertz) / seconds;

https://gitlab.com/procps-ng/procps/-/blob/master/ps/output.c#L136:

#define cook_etime(P) (((unsigned long long)seconds_since_boot >= (P->start_time / Hertz)) ? ((unsigned long long)seconds_since_boot - (P->start_time / Hertz)) : 0)
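To make the failure concrete, here is a small standalone program (my own sketch with made-up but representative numbers, not procps code) that replays the arithmetic above: the host-relative start_time dwarfs the container-relative seconds_since_boot, cook_etime clamps the elapsed time to zero, and pcpu is never computed.

```c
#include <stdio.h>

/* Standalone replay of the procps pcpu arithmetic, with hypothetical
 * values: the host has been up ~7.6 days, the container was started
 * 60 s ago, and its busy loop has burned 60 s of CPU. */
int main(void)
{
    unsigned long long Hertz = 100;             /* clock ticks per second */
    unsigned long long seconds_since_boot = 60; /* lxcfs /proc/uptime: container-relative */
    unsigned long long start_time = 660000 * Hertz; /* /proc/<pid>/stat: host-relative */
    unsigned long long total_time = 6000;       /* utime+stime: 60 s in ticks */

    /* cook_etime: start_time / Hertz (660000) exceeds seconds_since_boot
     * (60), so the elapsed time is clamped to 0. */
    unsigned long long seconds =
        (seconds_since_boot >= start_time / Hertz)
            ? seconds_since_boot - start_time / Hertz
            : 0;

    unsigned long long pcpu = 0;
    if (seconds)
        pcpu = (total_time * 1000ULL / Hertz) / seconds;

    /* Prints "seconds=0 pcpu=0.0" -- the exact symptom in this issue.
     * With a host-relative uptime of 660060, seconds would be 60 and
     * pcpu would come out as 1000, i.e. 100.0. */
    printf("seconds=%llu pcpu=%llu.%llu\n", seconds, pcpu / 10, pcpu % 10);
    return 0;
}
```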

A workaround is not to mount the lxcfs proc/uptime into containers, but then containers lose uptime virtualization.

Is it possible for lxcfs to just return the host uptime when the calling process's comm is `ps`?

borgerli commented 3 years ago

@brauner I submitted a PR for this issue; please review. Thank you.

PR #445

borgerli commented 3 years ago

@brauner Could you please help review the PR?

mihalicyn commented 7 months ago

Hi @borgerli

Sorry for the long delay in responding from our side. We are working on sorting out the open issues right now.

I have read through your PR and understand the idea. But the question is whether we can, instead of adding hacks to LXCFS, fix the procps utilities not to use the uptime value to calculate CPU load, adjusting the algorithm to be similar to what we have in the top utility.
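For illustration, a minimal sketch of that delta-based approach (hypothetical code, not taken from top or procps): sample utime+stime from /proc/&lt;pid&gt;/stat twice and divide the difference by the sampling interval, so neither uptime nor start_time enters the calculation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Read cumulative utime+stime (clock ticks) for a pid from /proc/<pid>/stat. */
static unsigned long long cpu_ticks(pid_t pid)
{
    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* comm (field 2) is parenthesized and may contain spaces; skip past it. */
    char *p = strrchr(buf, ')');
    if (!p)
        return 0;
    unsigned long utime = 0, stime = 0;
    /* After ')': state, 5 ints, flags, 4 fault counters, then utime, stime. */
    sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
           &utime, &stime);
    return (unsigned long long)utime + stime;
}

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
    long hertz = sysconf(_SC_CLK_TCK);
    unsigned long long t0 = cpu_ticks(pid);
    sleep(1); /* sampling interval, like top's refresh delay */
    unsigned long long t1 = cpu_ticks(pid);
    /* %CPU over the interval; neither uptime nor start_time is involved. */
    printf("%%CPU: %.1f\n", 100.0 * (double)(t1 - t0) / (double)hertz);
    return 0;
}
```

This is essentially why top shows 100.0 in the transcripts above while ps shows 0.0: top works from samples over its refresh interval, while ps computes a lifetime average anchored to uptime.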

mihalicyn commented 7 months ago

cc @stgraber

stgraber commented 7 months ago

Yeah, returning different output based on command name definitely isn't something I'd want us to do. It's way too hacky and will lead to an undebuggable mess.

Tweaking userspace to be a bit smarter would definitely be easier in this case, especially as there's no way for us to virtualize those per-process files.

Once we get @mihalicyn's work to have lxcfs features per container, you'd also get the ability to turn off the uptime virtualization where it remains problematic.