google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Runsc creates unreasonably large number of golang worker threads, eventually hitting any pids cgroup limit #3942

Open majek opened 4 years ago

majek commented 4 years ago

Description

Runsc/gvisor leaves old threads in a stuck state. This leads to exceeding any sane pids cgroup quota, and is generally bad.

Steps to reproduce

  1. Gvisor version

    $ strings /usr/local/bin/runsc|grep release-
    release-20200907.0-44-g7e0b019b84b9
  2. Run it as a Docker container

    $ docker run --runtime=runsc --rm -it ubuntu /bin/bash
    root@329dded2d7e7:/# 
  3. Inspect the cgroup

    $ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
    70
  4. Launch a command inside docker

    root@329dded2d7e7:/# ls
  5. Check the cgroup again

    $ cat /sys/fs/cgroup/pids/docker/329dded2d7e7*/tasks|wc -l
    72

The number of threads should not grow; old threads should be reaped. Specifically, you will notice that the number of threads in the "tracing stop" state keeps growing:

$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
     46 State:  S (sleeping)
     41 State:  t (tracing stop)
$ for i in `cat ./tasks`; do cat /proc/$i/status; echo; done|grep State|sort|uniq -c
     46 State:  S (sleeping)
     44 State:  t (tracing stop)

This is an example kernel stack:

[<0>] ptrace_stop+0x13f/0x2d0
[<0>] get_signal+0x6b5/0xd50
[<0>] do_signal+0x34/0x900
[<0>] __prepare_exit_to_usermode+0x125/0x200
[<0>] prepare_exit_to_usermode+0x9/0x30
[<0>] idtentry_exit_user+0x40/0x60
[<0>] exc_int3+0x74/0x220
[<0>] asm_exc_int3+0x31/0x40

Environment

Please include the following details of your environment:

ianlewis commented 4 years ago

FWIW, it shouldn't be runsc -v, it should be runsc -version. I'll update the issue template.

avagin commented 4 years ago

You use the ptrace platform. This is a known issue. Each time the Sentry resumes a guest process, it checks whether the current system thread of the Sentry process is already tracing the guest process; if it isn't, it creates a new guest thread and starts tracing it: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=88?q=lookupOrCreate&ss=gvisor%2Fgvisor

So the maximum number of guest PIDs is the product of the number of Sentry threads and the number of guest processes.
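
In (pseudo-)Go, the bookkeeping is roughly the following. This is a simplified sketch with illustrative names, not the real gVisor code:

// Simplified sketch of the per-subprocess stub bookkeeping described above;
// the names (threadPool, stub) are illustrative, not the actual gVisor types.
package main

import (
	"fmt"
	"sync"
)

// stub stands in for a traced host thread created for the guest.
type stub struct{ hostTID int32 }

// threadPool maps the Sentry system-thread TID to the stub it traces.
// Because ptrace ties a tracee to a single tracer thread, a stub created
// by Sentry thread A cannot be reused from Sentry thread B.
type threadPool struct {
	mu    sync.Mutex
	stubs map[int32]*stub
}

func (p *threadPool) lookupOrCreate(currentTID int32, create func() *stub) *stub {
	p.mu.Lock()
	defer p.mu.Unlock()
	if s, ok := p.stubs[currentTID]; ok {
		return s
	}
	s := create()
	p.stubs[currentTID] = s
	return s
}

func main() {
	pool := &threadPool{stubs: make(map[int32]*stub)}
	// Resuming the same guest process from three different Sentry threads
	// produces three stubs; with S Sentry threads and P guest processes the
	// worst case is on the order of S*P stubs.
	for _, sentryTID := range []int32{101, 102, 103} {
		pool.lookupOrCreate(sentryTID, func() *stub { return &stub{hostTID: sentryTID} })
	}
	fmt.Println("stubs for this guest process:", len(pool.stubs))
}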

avagin commented 4 years ago

[screenshot from the talk omitted] https://www.infoq.com/presentations/gvisor-os-go/

majek commented 4 years ago

@avagin said:

So the maximum number of guest PIDs is the product of the number of Sentry threads and the number of guest processes.

I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and if all of them exit, why do I end up with > 2000 threads? If I start another 2000 processes should I expect 4000 threads?

I don't mind the number of threads being proportional to the number of currently running guest threads. I do mind the number of threads being proportional to the number of processes ever created during the lifetime of the container. Is that the case?

avagin commented 4 years ago

I don't follow. If I start 2000 processes in the container, one after another, not concurrently, and if all of them exit, why do I end up with > 2000 threads? If I start another 2000 processes should I expect 4000 threads?

This is strange. I tried to reproduce your case, but I failed. I have only 119 pids in a container cgroup.

T1: $ docker run --device /dev/fuse --runtime runsc --privileged -it --rm b4c85473e5d0 bash

T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
48

T1: root@8da2bd97657b:/fuse-test# for i in `seq 2000`; do sh -c true; done

T2: $ cat /sys/fs/cgroup/pids//docker/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/tasks | wc -l
119

T2: $ ps axf

   3517 ?        Ssl    4:12 /usr/bin/containerd
3463297 ?        Sl     0:00  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtim
3463312 ?        Sl     0:00      \_ runsc-gofer --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/logs/
3463316 pts/0    Ssl+   0:37      \_ runsc-sandbox --root=/var/run/docker/runtime-runsc/moby --debug=true --log=/run/containerd/io.containerd.runtime.v1.linux/moby/8da2bd97657b6e55f1ffe0f2c35e121ce6ed21596c2ca064f4b3ec88c4972775/log.json --log-format=json --debug-log=/tmp/runsc/log
3463332 ?        tsl    0:00          \_ [exe]
3463392 ?        tl     0:05              \_ [exe]
3463611 ?        tl     0:06              \_ [exe]
3463617 ?        tl     0:06              \_ [exe]
avagin@avagin:~$ ls -l /proc/3463617/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463611/task/ | wc -l
22
avagin@avagin:~$ ls -l /proc/3463392/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463332/task/ | wc -l
6
avagin@avagin:~$ ls -l /proc/3463316/task/ | wc -l
27
avagin@avagin:~$ ls -l /proc/3463312/task/ | wc -l
24
avagin@avagin:~$ ls -l /proc/3463297/task/ | wc -l
11
majek commented 4 years ago

Yeah, so you are saying the number of gVisor threads is capped. When I filed this ticket I thought it was not capped.

What happens in my case is: I set a pids.max limit in the cgroup and of course gVisor eventually hits it. I'm now trying to trace just what the limit is, and under which gVisor version it's reached the fastest. I think we had cases where even a modest limit (say 10k pids) was reached with a guest program running just a few dozen threads. I wonder whether CPU pinning (removed a few days ago in f3172c3a11dda8aa21bbaf51b7f63196592cf1da; I often work on 96-logical-CPU machines, so maybe that has some side effect) or hitting the CFS CPU quota could have implications for internal gVisor thread reuse.

The problem I want to solve: I want the guest to have a limited number of pids. I want to make sure the OS is not going to see the effects of a fork bomb inside the container. I'm trying to solve it with the cgroup pids.max. Somewhat related: https://github.com/google/gvisor/issues/2489
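
Concretely, the operator-side setup I have in mind is roughly this (a Go sketch; cgroup v1 layout, illustrative paths):

// Sketch of the operator-side mitigation discussed here: cap the whole runsc
// container, Sentry threads included, via the pids controller (cgroup v1).
package main

import (
	"os"
	"path/filepath"
)

// setPidsMax writes a task limit into the container's pids cgroup.
func setPidsMax(containerID, max string) error {
	// e.g. /sys/fs/cgroup/pids/docker/<full-container-id>/pids.max
	path := filepath.Join("/sys/fs/cgroup/pids/docker", containerID, "pids.max")
	return os.WriteFile(path, []byte(max), 0644)
}

func main() {
	if len(os.Args) < 2 {
		panic("usage: setpids <full-container-id>")
	}
	// 512 is the kind of per-guest task budget I would like to enforce.
	if err := setPidsMax(os.Args[1], "512"); err != nil {
		panic(err)
	}
}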

majek commented 4 years ago

OK, I think I have an example. Here is the scenario:

$ /usr/local/bin/runsc -version
runsc version release-20200907.0-44-g7e0b019b84b9

$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@46a0985cd38c:/# while [ 1 ]; do ls > /dev/null; done

So here you have it. Only a couple of processes running concurrently (bash, ls, that's it?). If you count the tasks, it eventually stabilizes at 265. Well, a lot, but I won't despair over it:

$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
265

But now, when I repin them all to CPU 0, the number jumps to 616.

$ sudo bash -c '(for i in `cat /sys/fs/cgroup/pids/docker/46a0*/tasks`; do taskset -c -p 0 $i; done)'
$ cat /sys/fs/cgroup/pids/docker/46a0*/tasks|wc -l
616

This goes into insane territory. Do we really need 562 idle OS threads to handle one running ls process? The machine has 48 logical CPUs and I think the CPU count makes a difference. On another box with 96 logical CPUs the number jumped from 241 to 496 on repin.

Another example:

$ docker run --runtime=runsc --rm -it ubuntu /bin/bash
root@0d3cf50df413:/# apt update; apt -y install psmisc
root@0d3cf50df413:/# for j in `seq 20`; do for i in `seq 20` ; do (cat /dev/urandom > /dev/null & ); done; sleep 5; killall cat; done
...
$ cat /sys/fs/cgroup/pids/docker/0d3*/tasks|wc -l
2250
majek commented 4 years ago

Citing @prattmic from chat:

The tracing stop threads are unused stub threads that are eligible to run user code (stub threads within the same process are eligible only to run user code for user threads in the same process, or a completely unused stub process is eligible to be used for a new user process). We don't currently explicitly try to hard-cap the number of stubs we may create (this is something on the roadmap). However, we do try to reuse aggressively, so a well-behaved application (like ls in a loop in your example) shouldn't be creating such a huge number of threads. I believe that is a bug.

majek commented 4 years ago

If I read it right, it grows as a product: number-of-guest-pids * number-of-host-golang-threads.

The host Go runtime can have a pretty much unlimited number of threads (I think 10k is the default max). Then, when we context-switch to the guest, we take the host thread id, like here: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/ptrace/subprocess.go;l=503?q=lookupOrCreate&ss=gvisor%2Fgvisor

func (s *subprocess) switchToApp(c *context, ac arch.Context) bool {
    currentTID := int32(procid.Current())
    t := s.sysemuThreads.lookupOrCreate(currentTID, s.newThread)

Notice, AFAIU procid.Current() is the TID of the host-side thread. Then we look up the guest thread counterpart, which would have been fine, but s.sysemuThreads.lookupOrCreate is scoped per subprocess, i.e. per guest thread.

So in the worst case each host thread will have at least one (at most two) threads per guest thread. Did I read it right?

avagin commented 4 years ago

So in worst case each host thread will have at least one (at most two) threads per each guest thread. Did I read it right?

This isn't quite right. First, threadPool is per-process, not per-thread, so in the worst case each host thread will have one thread per guest PROCESS.

I don't understand when each host thread can have TWO threads per guest thread.

majek commented 4 years ago

(1) You are right, let me revise what I said:

So in the worst case each host thread will have at least one (at most two) threads per guest process.

So, for a guest with, say, 100 processes and a host with, say, 100 Go runtime threads, we could see up to 20k worker threads on the gVisor side.

(2) One thread in the s.sysemuThreads threadPool and another in the s.syscallThreads pool. Or did I read it wrong?

avagin commented 4 years ago

(2) One thread in the s.sysemuThreads threadPool and another in the s.syscallThreads pool. Or did I read it wrong?

You are right. I forgot that we have a separate pool for host syscalls.

avagin commented 4 years ago

I see you changed the subject line. I think "unreasonably" isn't a good word in this context; you need to understand why it works this way. The kernel ptrace API works at the thread level, and a thread can be traced by only one other thread. The Go runtime creates a number of system threads, and each goroutine can run in the context of any of these system threads. This means that if subprocess.Syscall is called from system thread A, creates a stub thread Z, starts tracing it, and executes a system call in its context, then when subprocess.Syscall is later called from system thread B, it will not be able to reuse stub thread Z, because Z is already traced by system thread A.

So you can see the reason why the Sentry needs to create many threads. You could try to fix this by pinning goroutines to system threads, but I am not sure this would not hurt performance. Or you can use the kvm platform, which doesn't have this problem.
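
You can demonstrate the constraint outside of gVisor with a small experiment like the one below (an illustration using golang.org/x/sys/unix, not gVisor code). The attach and the first wait happen on one locked OS thread; a ptrace request issued from any other thread is rejected, typically with ESRCH:

// Illustration (not gVisor code) of the ptrace constraint: requests for a
// tracee must come from the OS thread that attached to it.
package main

import (
	"fmt"
	"os/exec"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	cmd := exec.Command("sleep", "30")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	pid := cmd.Process.Pid

	attached := make(chan struct{})
	go func() {
		runtime.LockOSThread() // thread A: all ptrace calls for pid must come from here
		if err := unix.PtraceAttach(pid); err != nil {
			panic(err)
		}
		var ws unix.WaitStatus
		unix.Wait4(pid, &ws, 0, nil) // wait for the tracee to enter ptrace-stop
		close(attached)
		select {} // keep thread A alive so the tracee stays attached to it
	}()

	<-attached
	// This goroutine runs on a different OS thread (thread A is wired to the
	// goroutine above), so the kernel rejects the request.
	fmt.Println("detach from another thread:", unix.PtraceDetach(pid))

	cmd.Process.Kill()
}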

avagin commented 4 years ago

Here is an example of how we can reduce the number of subprocess.syscallThreads: https://github.com/avagin/gvisor/commit/5254cf371696682c4eae46fec119f394bdc31400

The same approach can be implemented for the sysemuThreads pool, but this will be a bit more complicated.

majek commented 4 years ago

I will push back. gVisor is a security isolation layer, and a fork bomb is a common problem. Right now the only way to handle a fork bomb is to put gVisor into a pids cgroup and set pids.max to a reasonable value that will not overwhelm the system. The fact that runsc needs a quadratic number of OS threads makes this approach infeasible.

There are two direct questions:

avagin commented 4 years ago

First of all, you should not use the ptrace platform in this case; it was not designed to cover it. You can use the kvm platform and limit the number of Go runtime threads with the cgroup pids.max.

The world isn't just black or white. I already explained that the ptrace kernel interface has limitations: any thread can be traced by only one other thread. For the ptrace platform, this means that we have two choices:

The ptrace platform was designed for testing and quick experiments. It works fine for workloads with a small number of tasks. For testing needs, its behavior should be close to the other gVisor platforms; this is probably the reason the variant with pinned task goroutines has not been implemented.

@majek do you have a real use case where you need the ptrace platform? Maybe you can describe it so we can better understand your problem and the ways to solve it. I implemented a POC version of the first variant with pinned goroutines (https://github.com/google/gvisor/pull/4487/commits/fd14898a87b9bd2395147201ed89bfb1fdd5ea7f); you can try playing with it.

majek commented 4 years ago

I must be missing something larger here. KVM was experimental until very recently. People built products on ptrace. ARM KVM support is not finished (https://github.com/google/gvisor/issues/63). Are you saying that ptrace is not production-ready? Did I miss some big platform discussion in the documentation?

amscanne commented 4 years ago

Ptrace is solid, notwithstanding the noted requirement that a containing cgroup is used appropriately. (But the sandbox should be contained in appropriate CPU, memory, etc. cgroups regardless, to prevent exhausting other kinds of resources.) I think Andrei is just saying that ptrace may not scale well for every case (e.g. a massively multithreaded web application may not play well with the pid semantics).

But let's get back on track. First, I totally agree with the fork bomb complaint. Even with a proper pid limit, it will generally manifest as the Go runtime failing to create a new thread internally and dying. This is not acceptable and a flaw worth fixing. I'm not sure that the existence of many OS-level threads is an actual concern, but the ergonomics are definitely bad.

There are two tracks to pursue here. First, @avagin is actually working on a new platform designed to replace ptrace (runs anywhere) that is faster and doesn't suffer from this problem, so you should connect with him. It would be great to understand your use case to help ensure that the platform is addressing the core problems. Maybe a community meeting?

Second, that might take a little while, so we should improve things for ptrace in the meantime. But it's unclear to me: do you want the n^2 to be limited to n (which can be accomplished with the pinning, which we could integrate for ptrace at least), or should it actually be limited to a fixed number (which would require some kind of Switch() admission control... also doable)? I think this is why Andrei is asking about the use case: to understand the desired state.

majek commented 4 years ago

But it's unclear to me: do you want the n^2 to be limited to n ... asking about the use case

I'm running untrusted workloads. Many of them. As an operator I would like to guarantee that a single workload won't be too large a burden on the kernel scheduler, the cgroups machinery, etc. Therefore I am limiting the runsc containers with cgroups. I am successfully using the memory cgroup (although the OOM crashes are awful) and the cpu cgroup. I attempted to run runsc under the pids cgroup, but that failed since a substantial share of the tasks in use comes from runsc internals. I can't put runsc and its children under a pids cgroup because coming up with any number for pids.max is wrong. I want to express something like: the untrusted guest should not be able to run more than, say, 512 tasks (threads or processes). gVisor right now does not have functionality for that, and because of the ptrace architectural limitations I can't technically put runsc and its children in a pids cgroup. Please either:

avagin commented 4 years ago

https://github.com/google/gvisor/commits/test/cl336693030

@majek could you try out this change? This is an implementation of the second variant. With it, the ptrace platform creates two threads (tracee and tracer) for each guest thread, and two extra threads for each guest process to execute system calls.
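
For example (my arithmetic, not a measurement): a workload with 10 guest processes and 100 guest threads in total would need roughly 2*100 + 2*10 = 220 stub threads, regardless of how many system threads the Go runtime has created.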

As for the first variant, you can easily implement it by limiting the number of pids in the root pid namespace.

majek commented 3 years ago

@avagin This change is awesome. The runsc thread count falls off nicely after guest processes exit. On my box a new process in the guest (cat) creates 8-9 Go threads, which seems predictable and fine. I have not run benchmarks to assess the performance difference for multi-threaded programs.

majek commented 3 years ago

Benchmark time! Unless I did something wrong with my benchmark, the patched branch seems 2x slower than master. Code is here: https://gist.github.com/majek/6ffba6d70b56a95eb36258be11955b97

It creates 100 threads, and each does mmap/munmap 20k times. I chose mmap because AFAIU it often goes to the host kernel - this syscall is often (always?) not terminated at the gVisor level.
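
The benchmark is roughly equivalent to this Go sketch (a reconstruction of what the gist does, not the gist itself):

// Rough Go equivalent of the benchmark described above: 100 threads,
// each performing 20k anonymous mmap/munmap pairs.
package main

import (
	"runtime"
	"sync"

	"golang.org/x/sys/unix"
)

func main() {
	const (
		threads    = 100
		iterations = 20000
		length     = 4096
	)
	var wg sync.WaitGroup
	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread() // force a dedicated OS thread, like pthreads would
			for j := 0; j < iterations; j++ {
				mem, err := unix.Mmap(-1, 0, length,
					unix.PROT_READ|unix.PROT_WRITE,
					unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
				if err != nil {
					panic(err)
				}
				if err := unix.Munmap(mem); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
}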

runsc from master:

$ runsc --version
runsc version release-20201019.0-33-g293877cf647a
spec: 1.0.1-dev

$ sudo docker run --runtime=runsc --hostname=vanilla --rm -it xxx 
root@vanilla:/# time /root/thread 

real    2m0.686s
user    1m1.300s
sys 199m42.040s

Patched runsc

$ runsc-patched --version
runsc version release-20201019.0-2-g9e95a7463960
spec: 1.0.1-dev

$ sudo docker run --runtime=runsc-patched --hostname=patched --rm -it octane /bin/bash
root@patched:/# time /root/thread 

real    4m41.731s
user    1m28.920s
sys 467m0.840s

The sys time looks out of the blue, but the "real" time is indeed 2x worse with the patched runsc. The box is a 4-core (8 logical CPU) i7-7700HQ running Ubuntu on a 5.2 kernel.

The benchmark can probably be simplified; I haven't tried one thread.

lnsp commented 3 years ago

OK, a bit more data from my machine if we run the same benchmark with 1 thread instead of 100.

runsc$ time ./main
real    0m4.986s
user    0m3.650s
sys     0m1.350s
runsc-avagin$ time ./main
real    0m8.511s
user    0m1.440s
sys     0m7.080s

So about 1.7x worse performance.

(with 4 core Debian VM (Linux 5.8) on Intel i7-8559U @ 2.70GHz)

avagin commented 3 years ago

So about 1.7x worse performance.

I think I fixed this issue. Could you check it again? Thanks!

lnsp commented 3 years ago

This is very interesting, I get around

I ran them on a different machine this time (3770K, Linux 5.9) since the benchmarks inside the VM were very flaky. Here's a gist of the measurements.

majek commented 3 years ago

Repeating my tests from above:

$ runsc-patched --version
runsc version release-20201019.0-70-g8176f6431ec8
spec: 1.0.1-dev

root@patched:/# time /root/thread

real    3m30.277s
user    1m29.110s
sys 347m4.160s

This looks like an improvement over the previous version, but it is still a perf hit compared to vanilla.

avagin commented 3 years ago

I warned you that there are reasons to fork this unreasonably large number of threads :). My change pins task goroutines to system threads, so each switch between task goroutines requires a system context switch.

There is another idea for how to fix this issue: we can attach to/detach from a stub thread each time we need to execute something in its context.
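
Roughly like this (just a sketch of the idea, not a patch; error handling and the actual work in the stub's context are elided):

// Sketch of the attach/detach-on-demand idea: instead of keeping a stub
// permanently traced by one Sentry thread, attach only for the duration of
// the operation and detach afterwards, so any Sentry thread can reuse it.
package stubdemo

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// withStub runs op while the calling OS thread temporarily traces stubTID.
func withStub(stubTID int, op func() error) error {
	runtime.LockOSThread() // ptrace requests must all come from this OS thread
	defer runtime.UnlockOSThread()

	if err := unix.PtraceAttach(stubTID); err != nil {
		return err
	}
	var ws unix.WaitStatus
	if _, err := unix.Wait4(stubTID, &ws, unix.WALL, nil); err != nil { // wait for ptrace-stop
		unix.PtraceDetach(stubTID)
		return err
	}
	defer unix.PtraceDetach(stubTID) // release the stub for other Sentry threads

	return op() // the real work (PTRACE_SYSEMU, register setup, ...) would go here
}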

dqminh commented 3 years ago

@avagin @amscanne I would love to know more about the new platform to replace ptrace. One of the things I have been struggling with is that even with kvm (which performs much better than ptrace), we still spend a lot of non-application CPU time, which seems quite wasteful.

For example, here is a container that runs Chromium inside gVisor [screenshot of CPU usage omitted].

As you can see, system CPU is much higher than user CPU; I think we would prefer the opposite. A 10s kvm stat record shows:

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

                 HLT      41114    43.61%    98.38%     15.52us 1638547.94us   1413.78us ( +-   5.62% )
       EPT_VIOLATION      30681    32.55%     0.63%      1.48us   5291.40us     12.04us ( +-   2.49% )
  EXTERNAL_INTERRUPT      16379    17.37%     0.95%      1.01us   9098.23us     34.13us ( +-   7.59% )
       EXCEPTION_NMI       5946     6.31%     0.04%      1.07us   7117.78us      4.44us ( +-  29.40% )
               CPUID         66     0.07%     0.01%      1.28us   3052.47us     48.27us ( +-  95.75% )
   PENDING_INTERRUPT         56     0.06%     0.00%      1.55us     11.02us      2.74us ( +-   9.36% )
   PAUSE_INSTRUCTION         30     0.03%     0.00%      3.20us      8.13us      4.97us ( +-   3.82% )

Total Samples:94272, Total events handled time:59084468.87us.

I think HLT (and VM exits in general) is very expensive for the Sentry, but I have no idea how we can optimize that. Also, what is that EPT_VIOLATION, and when does it happen?

amscanne commented 3 years ago

Interesting, what is generating that table? I suspect that this is "time to service the exit", i.e. the time between the exit and a subsequent KVM_RUN. In this case, the high time for HLT is not surprising: given the way the kvm platform works, the Sentry may not run that vCPU again for some time.

The other times seem crazy high though. EPT_VIOLATION should be the equivalent of get_user_pages, and it's weighing in at 12us average? That's a lot. Similar story for EXTERNAL_INTERRUPT and CPUID; both of those times are large compared to what I would expect. Is there something else going on here? Is this running nested?

The user time vs system time is also very interesting -- the system time would not be accounted for in the table there (but it does align with the huge times we see). Is there a perf top10 or something that says where the system time is going?

dqminh commented 3 years ago

@amscanne Apologies, I ran the test on a test host that has KASAN enabled, so the numbers are very much inflated. We can discard the previous report. Here are some numbers that we captured on a production node.

[screenshot omitted]

The report is generated with:

$ sudo timeout 10 perf kvm stat record -p $PID
$ sudo perf kvm stat report

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

                 hlt      42683    55.41%    99.88%      5.00us 4101811.63us   1730.63us ( +-   7.74% )
           interrupt      25461    33.06%     0.07%      0.31us   1548.01us      1.92us ( +-   7.50% )
                 npf       8882    11.53%     0.06%      1.36us    591.95us      4.79us ( +-   2.32% )

Total Samples:77026, Total events handled time:73959951.08us.

I suspect that this is "time to service the exit", i.e. time between the exit and a subsequent KVM_RUN

You are right that this is the time between exit and entry (https://github.com/torvalds/linux/blob/master/tools/perf/arch/x86/util/kvm-stat.c#L23-L24 and https://github.com/torvalds/linux/blob/master/tools/perf/builtin-kvm.c#L75-L101), which then makes sense for such a high value, but I assume that we would still want this to be as low as possible?

Attached is a flamegraph for the KVM task, captured with sudo perf record -a -g --call-graph dwarf -F1000 -e cpu-clock:HG -p $RUNSC_SANDBOX_PID -- sleep 10

flamegraph.cpuclock.zip [screenshot omitted]

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 120 days.

github-actions[bot] commented 10 months ago

This issue has been closed due to lack of activity.

jseba commented 5 months ago

I'd like to re-open this issue in light of GPU workloads requiring the use of systrap/ptrace. Since KVM is not an option there, this is now causing us issues again, with too many processes masquerading as threads. Can these processes be tagged in any way so that something like eBPF can ignore them when it's scanning processes?

github-actions[bot] commented 1 month ago

A friendly reminder that this issue had no activity for 120 days.