kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

application crash due to k8s 1.9.x open the kernel memory accounting by default #61937

Closed wzhx78 closed 3 years ago

wzhx78 commented 6 years ago

When we upgraded k8s from 1.6.4 to 1.9.0, after a few days the production environment reported that machines hang and the JVM crashes randomly inside containers. We found that the cgroup memory css ids are not released; once the cgroup css id count grows larger than 65535 the machine hangs and we must restart it.

We found that runc/libcontainer's memory.go in k8s 1.9.0 had deleted the if condition, which causes kernel memory accounting to be enabled by default. But we are running kernel 3.10.0-514.16.1.el7.x86_64, and on this version the kernel memory limit is not stable, which leaks memory cgroups and crashes applications randomly.

When we run `docker run -d --name test001 --kernel-memory 100M`, docker reports: WARNING: You specified a kernel memory limit on a kernel older than 4.0. Kernel memory limits are experimental on older kernels, it won't work as expected and can cause your system to be unstable.
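For anyone who wants to check whether kernel memory accounting has actually been switched on for the pod cgroups on a node, a quick probe like the one below can help. This is only a sketch: it assumes the cgroup v1 memory hierarchy is mounted at /sys/fs/cgroup/memory and that the kubelet places pods under a kubepods cgroup (kubepods.slice with the systemd driver).

# List kernel-memory usage for every pod-related memory cgroup; a non-zero
# value suggests kernel memory accounting is active for that cgroup.
find /sys/fs/cgroup/memory/kubepods* -name memory.kmem.usage_in_bytes 2>/dev/null |
while read -r f; do
    echo "$f $(cat "$f")"
done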

k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

-       if d.config.KernelMemory != 0 {
+           // Only enable kernel memory accouting when this cgroup
+           // is created by libcontainer, otherwise we might get
+           // error when people use `cgroupsPath` to join an existed
+           // cgroup whose kernel memory is not initialized.
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }

I want to know why kernel memory accounting is enabled by default. Can k8s take different kernel versions into account?

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

/kind bug

What happened: applications crash and memory cgroups leak

What you expected to happen: applications stay stable and memory cgroups do not leak

How to reproduce it (as minimally and precisely as possible): install k8s 1.9.x on a machine with kernel 3.10.0-514.16.1.el7.x86_64, then create and delete pods repeatedly. After more than 65535/3 creations, the kubelet reports "cgroup no space left on device" errors, and once the cluster has run for a few days the containers crash.
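A simple way to watch for the symptom while reproducing is to follow the kubelet log for the cgroup creation failures (a sketch; it assumes the kubelet runs as a systemd unit):

# Watch for the "no space left on device" errors the kubelet emits once
# the memory cgroup ID space is exhausted.
journalctl -u kubelet -f | grep --line-buffered "no space left on device"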

Anything else we need to know?:

Environment:

- OS (e.g. from /etc/os-release): CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. `uname -a`): 3.10.0-514.16.1.el7.x86_64
- Install tools: rpm
- Others:
qkboy commented 6 years ago

The test case below can reproduce this error. First, fill up the memory cgroup hierarchy:

# uname -r
3.10.0-514.10.2.el7.x86_64
# kubelet --version
Kubernetes 1.9.0
# mkdir /sys/fs/cgroup/memory/test
# for i in `seq 1 65535`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done
# cat /proc/cgroups |grep memory
memory  11      65535   1

Then release memory cgroups so that 99 free slots are available for the next round of creation:

# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i} 2>/dev/null 1>&2; done 
# mkdir /sys/fs/cgroup/memory/stress/
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device <-- notice number 100 can not create
# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i}; done <-- delete 100 cgroup memory
# cat /proc/cgroups |grep memory
memory  11      65436   1

Second, create a new pod on this node. Each pod creates 3 memory cgroup directories, for example:

# ll /sys/fs/cgroup/memory/kubepods/pod0f6c3c27-3186-11e8-afd3-fa163ecf2dce/
total 0
drwxr-xr-x 2 root root 0 Mar 27 14:14 6d1af9898c7f8d58066d0edb52e4d548d5a27e3c0d138775e9a3ddfa2b16ac2b
drwxr-xr-x 2 root root 0 Mar 27 14:14 8a65cb234767a02e130c162e8d5f4a0a92e345bfef6b4b664b39e7d035c63d1

So when we recreate the 100 memory cgroup directories, 4 of them fail:

# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device <-- 3 directory used by pod
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device
# cat /proc/cgroups 
memory  11      65439   1

Third, delete the test pod, and recreate the 100 memory cgroup directories after confirming that all of the test pod's containers have been destroyed. The correct, expected result is that only directory number 100 cannot be created:

# cat /proc/cgroups 
memory  11      65436   1
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

But the actual, incorrect result is that all memory cgroup directories created by the pod are leaked:

# cat /proc/cgroups 
memory  11      65436   1 <-- current total number of memory cgroups
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

Notice that the memory cgroup count has already been reduced by 3, but the slots those cgroups occupied are not released.
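A compact version of this probe, for checking how many free memory cgroup slots a node still has (a sketch based on the steps above; it creates and removes empty cgroups under a scratch directory, which is safe because empty cgroups are reclaimed cleanly):

# Count free memory cgroup slots by creating empty cgroups until mkdir
# fails with ENOSPC; stop at 1000 so the probe does not exhaust the node.
mkdir -p /sys/fs/cgroup/memory/probe
free=0
while [ $free -lt 1000 ] && mkdir /sys/fs/cgroup/memory/probe/slot-$free 2>/dev/null; do
    free=$((free + 1))
done
echo "free memory cgroup slots (capped at 1000): $free"
rmdir /sys/fs/cgroup/memory/probe/slot-* /sys/fs/cgroup/memory/probe 2>/dev/null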

wzhx78 commented 6 years ago

/sig container /kind bug

wzhx78 commented 6 years ago

@kubernetes/sig-cluster-container-bugs

feellifexp commented 6 years ago

This bug seems to be related: https://github.com/opencontainers/runc/issues/1725

Which docker version are you using?

qkboy commented 6 years ago

@feellifexp with docker 1.13.1

frol commented 6 years ago

There is indeed a kernel memory leak in kernels prior to the 4.0 release. You can follow this link for details: https://github.com/moby/moby/issues/6479#issuecomment-97503551

wzhx78 commented 6 years ago

@feellifexp the kernel log also shows this message after upgrading to k8s 1.9.x:

kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)

wzhx78 commented 6 years ago

I want to know why k8s 1.9 deleted the line `if d.config.KernelMemory != 0 {` in k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

feellifexp commented 6 years ago

I am not an expert here, but this seems to be a change coming from runc, and it has been in k8s since v1.8. After reading the code, it seems to affect the cgroupfs cgroup driver, while the systemd driver is unchanged. I have not tested that theory yet, though. Maybe experts on the kubelet and containers can chime in further.
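A quick way to check which cgroup driver a node is actually using (assuming docker as the container runtime):

# The runtime's cgroup driver and the kubelet's --cgroup-driver flag should match.
docker info 2>/dev/null | grep -i 'cgroup driver'
ps -ef | grep -o -- '--cgroup-driver=[a-z]\+' | sort -u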

kevin-wangzefeng commented 6 years ago

/sig node

kevin-wangzefeng commented 6 years ago

I want to know why k8s 1.9 deleted the line `if d.config.KernelMemory != 0 {` in k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

I guess https://github.com/opencontainers/runc/pull/1350 is the one you are looking for, which is actually an upstream change.

/cc @hqhq

wzhx78 commented 6 years ago

Thanks @kevin-wangzefeng, the runc upstream had changed; I understand why now. The change is https://github.com/hqhq/runc/commit/fe898e7862f945fa3632580139602c627dcb9be0. But when kernel memory accounting is enabled on the root cgroup by default, the child cgroups enable it as well, and this causes the memory cgroup leak on kernel 3.10.0. @hqhq, is there any way to let us enable or disable kernel memory accounting ourselves, or at least to get a warning in the log when the kernel is older than 4.0?

hqhq commented 6 years ago

@wzhx78 The root cause is that there are kernel memory limit bugs in 3.10. If you don't want to use the kernel memory limit because it's not stable on your kernel, the best solution would be to disable the kernel memory limit in your kernel.

I can't think of a way to work around this on the runc side without causing issues like https://github.com/opencontainers/runc/issues/1083 and https://github.com/opencontainers/runc/issues/1347, unless we add some ugly logic that does different things for different kernel versions, and I'm afraid that won't be an option.

wzhx78 commented 6 years ago

@hqhq it is indeed kernel 3.10's bug, but we spent a lot of time tracking it down and it caused us serious trouble in the production environment, since all we did was upgrade k8s from 1.6.x to 1.9.x. In k8s 1.6.x, kernel memory accounting is not enabled by default because runc still had the if condition; from 1.9.x on, runc enables it by default. We don't want others who upgrade to k8s 1.9.x to run into the same trouble. Since runc is a popular container runtime, we think it needs to consider different kernel versions; at the very least, runc could report an error message in the kubelet log when the kernel is not suitable for enabling kernel memory accounting by default.

wzhx78 commented 6 years ago

@hqhq any comments ?

hqhq commented 6 years ago

Maybe you can add an option like --disable-kmem-limit for both k8s and runc to make runc disable kernel memory accounting.

warmchang commented 6 years ago

v1.8 and all later versions will be affected by this. https://github.com/kubernetes/kubernetes/commit/e5a6a79fd75372fcc7fa32ccf8d80ed9e0335b17#diff-17daa5db16c7d00be0fe1da12d1f9165L39


wzhx78 commented 6 years ago

@warmchang yes.

Is it reasonable to add a --disable-kmem-limit flag in k8s? Can anyone discuss this with us?

like-inspur commented 6 years ago

I can't find a config named disable-kmem-limit for k8s. How would this flag be added? @wzhx78

wzhx78 commented 6 years ago

k8s doesn't support it yet; we need to discuss with the community whether it is reasonable to add this flag to the kubelet start options.

gyliu513 commented 6 years ago

Not only 1.9, but also 1.10 and master have the same issue. This is a very serious issue for production; I think providing a parameter to disable the kmem limit would be good.

/cc @dchen1107 @thockin any comments for this? Thanks.

wzhx78 commented 6 years ago

@thockin @dchen1107 any comments for this?

gyliu513 commented 6 years ago

@dashpole is there any reason for updating memory.go as follows in https://github.com/kubernetes/kubernetes/commit/e5a6a79fd75372fcc7fa32ccf8d80ed9e0335b17#diff-17daa5db16c7d00be0fe1da12d1f9165L39? This is seriously impacting Kubernetes 1.8, 1.9, 1.10, 1.11, etc.

-       if d.config.KernelMemory != 0 {
+           // Only enable kernel memory accouting when this cgroup
+           // is created by libcontainer, otherwise we might get
+           // error when people use `cgroupsPath` to join an existed
+           // cgroup whose kernel memory is not initialized.
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
dashpole commented 6 years ago

@gyliu513 enabling kernel memory accounting in that PR was not intentional. However, we do try to stay close to upstream runc so we can continue to receive bug fixes and other improvements. The original runc bump in cAdvisor, which required me to update runc in kubernetes/kubernetes, was for a bugfix. As pointed out in https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-377736075, the correct workaround here is to disable kernel memory accounting in your kernel.

gyliu513 commented 6 years ago

Thanks @dashpole, I will do some tests on disabling kernel memory accounting: https://github.com/kubernetes/kubernetes/blob/release-1.10/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go#L87-L89

luckyfengyong commented 6 years ago

@hqhq I understand we want to leverage upstream runc fixes as much as possible. However, Kubernetes does not support limiting kernel memory yet, so the concern in https://github.com/opencontainers/runc/issues/1347 doesn't apply to Kubernetes.

This kind of memory leak is really critical, and it is also hard for customers to rebuild the kernel in a large production environment.

It would be really great if we could resolve it in Kubernetes, either through the runc code or through a Kubernetes parameter.

gyliu513 commented 6 years ago

@dashpole a couple more questions I'd like you to confirm, thanks!

1) Is there an official way to disable kernel memory accounting? From https://github.com/kubernetes/kubernetes/blob/release-1.10/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go#L87-L89, the code only checks whether the file exists, so it seems deleting the file could disable kernel memory accounting, but I'd like your comments on an official way.
2) Is there any impact if we disable kernel memory accounting?

FYI @hchenxa

luckyfengyong commented 6 years ago

@gyliu513 You cannot (and won't get the chance to) manually delete memory.kmem.limit_in_bytes. Those files are created automatically and inherited from the parent directory whenever a memory cgroup sub-directory is created by https://github.com/kubernetes/kubernetes/blob/release-1.10/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go#L43.

You have to disable it in the kernel, and it seems the only way right now is to recompile the kernel. Otherwise, you have to disable the whole cgroup memory subsystem, per https://access.redhat.com/solutions/3217671.
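To confirm whether the running kernel was even built with kernel memory accounting, you can check the build config (a sketch; the config path below is the usual CentOS/RHEL location, some distros expose /proc/config.gz instead):

# CONFIG_MEMCG_KMEM=y means the kernel is capable of kmem accounting at all.
grep CONFIG_MEMCG_KMEM /boot/config-$(uname -r) 2>/dev/null ||
    zcat /proc/config.gz 2>/dev/null | grep CONFIG_MEMCG_KMEM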

dashpole commented 6 years ago

@gyliu513

  1. Follow this comment to disable kernel memory accounting by recompiling the kernel: https://github.com/opencontainers/runc/issues/1725#issuecomment-380428228.
  2. I did some testing with kernel memory accounting disabled, and found that it made a relatively small impact on the ability of the kubelet to manage memory. I would recommend increasing --eviction-hard's memory.available parameter by 50Mi when disabling kernel memory accounting.
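For example, on a node that otherwise runs with the common default threshold of memory.available<100Mi, the adjusted kubelet setting would look roughly like this (a sketch, other flags elided):

# Bump the hard eviction threshold by ~50Mi to cover kernel memory that no
# longer shows up in memory.usage_in_bytes once kmem accounting is disabled.
kubelet --eviction-hard="memory.available<150Mi" ...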

cc @filbranden FYI

filbranden commented 6 years ago

I think I'm in favor of adding a --disable-kmem-limit command-line flag... I guess that means first adding the plumbing through libcontainer to make that possible and then adding the flag to make Kubelet respect that...

Indeed, there's no good way to disable this system-wide except for recompiling the kernel... We've recently gone through the effort of rebuilding our 4.4 kernel systems to disable kmem accounting. (While it's desirable to enable it on systems with 4.13 or 4.14, where the accounting works properly without leaks and the information is useful to provide more precise memory accounting which should help in finding better targets for eviction in system OOMs.)

Cheers, Filipe

luckyfengyong commented 6 years ago

@dashpole Just curious why it is recommended to increase --eviction-hard's memory.available parameter by 50Mi. I thought Kubernetes currently does not let the user control the kmem limit, only the memory limit.

dashpole commented 6 years ago

@luckyfengyong the way the kubelet manages memory is roughly to compare memory.usage_in_bytes to memory_capacity - eviction_threshold. Because memory.usage_in_bytes will not include kernel memory, the measured usage will be lower than the actual usage, and we may hit the OOM killer even when the measured available memory does not look close to 0. In my testing, I found there was 30-50 Mi of unaccounted kernel memory when the node was under memory pressure. One way to compensate for this is to increase the eviction threshold.
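As a rough illustration (my own sketch of that signal, reading the root memory cgroup and using MemTotal as a stand-in for node capacity; the kubelet computes this from cAdvisor stats rather than this exact shell):

# memory.available ~= capacity - working_set, where working_set is
# usage_in_bytes minus total_inactive_file; kernel memory only shows up in
# usage_in_bytes when kmem accounting is enabled.
capacity=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/memory.stat)
kmem=$(cat /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes)
echo "available ~= $((capacity - (usage - inactive_file))) bytes (kmem counted: $kmem)"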

luckyfengyong commented 6 years ago

Thanks @dashpole.

I saw "The main 'kmem' counter is fed into the main counter, so kmem charges will also be visible from the user counter." in https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt. It sounds like kernel memory usage contributes to user memory usage, although I did not test it. So based on your test result, if kernel memory accounting is disabled, kernel memory usage won't be counted in the overall memory usage, and in that case, yes, we need to compensate for it.

It sounds like we all agree that --disable-kmem-limit is the right direction to resolve the issue. Do you plan to make the enhancement, or do you expect contributions from others? We are glad to help with it.

filbranden commented 6 years ago

It would be great if you could contribute that...

From our digging into it, it seems the kubelet is the only process setting the kmem limit (runc does not), so in that sense you only need to fix the kubelet, not runc.

On the other hand, you probably need a change in libcontainer (part of runc) to make it possible for the kubelet to skip setting the kmem limit there (since it seems the libcontainer change is what triggered this, it might need to be made conditional...)

Happy to help with code reviews and further guidance. @dashpole is definitely a good contact as well.

maxwell92 commented 6 years ago

It's interesting that only one machine ran into this trouble in our production cluster. I have no idea why that one was "chosen" instead of two or more machines.

yeepaysre commented 6 years ago

Maybe more pods were created on that machine than on the others.

huzhengchuan commented 6 years ago

I am hitting the same issue.

[root@kube-manager01 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"1-9-3", GitTreeState:"clean", BuildDate:"2018-06-05T12:35:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"1-9-3", GitTreeState:"clean", BuildDate:"2018-06-05T12:35:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

[root@kube-manager01 ~]# uname -a
Linux kube-manager01 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

tristanz commented 6 years ago

Is there any known workaround that avoids kernel recompilation? This is a severe regression (it breaks all workloads on loaded clusters) and appears to affect RHEL 7.x on 1.8.x and up.

w1ndy commented 6 years ago

Has anyone had luck with newer kernels on CentOS 7? elrepo-kernel seems like a good option.
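In case it helps, the usual route to a mainline kernel on CentOS 7 looks roughly like this (a sketch; it assumes the elrepo-release repository package is already installed on the node):

# Install the latest mainline kernel from ELRepo and boot into it.
yum --enablerepo=elrepo-kernel install -y kernel-ml
grub2-mkconfig -o /boot/grub2/grub.cfg
grub2-set-default 0    # the newest installed kernel is usually menu entry 0
reboot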

warmchang commented 6 years ago

@w1ndy, how is Red Hat dealing with this? RHEL 7.x uses the 3.10 kernel, so OCP (OpenShift) would have this issue too.

w1ndy commented 6 years ago

@warmchang I'm not affiliated with Red Hat, but I believe they are working on this. See https://bugzilla.redhat.com/show_bug.cgi?id=1507149

lining2020x commented 6 years ago

Hi, @dashpole 🙂

I see you have tried to disable kernel memory accounting by recompiling the kernel, and it seems to be working fine.

Have you encountered any problems?

I tried the workaround on CentOS 7.4 and encountered two problems:

  1. a compile error, reported as follows:
    mm/memcontrol.c: In function 'mem_cgroup_resize_limit':
    mm/memcontrol.c:4637:15: error: 'memcg_limit_mutex' undeclared (first use in this function)
    mutex_lock(&memcg_limit_mutex);
               ^
    mm/memcontrol.c:4637:15: note: each undeclared identifier is reported only once for each function it appears in
    mm/memcontrol.c: In function 'mem_cgroup_resize_memsw_limit':
    mm/memcontrol.c:4697:15: error: 'memcg_limit_mutex' undeclared (first use in this function)
    mutex_lock(&memcg_limit_mutex);
  2. the kABI check failed (after fixing problem 1):
    
    *** ERROR - ABI BREAKAGE WAS DETECTED ***

The following symbols have been changed (this will cause an ABI breakage):

dev_get_stats invalidate_bdev scsi_host_alloc dev_addr_add __mmdrop ...

Other information:
1. Environment
os: CentOS 7.4
kernel: kernel-3.10.0-693.el7.src.rpm

2. How to disable CONFIG_MEMCG_KMEM
I disabled CONFIG_MEMCG_KMEM by modifying kernel.spec in kernel-3.10.0-693.el7.src.rpm and rebuilding the srpm package.

Disable memory cgroup kmem:

for i in *.config; do
    sed -i 's/CONFIG_MEMCG_KMEM=y/# CONFIG_MEMCG_KMEM is not set/' $i
done

3. How to fix the compile error
Problem 1 can be fixed with the following modification:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23e6528..50ae8fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2962,6 +2962,8 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
+static DEFINE_MUTEX(memcg_limit_mutex);
+
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -3436,8 +3438,6 @@ out:
 	return new_cachep;
 }
 
-static DEFINE_MUTEX(memcg_limit_mutex);
-
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 	struct kmem_cache *c;

Smana commented 6 years ago

We are facing the same issue here with a high-load production Kubernetes cluster. Is there a way to disable kernel memory accounting at boot time (e.g. by adding a boot parameter to grub)?

Smana commented 6 years ago

We're currently reinstalling our oldest nodes with a 4.15 kernel. Hopefully that fixes the issue for good.

realxujiang commented 6 years ago

We're using CentOS 7 and ran into a similar problem.

Environment:

Is there an elegant solution?

chilicat commented 6 years ago

Same here, but I guess we can only wait for a newer RHEL/CentOS kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1507149

pires commented 6 years ago

CentOS 7.5 fixes it as far as I can test.

chilicat commented 6 years ago

Kernel 3.10.0-862.el7.x86_64? Or is there a newer one?

apatil commented 6 years ago

@pires Which kernel version?

ocofaigh commented 6 years ago

Reproduced on the latest kernel version (3.10.0-862.9.1.el7.x86_64) on RHEL 7.5. @pires can you confirm your kernel version and whether the issue is really fixed for you?