vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

Endpoint VM should not be a balloon target - OOM problems #6546

Open corrieb opened 7 years ago

corrieb commented 7 years ago

User Statement:

As a vSphere admin I may want to set resource limits around a VCH. If containerVMs start to exceed these limits, I would expect those VMs to experience some balloon/host swap, but I would not want the control plane impacted.

Details:

I hit a very strange problem today, and it took me a while to figure out what it was. I was getting "no space left on device" during a docker pull operation. The endpoint VM seemed to have plenty of disk space, but I couldn't run top or any other command because there was simply no memory left.

root@ [ /var/log/vic ]# top
-bash: fork: Cannot allocate memory

I re-ran my scenario and eventually reproduced the problem. Running top showed that memory usage had suddenly grown to 90-95% of the VM's memory, and none of that usage could be tied to any of the endpoint VM services.

As it happens, what I was seeing was ballooning in the endpoint VM. Because it typically sees low CPU and memory utilization, it is targeted more aggressively for reclamation than the container VMs. The net result: tmpfs ran out of space, and eventually you couldn't even ssh into the endpoint VM.

Acceptance Criteria:

We need to protect the endpoint VM from ballooning at the very least. A more aggressive strategy would be to protect it from any form of memory reclamation by setting a memory reservation on it out-of-the-box.
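A minimal sketch of both options in Python, assuming the govc CLI is used to reconfigure the endpoint VM. Setting the advanced option sched.mem.maxmemctl to 0 caps the balloon size at zero, and -mem.reservation sets a memory reservation in MB. The VM path, the 2048 MB figure, and the helper names are hypothetical, and the advanced option may need a power cycle to take effect; this is not how VIC itself configures a VCH.

```python
import subprocess
from typing import Callable, List


def balloon_cap_cmd(vm_path: str) -> List[str]:
    """Build a govc command that caps the balloon size at 0 MB for one VM."""
    # sched.mem.maxmemctl limits how much memory (in MB) the balloon
    # driver may reclaim from the VM; 0 means "reclaim nothing".
    return ["govc", "vm.change", "-vm", vm_path, "-e", "sched.mem.maxmemctl=0"]


def reservation_cmd(vm_path: str, reservation_mb: int) -> List[str]:
    """Build a govc command that reserves reservation_mb of memory for one VM."""
    if reservation_mb <= 0:
        raise ValueError("reservation_mb must be positive")
    return ["govc", "vm.change", "-vm", vm_path, f"-mem.reservation={reservation_mb}"]


def protect_endpoint(vm_path: str, reservation_mb: int,
                     run: Callable = subprocess.run) -> None:
    """Apply both protections; `run` is injectable so the calls can be dry-run."""
    for cmd in (balloon_cap_cmd(vm_path), reservation_cmd(vm_path, reservation_mb)):
        run(cmd, check=True)
```

Calling protect_endpoint("/dc1/vm/my-vch", 2048) would shell out to govc twice, using whatever GOVC_URL and credentials are set in the environment. A full reservation equal to the VM's configured memory is the more aggressive option, since it rules out host swap as well as ballooning.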

This is a relatively urgent change, because I suspect we'll see this problem in the field before too long... and the symptoms are hard to debug.

hickeng commented 7 years ago

@corrieb I'm not sure how ballooning is occurring - the endpoint doesn't have a balloon driver unless Photon have rolled it into the core kernel. Are you seeing ballooning, or just balloon target?

mdubya66 commented 6 years ago

@corrieb or @hickeng status on this?

hickeng commented 6 years ago

@corrieb have you seen this again? @mhagen-vmware we could add a vSphere stats check to the tests to assert that no ballooning is occurring. This should be done in a wrapper around the Container Remove keyword.
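A minimal sketch of such an assertion in Python, assuming the test harness has already collected samples of the vSphere mem.vmmemctl.average counter (balloon size in KB) for each VM, for example with govc metric.sample or pyVmomi. The function names, VM names, and input shape are hypothetical and not part of the VIC test suite.

```python
from typing import Dict, List


def ballooned_vms(samples_kb: Dict[str, List[int]],
                  tolerance_kb: int = 0) -> Dict[str, int]:
    """Return {vm_name: peak balloon size in KB} for VMs above the tolerance."""
    peaks: Dict[str, int] = {}
    for vm, samples in samples_kb.items():
        # A VM with no samples is treated as not ballooned.
        peak = max(samples, default=0)
        if peak > tolerance_kb:
            peaks[vm] = peak
    return peaks


def assert_no_ballooning(samples_kb: Dict[str, List[int]],
                         tolerance_kb: int = 0) -> None:
    """Raise AssertionError naming every VM whose balloon exceeded the tolerance."""
    offenders = ballooned_vms(samples_kb, tolerance_kb)
    if offenders:
        details = ", ".join(
            f"{vm} peaked at {kb} KB" for vm, kb in sorted(offenders.items())
        )
        raise AssertionError(f"ballooning detected: {details}")
```

Checking the peak rather than the latest sample matters here, because a balloon that inflated and later deflated during the test would otherwise go unnoticed.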

cgtexmex commented 6 years ago

Why is this considered a p0? Is it for investigation? Have we seen it repeatedly?

mdubya66 commented 6 years ago

Possibly seen here: https://vmwarecode.slack.com/archives/C293W9V0A/p1524500312000627

Hey all - i'm running into an error with my container deployments (yet again - i'm gonna get this working if it kills me). suddenly when trying to pull images i'm getting the following:

PS C:\Users\jfeeser> docker -H 10.100.5.61:2376 --tls run --name nginx-proxy -v nginx-proxy-persist/var/www:/usr/share/nginx/html:ro -v nginx-proxy-persist/var/nginx/conf:/etc/nginx:ro -P -d nginx
Unable to find image 'nginx:latest' locally
latest: Pulling from library/nginx
2a72cbf407d6: Downloading  22.49MB/22.49MB
a3ed95caeb02: Download complete
04b2d3302d48: Verifying Checksum
e7f619103861: Download complete
C:\Program Files\Docker\Docker\Resources\bin\docker.exe: library/nginx/534bc4991cb28264154568020aedc2f3c6f1e4ca9758ef4a9fa86125154bb33f returned write /tmp/https/registry.hub.docker.com/v2/library/nginx/latest/534bc4991cb28264154568020aedc2f3c6f1e4ca9758ef4a9fa86125154bb33f/534bc4991cb28264154568020aedc2f3c6f1e4ca9758ef4a9fa86125154bb33f.json: no space left on device.

i'm not sure which "device" it's talking about - the datastore where the volumes live has plenty of space (over a TB), so i'm guessing there's a temp folder somewhere that's filling up but i have no idea where to go to clean that.

hickeng commented 6 years ago

@mdubya66 Slack is unavailable for me currently. This is unlikely to be related to ballooning because (a) there's no balloon driver in the guest, and (b) even if there were, the balloon should deflate automatically.

It could be related to #6093 which I've never triaged. I do not know how much space is consumed by these temporary entries.

corrieb commented 6 years ago

The balloon driver absolutely is active in the Endpoint VM - confirmed with 1.3.1. Confirmed not just from the vSphere perspective, but also using top, which clearly shows the memory being eaten up by the balloon.

[Screenshot taken 2018-04-25 at 4:00:28 PM]

corrieb commented 6 years ago

@hickeng The issue above could simply be the amount of memory available in the endpoint VM if the user is trying to pull an image with very large layers, right? If the Docker client says that there's no space left on device in the example where the balloon driver restricts available memory, it stands to reason it would do the same thing if memory is restricted for other reasons.

Note that the balloon driver only deflates when enough memory becomes active, so an allocation could easily fail before the deflation starts.
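To make that failure mode concrete, here is a minimal Python sketch of a check that could be run inside the endpoint VM before a pull. It assumes image layers are buffered under /tmp, that /tmp is a memory-backed tmpfs (as this thread suggests), and that the kernel exposes MemAvailable in /proc/meminfo. The helper names and the 10% headroom are hypothetical.

```python
import shutil
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo text into {field: value}, converting kB fields to bytes."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        # Most fields carry a "kB" suffix; a few (e.g. HugePages_Total) are plain counts.
        is_kb = len(parts) > 1 and parts[1] == "kB"
        info[name.strip()] = value * 1024 if is_kb else value
    return info


def can_buffer_layer(layer_bytes: int, tmp_free_bytes: int,
                     mem_available_bytes: int, headroom: float = 0.10) -> bool:
    """Return True if a layer fits in tmpfs without exhausting memory."""
    # tmpfs pages live in RAM, so the write has to fit both the filesystem's
    # free space and the memory the kernel can still hand out.
    budget = min(tmp_free_bytes, mem_available_bytes) * (1.0 - headroom)
    return layer_bytes <= budget


def check_endpoint(layer_bytes: int, tmp_path: str = "/tmp") -> bool:
    """Read live values from the guest and apply can_buffer_layer."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    tmp_free = shutil.disk_usage(tmp_path).free
    return can_buffer_layer(layer_bytes, tmp_free, mem["MemAvailable"])
```

With an inflated balloon, MemAvailable shrinks even though the free space reported for /tmp may not, which is why the sketch takes the smaller of the two figures.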

corrieb commented 6 years ago

To confirm, I just tried pulling tomcat with a balloon inflated in the endpoint VM. The result is entirely consistent with what I originally reported. Below is the output, which appears to match the symptom the user observed. The only difference seems to be that this ran out of space while downloading, while theirs ran out of space while extracting.

@mdubya66 I suggest you point the user to this bug and suggest that they double the memory of their endpoint VM and try again. They should also check for ballooning if they can.

vagrant@ubuntu-1604-vmware:~/vic/src/github.com/vmware/dev-dock$ docker pull tomcat
Using default tag: latest
latest: Pulling from library/tomcat
c73ab1c6897b: Pull complete 
a3ed95caeb02: Pull complete 
1ab373b3deae: Extracting [====================================>              ]  8.126MB/11.11MB
b542772b4177: Download complete 
0bcc3741ab14: Download complete 
421d624d778d: Download complete 
26ad58237506: Download complete 
8dbabc90b2b8: Downloading [==================================================>]  155.2MB/155.2MB
982930be204d: Download complete 
80869be51738: Download complete 
b71ce0f0260c: Download complete 
b18814a5c704: Download complete 
444f958494eb: Download complete 
6f92b6053b75: Download complete 
library/tomcat/cadc4701ed47bb0dca90f5936a1c6757286dd49d3c58823099545fab931daa8a returned download failed: write /tmp/8dbabc90b2b8367078151: no space left on device