plwhite closed this issue 5 years ago.
Peter, thanks for reporting. Was there an outcome of the discussion which gives a hint why this could be happening?
Sorry for the delay, missed this question.
It looks like (slightly speculatively) the 20M resource limit on the volumecontainerdisk container is too low, and the OOM killer destroys the container. I have no idea why I always see the OOM killer when it isn't seen in other environments, but I am running in Azure.
If it's easy to get a build with a larger (or configurable) limit, it would be pretty simple for me to test a fix in that environment.
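(For context, the limit in question appears on the volumecontainerdisk container of the generated virt-launcher pod roughly as sketched below. This is an illustrative fragment built from the 20M figure above, not a dump from a real pod; any CPU limit is omitted and the request is assumed equal to the limit purely for illustration.)

```yaml
# Illustrative fragment of the generated virt-launcher pod spec
containers:
  - name: volumecontainerdisk
    resources:
      limits:
        memory: 20M      # the limit suspected of triggering the OOM killer
      requests:
        memory: 20M      # assumed equal to the limit for this sketch
```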
Hey, it would be helpful if you could try setting a higher request for the container and see if it fixes your issue. There were some changes in the containerDisk area that could explain this.
I thought the request was hard-coded in KubeVirt? I tried manually changing the pod spec (removing the limits from the containers), and that changed the behaviour: the VM didn't start any more, but it didn't hit the OOM killer while copying the disk contents.
Yes, the memory request/limit for the containerdisk container is hardcoded. We might have ~~2~~ 1 issue~~s~~ here:

- the values are very likely too low for old containerdisks
- oh, no, old containerdisks also use the new binary only
To progress this, is there some way I can change the hardcoded limit? Even if it's a bit of a hack for now, it might give some information about what is going on.
You could build KubeVirt on your own; you only need to have `make` and `docker` installed:

```shell
DOCKER_PREFIX="index.docker.io/<yourDockerUsername>" DOCKER_TAG="<aDockerTag>" make bazel-push-images manifests
```

That will build and push new images (make sure you are logged in to your Docker account) and create manifests in `_out/manifests/release`. The hardcoded values live in container-disk.go [0].
[0] https://github.com/kubevirt/kubevirt/blob/master/pkg/container-disk/container-disk.go#L126
Thanks - let me have a shot at that. I should get to that later in the week.
OK, so I changed that file and set all of the memory limits in the container to 200M and the CPU limits to 100m just for good measure. Having done so, I rebuilt and redeployed and can now create VMs based on container images - I've tried at least a dozen including some multi-GB images.
I have no idea what the actual memory limit should be in practice; for the monster images I am often working with, 200M is a rounding error. Incidentally, this may be a silly proposal (I haven't got the background), but I would have expected the volumecontainerdisk image to run as an init container and then terminate after the copy, releasing that memory, in which case having a slightly over-large limit would matter much less.
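(Roughly, the limits on the containerDisk containers after that rebuild look like the sketch below, using the values tested above; the exact generated pod spec may differ in detail.)

```yaml
# Illustrative result of raising the hardcoded values in container-disk.go
resources:
  limits:
    cpu: 100m       # raised "for good measure"
    memory: 200M    # comfortably above the original 20M
```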
Thanks a lot for the valuable feedback! I'm a bit busy myself this week, followed by 2 weeks of vacation. Maybe @rmohr can follow up on this next week when he is back from his vacation; he is very familiar with this topic.
Sure - happy to help, especially given that this is after all a bug that only I have come across! Do let me know if there's anything else I can do.
@plwhite could you please retest with https://github.com/kubevirt/kubevirt/pull/2687 and close if no longer reproducible?
Looks like #2687 fixes this.
(To be clear - I have tested and it no longer repros with the master build today, so definitely looks like a good fix.)
Great :) Thx for the feedback!
We saw this in 0.21.0. It was resolved after upgrading to 0.22.0.
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened:
Using 0.20.1 on Azure, all VMIs failed. The volumecontainerdisk container was hitting the OOM killer 100% of the time.
What you expected to happen:
VMIs to start!
How to reproduce it (as minimally and precisely as possible):
Take a vanilla Ubuntu 18.04 image and create a VMI with a minimal config. It will fail if you are running in Azure (AKS). Dockerfiles and the VMI spec are included below for reference.
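For orientation, a minimal containerDisk-backed VMI of the kind described generally looks something like the sketch below (illustrative only; the image name is a placeholder and this is not the exact spec referenced later):

```yaml
# Illustrative minimal VMI using a containerDisk (image name is a placeholder)
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: ubuntu-vmi
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
    resources:
      requests:
        memory: 1Gi
  volumes:
    - name: containerdisk
      containerDisk:
        image: docker.io/<yourDockerUsername>/ubuntu1804-container-disk:latest
```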
Anything else we need to know?:
Problem is specific to Azure; not seen elsewhere. Not a clue why, frankly, though if we are just running out of memory there are various possibilities.
Problem is specific to v0.20.1; does not repro in v0.19.
Problem manifests with either scratch containers or those based on kubevirt/container-disk-v1alpha.
If you create a VMI, then copy and hack the pod spec to remove the resource limits, the OOM killer does not kick in; however, the compute pod doesn't work either. This is not a supported thing to do, but it suggests that the underlying problem can be resolved with different limits for the volumecontainerdisk container (see the sketch below).
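What that hack amounts to, roughly (illustrative only; the virt-launcher pod spec is generated by KubeVirt, and editing a copy of it by hand is unsupported):

```yaml
# Illustrative: the copied pod spec with limits stripped from the
# volumecontainerdisk container, so no memory cap applies during the copy
containers:
  - name: volumecontainerdisk
    # resources block (limits/requests) deleted for the experiment
```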
Environment:
- KubeVirt version (use `virtctl version`): v0.20.1
- Kubernetes version (use `kubectl version`): 1.13.9
- Kernel (use `uname -a`): Don't know; getting access to the boxes is a pain, but I could find out if it helps.

Dockerfile (using a completely vanilla downloaded Ubuntu image):

Also tried this Dockerfile:

VMI spec I used:
I've discussed this on Slack with @slintes, @fabiand and @rmohr. Many thanks to all of you for your help.