kata-containers / qemu

Kata containers QEMU

migration: add capability to bypass the shared memory #2

Closed bergwolf closed 6 years ago

bergwolf commented 6 years ago

1) What's this

When the migration capability 'bypass-shared-memory' is set, shared memory is skipped during migration.

It is the key feature that enables several useful qemu features, such as qemu-local-migration, qemu-live-update, extremely-fast-save-restore, vm-template, vm-fast-live-clone, yet-another-post-copy-migration, etc.

The philosophy behind this feature, and the advanced features built on it, is that part of the memory management is separated out from qemu, so that other toolkits such as libvirt, kata-containers (https://github.com/kata-containers), runv (https://github.com/hyperhq/runv/), or multiple cooperating qemu commands can directly access it, manage it, and provide features on top of it.

2) Status in real world

hyperhq (http://hyper.sh, http://hypercontainer.io/) introduced the vm-template (vm-fast-live-clone) feature to hyper container several years ago, and it works perfectly (see https://github.com/hyperhq/runv/pull/297).

With vm-template, containers (VMs) can be started in 130ms, and each container (VM) saves 80M of memory, so hyper containers are as fast and as dense as normal containers.

The kata-containers project (https://github.com/kata-containers), which was launched by hyper, intel and friends and is descended from runv (and clear-container), should have this feature enabled. Unfortunately, due to code conflicts between runv and cc, the feature was temporarily disabled; it is being brought back by the hyper and intel teams.

3) How to use and bring up advanced features.

On the current qemu command line, shared memory has to be configured via a memory backend object.

a) feature: qemu-local-migration, qemu-live-update

Set the mem-path on tmpfs and set share=on for it when starting the vm. Example:

    -object \
    memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
    -numa node,nodeid=0,cpus=0-7,memdev=mem

When you want to migrate the vm locally (after fixing a security bug in the qemu binary, or for some other reason), start a new qemu with the same command line plus -incoming, then migrate the vm from the old qemu to the new qemu with the migration capability 'bypass-shared-memory' set. Only the device state is migrated; the memory stays in place as the original memory backed by the tmpfs file.
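The local-migration steps above can be sketched as the QMP commands sent to the old (source) qemu. This is a minimal sketch, not the code used by runv or kata: the helper name `qmp_local_migration_commands` and the socket path are hypothetical, it assumes the new qemu was started with a matching `-incoming` URI, and it only covers the source side. `migrate-set-capabilities` and `migrate` are standard QMP commands; the capability name is the one added by this patch.

```python
import json


def qmp_local_migration_commands(uri):
    """Return, in order, the QMP commands to send to the old qemu.

    `uri` is assumed to match the -incoming URI of the new qemu.
    """
    return [
        # QMP handshake: leave capabilities-negotiation mode.
        {"execute": "qmp_capabilities"},
        # Enable the capability from this patch so that shared
        # memory is skipped and only device state is sent.
        {
            "execute": "migrate-set-capabilities",
            "arguments": {
                "capabilities": [
                    {"capability": "bypass-shared-memory", "state": True}
                ]
            },
        },
        # Start the migration to the new qemu.
        {"execute": "migrate", "arguments": {"uri": uri}},
    ]


if __name__ == "__main__":
    # Hypothetical socket path, for illustration only.
    for cmd in qmp_local_migration_commands("unix:/tmp/migrate.sock"):
        print(json.dumps(cmd))
```

Each printed line is one JSON message that could be written to the old qemu's QMP socket; reading the replies and waiting for the migration to complete are left out of the sketch.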

b) feature: extremely-fast-save-restore

The same as above, but the mem-path is on a persistent file system.

c) feature: vm-template, vm-fast-live-clone

The template vm is started as in a) and paused when the guest reaches the template point (for example, when the guest app is ready); then the template vm is saved. (The qemu process of the template can be killed now, because only the memory and the device state files (in tmpfs) are needed.)

Then we can launch one or more VMs based on the template vm state. The new VMs are started without “share=on”; they all share the initial memory from the memory file, which saves a lot of memory. All the new VMs start from the template point, so the guest app can get to work quickly.

A new VM booted from the template vm can’t become a template itself. If you need this unusual chained-template feature, you can write a cloneable-tmpfs kernel module for it.

The libvirt toolkit can’t manage vm-template currently; in hyperhq/runv, we use a qemu wrapper script to do it. I hope someone adds a “libvirt managed template” feature to libvirt.
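The difference between the template launch and the clone launches in c) comes down to the share flag on the memory backend. The following is a minimal sketch of how a wrapper might build those arguments; the helper names `memory_args`, `template_args` and `clone_args` are hypothetical, the option values are copied from the example in a), and restoring the saved device state into a clone is deliberately left out because the exact mechanism is not described here.

```python
def memory_args(mem_path, share, size="128M", mem_id="mem"):
    """Build the memory-backend arguments from the example in a).

    With share=True the guest writes go to the backing file (template);
    with share=False the "share=on" option is omitted, so each clone
    gets a private mapping of the same file.
    """
    backend = f"memory-backend-file,id={mem_id},size={size},mem-path={mem_path}"
    if share:
        backend += ",share=on"
    return [
        "-object", backend,
        "-numa", f"node,nodeid=0,cpus=0-7,memdev={mem_id}",
    ]


def template_args(mem_path):
    """Arguments for the template vm: memory file mapped with share=on."""
    return memory_args(mem_path, share=True)


def clone_args(mem_path):
    """Arguments for a vm cloned from the template: no share=on."""
    return memory_args(mem_path, share=False)


if __name__ == "__main__":
    print(" ".join(template_args("/dev/shm/memory")))
    print(" ".join(clone_args("/dev/shm/memory")))
```

Keeping both launches behind one helper makes it harder for the template and clone command lines to drift apart, which matters because the clones must see the same memory layout the template was saved with.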

d) feature: yet-another-post-copy-migration

It is a possible feature, but no toolkit can do it well now. Using an nbd server/client on the memory file is barely workable but inconvenient. A special tmpfs feature might be needed to fully complete it. No one needs yet another post-copy migration method, but it is possible if someone really wants it.

Cc: Samuel Ortiz sameo@linux.intel.com
Cc: Sebastien Boeuf sebastien.boeuf@intel.com
Cc: James O. D. Hunt james.o.hunt@intel.com
Cc: Xu Wang gnawux@gmail.com
Cc: Peng Tao bergwolf@gmail.com
Cc: Xiao Guangrong xiaoguangrong@tencent.com
Cc: Xiao Guangrong xiaoguangrong.eric@gmail.com
Signed-off-by: Lai Jiangshan jiangshanlai@gmail.com

bergwolf commented 6 years ago

ref: kata-containers/runtime/pull/303

bergwolf commented 6 years ago

cc @laijs

bergwolf commented 6 years ago

@devimc PR updated. PTAL.

jodh-intel commented 6 years ago

Apart from the performance improvement, is there a test to add to ensure this is DTRT?

lgtm.

/cc @anthonyzxu, @sboeuf, @markdryan, @rbradford.

bergwolf commented 6 years ago

@jodh-intel Any suggestions on how to add tests for it here? It seems there is no CI for the qemu repo.

Do we build qemu-lite from source in CI? If so, I can add some tests for vm factory in the tests repo and make it depend on this PR.

jodh-intel commented 6 years ago

@bergwolf - no, we don't any more: https://github.com/kata-containers/tests/blob/master/.ci/install_qemu.sh#L15.

Not a blocker and it doesn't even need to be in this repo but it would be good to be able to assert we're getting the expected behaviour.

bergwolf commented 6 years ago

@jodh-intel In that case, I would suggest merging this PR first and then I can start adding tests for vm factory in the tests repo that would rely on this patch. OTOH, the patch itself has been in production use on https://hyper.sh for more than two years so I'd say it's quite stable. WDYT?

jodh-intel commented 6 years ago

lgtm

I'll wait to see if we can get one more review on this PR today but your plan sounds good to me.

btw @chavafg - I think we should try to set up a CI for qemu at some point, as we're modifying the code and it would be best if we could find issues with qemu here rather than when we test it in combination with other system elements.

/cc @grahamwhaley, @sboeuf.

bergwolf commented 6 years ago

@grahamwhaley Yes, this is being pushed to QEMU upstream.