Automatically exported from code.google.com/p/ganeti

Improve bandwidth usage when migrating instances over WAN. #473

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
We are performing some migrations (instances from ClusterA to ClusterB) which 
move 300GB of data across a WAN and tend to be very slow. It is normal 
for a user's instance to take 3 days to migrate, and the watcher is locked while 
the migration is happening.

I am suggesting that the instance move, instead of using dd to copy the data, 
could use a different approach that transfers only the data the user is 
actually using:

* on the receiving end, create a directory structure that mimics the sending 
instance
* on the sending side:
 - rsync that data across the line
   - by creating a tar.gz file and rsyncing it
   - by rsyncing the directory structure itself (without tar + gz; a rough sketch of this variant follows below)

or

* reserve some space on the sending and receiving nodes (equal to the maximum 
size of an instance, according to the policy)
* use export on that space
* transfer the data (rsync, for example)
* import the data on the other side
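
A rough sketch of the first variant, with hypothetical device, mount and host names (later comments discuss why mounting instance filesystems on the host is problematic):

```
# Sending node: mount the instance filesystem read-only (hypothetical paths)
mount -o ro /dev/xenvg/instance0-disk0 /mnt/instance0

# Receiving node has an empty filesystem mounted at the same path; rsync then
# transfers only the files that actually exist, instead of every block
rsync -aHAX --numeric-ids /mnt/instance0/ nodeB.example.com:/mnt/instance0/
```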

Original issue reported on code.google.com by clim...@google.com on 23 May 2013 at 7:00

GoogleCodeExporter commented 9 years ago
Hi Jesus,

This is definitely high priority on our roadmap. I am not entirely sure about 
the proposed approach, though: considering that Ganeti treats instances as black 
boxes, wouldn't "opening" them violate that requirement? Sure, we could use the 
OS definitions' help for this, but that definitely risks introducing more bugs, 
unfortunately.

We definitely need to resolve the locking of the watcher during the migration; 
that much is agreed.

thoughts?

Thanks,

Guido

Original comment by ultrot...@google.com on 24 May 2013 at 6:57

GoogleCodeExporter commented 9 years ago
Does the process as it stands now (using dd) use any kind of compression? Since 
the core of what Jesus is suggesting is to avoid copying empty blocks, using 
compression should at least shrink the data representing those empty blocks 
dramatically, and it would let us keep a filesystem-agnostic approach.
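
For example, compressing a gigabyte of zeros versus a gigabyte of random data (standing in for heavily churned free blocks) makes the difference easy to see; this is just a quick local check, assuming GNU dd and gzip are available:

```
dd if=/dev/zero    bs=1M count=1024 2>/dev/null | gzip --fast | wc -c   # zeroed blocks compress to a few MB
dd if=/dev/urandom bs=1M count=1024 2>/dev/null | gzip --fast | wc -c   # random data stays at roughly 1 GiB
```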

Original comment by jpwoodbu@google.com on 6 Jun 2013 at 10:11

GoogleCodeExporter commented 9 years ago
On second thought, while compression would certainly help, with sufficient 
churn on the disk most of the free blocks won't be filled with zeros.

Original comment by jpwoodbu@google.com on 6 Jun 2013 at 10:17

GoogleCodeExporter commented 9 years ago

Original comment by ultrot...@google.com on 9 Jul 2013 at 2:49

GoogleCodeExporter commented 9 years ago
What about synchronising only the changed parts of the block devices?
Something like http://bdsync.rolf-fokkens.nl/ 

Original comment by tilo....@googlemail.com on 27 Jul 2013 at 10:00

GoogleCodeExporter commented 9 years ago
A bit off topic, I guess, but as for changed-block synchronisation tools, 
lvmsync (https://github.com/mpalmer/lvmsync) also seems interesting and may be 
more efficient for LVM-backed disks.

Original comment by informat...@gmail.com on 27 Jul 2013 at 12:10

GoogleCodeExporter commented 9 years ago
Everything that involves mounting instance filesystems on the host should be 
avoided; see [0], [1] or [2] for why that's insecure.
libguestfs ([3]) works around this issue by starting a tiny appliance in a VM 
and performing the mounting inside that appliance. It understands quite a few 
filesystems and can also help to sparsify disks (e.g. by using zerofree [4]).

I see two options for speeding up disk moves:
1)
 * Use libguestfs to mount all filesystems in a disk and zero the free blocks in them
 * Perform the move using dd | (gzip,bzip2)
2)
 * Use libguestfs to export a disk to a compressed sparse qcow2 image (this zeros free blocks as well)
 * Send the resulting image over using dd
 * Advantage: The resulting image should be smaller than what dd | (gzip,bzip2) produces (thanks to qcow2 understanding sparseness)
 * Drawback: Exporting the image takes time (we can't start sending data while it's exporting) and takes up to two times the space of the disk as temporary storage
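
A rough sketch of option 2, assuming virt-sparsify from libguestfs is installed and using hypothetical LVM paths and host names:

```
# Export the instance disk to a compressed, sparse qcow2 image
virt-sparsify --convert qcow2 --compress /dev/xenvg/instance0-disk0 /var/tmp/instance0.qcow2

# Send the (much smaller) image over and write it back on the destination
dd if=/var/tmp/instance0.qcow2 bs=1M | ssh nodeB 'dd of=/var/tmp/instance0.qcow2 bs=1M'
ssh nodeB 'qemu-img convert -O raw /var/tmp/instance0.qcow2 /dev/xenvg/instance0-disk0'
```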

If libguestfs is not available we can still use dd | (gzip,bzip2) instead of 
plain dd for sending the data. This will obviously still send a lot of useless 
data in the process.
BTW, libguestfs is packaged for a number of Linux distributions, and is 
available in Debian as of squeeze-backports.
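
For reference, that fallback amounts to something like the following (hypothetical paths; plain ssh stands in here for the encrypted socat channel Ganeti actually uses):

```
dd if=/dev/xenvg/instance0-disk0 bs=1M | gzip --fast | \
    ssh nodeB 'gunzip -c | dd of=/dev/xenvg/instance0-disk0 bs=1M'
```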

What do you think?

[0]: http://libguestfs.org/guestfs.3.html#security-of-mounting-filesystems
[1]: http://lwn.net/Articles/538898/
[2]: https://www.berrange.com/posts/2013/02/20/a-reminder-why-you-should-never-mount-guest-disk-images-on-the-host-os/
[3]: http://libguestfs.org/
[4]: http://manpages.ubuntu.com/manpages/lucid/man8/zerofree.8.html

Original comment by thoma...@google.com on 30 Sep 2013 at 12:57

GoogleCodeExporter commented 9 years ago

Original comment by thoma...@google.com on 30 Sep 2013 at 1:09

GoogleCodeExporter commented 9 years ago
I conducted benchmarks of various instance move strategies. I used one 
300GB disk with just a minimal Debian OS installed on it. Data was not actually 
sent over the network; I only exported the disk to an image file (on the 
same physical disk). So the timing values are not really meaningful, but the 
resulting image sizes are.
The strategies I tested are:

 - dd:                                 simple `dd if=<block dev> of=<img file> bs=1M`
 - dd | gzip:                          same as dd, but piped through gzip
 - dd | gzip --fast:                   same as dd | gzip, but using the --fast option of gzip
 - dd | bzip2:                         same as dd, but piped through bzip2
 - virt-sparsify:                      use virt-sparsify to create a compressed sparse qcow2 image
   Steps performed by virt-sparsify:
    * Launch small VM using KVM (no hardware acceleration on dom0's...)
    * Create overlay QEMU image, so no write access actually goes to the original disk (requires a lot of temporary space)
    * Mount all filesystems of the disk
    * Fill free blocks with zeros (something like `dd if=/dev/zero of=/tmp/zeros; rm /tmp/zeros`) (writes ~295GB to overlay disk)
    * Unmount filesystems
    * Calls something like `qemu-img convert -c -O qcow2 <overlay image> <destination image>`
 - fill zeros in libguestfs, qemu-img: use libguestfs to fill free space in image with zeros, then qemu-img to create sparse compressed image
   This essentially performs the same steps as virt-sparsify, but does not create an overlay image to "protect" the instance disk.
 - fill zeros in VM, qemu-img:         Free blocks are zeroed from within the VM
 - zerofree in VM, qemu-img:           Call zerofree in VM (requires RO mount of the file systems)
 - fill zeros on host, qemu-img:       Mount file systems on host (insecure), fill with zeros there
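
To make the "fill zeros in VM, qemu-img" variant concrete, it boils down to something like this (hypothetical paths; the first part runs inside the guest, the second on the host once the guest is done):

```
# Inside the guest: fill all free blocks with zeros, then remove the file again
dd if=/dev/zero of=/zerofile bs=1M || true   # dd stops once the filesystem is full
rm /zerofile && sync

# On the host: the zeroed blocks disappear when converting to a compressed qcow2 image
qemu-img convert -c -O qcow2 /dev/xenvg/instance0-disk0 /var/tmp/instance0.qcow2
```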

Some of those strategies were tested in three different setups (see the 
attached diagrams):

 - zerod_disk.png:      The disk was zeroed before handing it to the OS installation scripts, and only the OS was installed on it
 - random_disk.png:     Free blocks of the disk were filled with random data; the actual data on the disk was still only the OS. Note that the dd | bzip2 benchmark didn't run to completion, as it would have taken too long, so its numbers are extrapolations.
 - kvm_random_disk.png: As libguestfs uses KVM to start its appliance, performance was also tested on a machine with hardware virtualization support (unlike the dom0 domains used in the other tests). Timings are not comparable to the other two setups, as the machine was a different one.

The bars in the diagram show:

 - time: The total time it took to export the disk to a disk image (on the same physical disk, so this time is not quite meaningful)
 - size: The size of the resulting image
 - est:  The estimated time of an instance move with a throughput of 30 MB/s (encryption + network speed). For the dd-based strategies, that's size/30. The qemu-img based strategies require storing the image on the host first (qcow2 needs random access while writing), so there it's time + size/30.
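
To put the est formula in perspective: plain dd of the full 300GB at 30 MB/s works out to roughly 300,000 MB / 30 MB/s = 10,000 s, i.e. close to three hours, while a (hypothetical) 10GB compressed image would need only about 330 s on the wire plus whatever time producing the image takes.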

One quick note about throughput: Make sure to use '--enable-socat-compress' 
during `./configure` and a socat version which supports it (see the socat note 
in INSTALL), otherwise the throughput during instance moves will suffer quite a 
bit.
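
That is, configure with something along the lines of:

```
./configure --enable-socat-compress <...other options as usual...>
```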

Performance-wise I'm leaning towards zeroing free blocks in a VM, but running 
it under whichever hypervisor is available on the host (so no fully emulated KVM 
on dom0's). It would be preferable to build on libguestfs, as they have put a 
lot of effort into auto-detecting OSes, filesystems and so on, but we're not 
sure whether that's doable with Xen, for example.

Any thoughts? Comments? Strategies I forgot to benchmark?

Original comment by thoma...@google.com on 11 Oct 2013 at 9:09

Attachments: zerod_disk.png, random_disk.png, kvm_random_disk.png (benchmark diagrams)
GoogleCodeExporter commented 9 years ago
Assigning to Riba, he's working on this.

Original comment by thoma...@google.com on 2 Apr 2014 at 7:23