Live migrations after unapplied hvparam changes can crash instance

GoogleCodeExporter commented 9 years ago

What software version are you running? Please provide the output of
"gnt-cluster --version", "gnt-cluster version", and "hspace --version".

What distribution are you using?
# gnt-cluster --version
gnt-cluster (ganeti v2.11.3) 2.11.3
# gnt-cluster version
Software version: 2.11.3
Internode protocol: 2110000
Configuration format: 2110000
OS api version: 20
Export interface: 0
VCS version: (ganeti) version v2.11.3
# hspace --version
hspace (ganeti) version v2.11.3
compiled with ghc 7.4
running on linux x86_64
# cat /etc/debian_version
7.6
# apt-cache policy ganeti
ganeti:
  Installed: 2.11.3-2~bpo70+1
  Candidate: 2.11.3-2~bpo70+1
  Package pin: 2.11.3-2~bpo70+1
  Version table:
 *** 2.11.3-2~bpo70+1 990
        100 http://debian.xxxxxxxx.de/debian/ wheezy-backports/main amd64 Packages
        100 /var/lib/dpkg/status
     2.10.5-1~bpo70+1 990
        100 http://debian.xxxxxxxx.de/debian/ wheezy-backports/main amd64 Packages
     2.9.5-1~bpo70+1 990
        100 http://debian.xxxxxxxx.de/debian/ wheezy-backports/main amd64 Packages

What steps will reproduce the problem?
1a. gnt-cluster modify -H kvm:spice_bind=0.0.0.0,spice_password_file=/etc/ganeti/auth/default
1b. gnt-instance modify -H spice_bind=0.0.0.0,spice_password_file=/etc/ganeti/auth/default web1
2.  gnt-instance migrate -f web1

What is the expected output? What do you see instead?
  Sun Jul 27 10:58:14 2014 Migrating instance zzzz.yyyyyy.xxxxxxxxx.de
  Sun Jul 27 10:58:14 2014 * checking disk consistency between source and target
  Sun Jul 27 10:58:17 2014 * switching node node2.yyyyyy.xxxxxxxxx.de to secondary mode
  Sun Jul 27 10:58:17 2014 * changing into standalone mode
  Sun Jul 27 10:58:22 2014 * changing disks into dual-master mode
  Sun Jul 27 10:58:28 2014 * wait until resync is done
  Sun Jul 27 10:58:31 2014 * preparing node2.yyyyyy.xxxxxxxxx.de to accept the instance
- Sun Jul 27 10:58:32 2014 Pre-migration failed, aborting
- Sun Jul 27 10:58:33 2014 * switching node node2.yyyyyy.xxxxxxxxx.de to secondary mode
- Sun Jul 27 10:58:34 2014 * changing into standalone mode
- Sun Jul 27 10:58:36 2014 * changing disks into single-master mode
- Sun Jul 27 10:58:39 2014 * wait until resync is done
- Failure: command execution error:
- Could not pre-migrate instance zzzz.yyyyyy.xxxxxxxxx.de: Failed to accept instance: kvm: error executing the set_password command: Device 'spice' has not been activated (DeviceNotActive):

Please provide any additional information below.
As the VM is migrated, the new node tries to read the password file and then
fails because SPICE hasn't been loaded.

Original issue reported on code.google.com by neal.oa...@googlemail.com on 27 Jul 2014 at 9:38

GoogleCodeExporter commented 9 years ago

Original comment by hel...@google.com on 4 Aug 2014 at 9:02

Changed state: Accepted
Added labels: Milestone-Release2.13, Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

Is there a reason why this will not be patched in 2.11?

Original comment by neal.oa...@googlemail.com on 4 Aug 2014 at 10:18

GoogleCodeExporter commented 9 years ago

The main reason is that we don't have enough people to fix everything right
away. If someone sends a patch *hint* *hint*, we could apply the fix earlier. ;)

Original comment by hel...@google.com on 4 Aug 2014 at 10:23

GoogleCodeExporter commented 9 years ago

Issue 930 has been merged into this issue.

Original comment by r...@google.com on 29 Aug 2014 at 11:11

GoogleCodeExporter commented 9 years ago

To summarize the issue merged in: this affects other parameters as well, 
notably the cpu_type.

A fix addressing this properly is beyond the scope of 2.11 - configuration 
changes or significant logic additions will be needed, far beyond what a stable 
release can take.
As a poor substitute, a warning will be added when changing the hvparams of an
online instance, explicitly mentioning live migration.

A proper fix can be done in multiple ways:
- Add both the current and the future state of hvparams to the configuration
- Figure out the currently running parameters when live migrating
- Prevent live migration if the instance has not been rebooted since the change,
unless the user explicitly requests a live migration at their own risk
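The last option could work roughly as sketched below. This is a hypothetical
Python sketch, not Ganeti code: it assumes the hvparams in effect at the
instance's last start are recorded somewhere (as in the first option), and
CRITICAL_PARAMS, pending_changes, check_live_migration and the allow_risky flag
are made-up names. The parameter subset is taken from this report.

```python
# Hypothetical sketch: none of these names exist in Ganeti.

# Parameters that cannot change on a running KVM process and therefore must
# match between source and target of a live migration (illustrative subset
# taken from this report).
CRITICAL_PARAMS = frozenset({"spice_bind", "spice_password_file", "cpu_type"})


def pending_changes(running, configured):
    """Return the critical hvparams whose configured value differs from the
    value the running instance was started with, as {name: (old, new)}."""
    changes = {}
    for name in sorted(CRITICAL_PARAMS):
        old = running.get(name)
        new = configured.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


def check_live_migration(running, configured, allow_risky=False):
    """Raise if unapplied critical changes exist, unless the user explicitly
    accepts the risk via the hypothetical allow_risky flag."""
    changes = pending_changes(running, configured)
    if changes and not allow_risky:
        details = ", ".join(
            f"{k}: {old!r} -> {new!r}" for k, (old, new) in changes.items()
        )
        raise RuntimeError(
            f"instance needs a reboot before live migration ({details})"
        )
    return changes


if __name__ == "__main__":
    # Parameters before and after step 1b of the reproduction above.
    running = {"spice_bind": None, "spice_password_file": None}
    configured = {
        "spice_bind": "0.0.0.0",
        "spice_password_file": "/etc/ganeti/auth/default",
    }
    try:
        check_live_migration(running, configured)
    except RuntimeError as err:
        print(err)
```

With the values from step 1b, the check refuses the migration and names
spice_bind and spice_password_file; passing allow_risky=True stands in for the
user explicitly accepting the risk.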

It should also be noted that having to pass the -f argument to every live
migration is unfortunate, as it leaves us no way to warn users about truly
risky situations like this one.

Original comment by r...@google.com on 29 Aug 2014 at 11:42

Changed title: Live migrations after unapplied hvparam changes can crash instance
Added labels: Hypervisor-KVM, Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

In my opinion it would be best to have a check whether the VM needs a reboot
(i.e. whether any "critical" settings have been changed) and to show a notice:
- during cluster verify
- during instance migration

Migrations should still run with the old settings (maybe one could read them
from /proc/[pid]/cmdline).
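As an illustration of the /proc/[pid]/cmdline idea: on Linux that file holds
the process's arguments, each normally terminated by a NUL byte, so the options
a KVM process was actually started with can be read back regardless of later
configuration changes. The sketch below is hypothetical: it assumes the PID of
the instance's KVM process is already known, and the helper names as well as
the mapping of -cpu and -spice to hvparam-like keys are illustrative
assumptions, not Ganeti's behavior.

```python
from typing import List, Optional


def read_cmdline(pid: int) -> List[str]:
    """Return the argv of a running process from /proc/<pid>/cmdline (Linux)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return parse_cmdline(f.read())


def parse_cmdline(raw: bytes) -> List[str]:
    """Split a NUL-separated cmdline blob into arguments.

    Each argument is normally terminated by a NUL byte, so the trailing empty
    string produced by split() is dropped."""
    args = raw.split(b"\0")
    if args and args[-1] == b"":
        args.pop()
    return [a.decode("utf-8", errors="replace") for a in args]


def option_value(args: List[str], option: str) -> Optional[str]:
    """Return the argument following `option` (e.g. "-cpu"), or None."""
    for i, arg in enumerate(args[:-1]):
        if arg == option:
            return args[i + 1]
    return None


def running_kvm_settings(args: List[str]) -> dict:
    """Hypothetical mapping of a few KVM options back to hvparam-like keys."""
    return {
        "cpu_type": option_value(args, "-cpu"),
        "spice_enabled": "-spice" in args,
    }


if __name__ == "__main__":
    # Synthetic cmdline blob, shaped like what the kernel exposes.
    sample = b"/usr/bin/kvm\0-name\0web1\0-cpu\0host\0-spice\0port=11000\0"
    print(running_kvm_settings(parse_cmdline(sample)))
```

Reading the running command line instead of the stored configuration is what
would let a migration keep using the old settings, as suggested above; in this
report the target would then see that SPICE was never enabled and skip the
set_password call.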

Original comment by neal.oa...@googlemail.com on 5 Sep 2014 at 7:59

GoogleCodeExporter commented 9 years ago

Original comment by pud...@google.com on 3 Jun 2015 at 1:45

Changed state: New
Added labels: Milestone-Unplanned, Priority-Medium
Removed labels: Milestone-Release2.13, Priority-High

olopez32 / ganeti

Live migrations after unapplied hvparam changes can crash instance #901