It seems to work with gnt-instance replace-disks -a test1
Original comment by eyelessf...@gmail.com
on 25 Sep 2012 at 8:37
Have you tried with --ignore-consistency as well? Are you sure the primary of
the instance was correct?
Thanks,
Guido
Original comment by ultrot...@gmail.com
on 9 Dec 2013 at 2:10
--ignore-consistency is not a valid option for replace-disks, and -a fails as well.
I am suffering from exactly the same problem on a production 2-node cluster I'm
in the middle of upgrading. In my test environment (and in prod) we have two
Debian 6 (squeeze) nodes with Ganeti 2.1, which we upgraded in turn to Debian 7
and Ganeti 2.9 from wheezy-backports.
Due to hardware changes on the platform, we have to rebuild each node in turn.
This works perfectly in our test environment (virtualized), with node A
(master) running a dist-upgraded 7.3 with Ganeti 2.9.2, and node B (slave)
running a freshly installed 7.3 with Ganeti 2.9.2. KVM was also upgraded from
1.1.2 to 1.7 on node A to match the KVM version on node B. DRBD is version
8.3.13 on A, 8.14 on B.
I simulated the entire rebuild with the above configuration yesterday, and now
we are stuck as replace-disks fails in the manner described above.
On the test setup, replace-disks -s just worked.
On the prod setup, it failed:
OpExecError: Can't find disk/0 on node B: disk not found
Disks seem to be not properly activated. Try running activate-disks on the
instance before using replace-disks.
So I did; it returned:
nodeA:disk/0:/dev/drbd3
Tried replace-disks again with -s:
Wed Jan 29 18:03:17 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:03:17 2014 - INFO: Checking disk/0 consistency on node A
Failure: command execution error:
Node A has degraded storage, unsafe to replace disks for instance instancename
Per the report above I tried with '-a':
Wed Jan 29 18:02:43 2014 - INFO: Checking disk/0 on B
Wed Jan 29 18:02:43 2014 - INFO: Checking disk/0 on A
Wed Jan 29 18:02:44 2014 No disks need replacement for instance 'instancename'
But clearly the disks are NOT in sync.
Any suggestions at this point?
Original comment by regna...@gmail.com
on 30 Jan 2014 at 2:04
Forgot to include the log messages when running replace-disks -s:
2014-01-29 18:11:11,812: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS
INFO Checking volume groups
2014-01-29 18:11:12,006: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS
INFO Checking disk/0 consistency on node A
2014-01-29 18:11:12,232: ganeti-masterd pid=9336/Jq9/Job394610 ERROR Op 1/1:
Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
timeout=timeout)
File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
return execop_fn(op, *args, **kwargs)
File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
result = self._ExecLU(lu)
File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
tl.Exec(feedback_fn)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
result = fn(feedback_fn)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2347, in _ExecDrbd8DiskOnly
False)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2235, in _CheckDisksConsistency
self.instance.name))
OpExecError: Node A has degraded storage, unsafe to replace disks for instance
instancename
2014-01-29 18:11:12,241: ganeti-masterd pid=9336/ClientReq3 INFO Received job
poll request for 394610
2014-01-29 18:11:12,243: ganeti-masterd pid=9336/ClientReq4 INFO Received job
poll request for 394610
2014-01-29 18:11:12,370: ganeti-masterd pid=9336/Jq9/Job394610 INFO Finished
job 394610, status = error
2014-01-29 18:11:12,486: ganeti-masterd pid=9336/ClientReq5 INFO Received job
query request for 394610
Original comment by regna...@gmail.com
on 30 Jan 2014 at 2:12
And some more traces. I tried reproducing this on node B by starting from
scratch, and indeed:
# gnt-instance replace-disks -a instancename
Wed Jan 29 18:18:49 2014 - INFO: Checking disk/0 on B
Failure: prerequisites not met for this operation:
error type: wrong_state, error details:
Please run activate-disks on instance instancename first
# gnt-instance activate-disks instancename
nodeA:disk/0:/dev/drbd3
# gnt-instance replace-disks -a instancename
Wed Jan 29 18:19:55 2014 - INFO: Checking disk/0 on B
Wed Jan 29 18:19:55 2014 - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014 Replacing disk(s) 0 for instance 'instancename'
Wed Jan 29 18:19:56 2014 Current primary node: A
Wed Jan 29 18:19:56 2014 Current secondary node: B
Wed Jan 29 18:19:56 2014 STEP 1/6 Check device existence
Wed Jan 29 18:19:56 2014 - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014 - INFO: Checking disk/0 on B
Wed Jan 29 18:19:56 2014 - INFO: Checking volume groups
Wed Jan 29 18:19:56 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:19:56 2014 - INFO: Checking disk/0 consistency on node A
Wed Jan 29 18:19:57 2014 STEP 3/6 Allocate new storage
Wed Jan 29 18:19:57 2014 - INFO: Adding storage on B for disk/0
Wed Jan 29 18:19:58 2014 STEP 4/6 Changing drbd configuration
Wed Jan 29 18:19:58 2014 - INFO: Detaching disk/0 drbd from local storage
Wed Jan 29 18:19:58 2014 - INFO: Renaming the old LVs on the target node
Wed Jan 29 18:19:58 2014 - INFO: Renaming the new LVs on the target node
Wed Jan 29 18:19:59 2014 - INFO: Adding new mirror component on B
Wed Jan 29 18:20:01 2014 STEP 5/6 Sync devices
Wed Jan 29 18:20:01 2014 - INFO: Waiting for instance instancename to sync
disks
Wed Jan 29 18:20:13 2014 - INFO: Instance instancename's disks are in sync
[here it seems to be running for about 10 seconds]
Failure: command execution error:
DRBD device disk/0 is degraded!
In the log, during the 10-second pause:
2014-01-29 18:20:01,574: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Waiting for instance instancename to sync disks
2014-01-29 18:20:01,819: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 10 retries left
2014-01-29 18:20:02,190: ganeti-masterd pid=9336/ClientReq1 INFO Received job
poll request for 394618
2014-01-29 18:20:02,990: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 9 retries left
2014-01-29 18:20:04,192: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 8 retries left
2014-01-29 18:20:05,355: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 7 retries left
2014-01-29 18:20:06,527: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 6 retries left
2014-01-29 18:20:07,690: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 5 retries left
2014-01-29 18:20:08,855: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 4 retries left
2014-01-29 18:20:10,022: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 3 retries left
2014-01-29 18:20:11,184: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 2 retries left
2014-01-29 18:20:12,348: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Degraded disks found, 1 retries left
2014-01-29 18:20:13,507: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS
INFO Instance instancename's disks are in sync
2014-01-29 18:20:13,710: ganeti-masterd pid=9336/Jq16/Job394618 ERROR Op 1/1:
Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
timeout=timeout)
File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
return execop_fn(op, *args, **kwargs)
File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout)
File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
result = self._ExecLU(lu)
File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
tl.Exec(feedback_fn)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
result = fn(feedback_fn)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2453, in _ExecDrbd8DiskOnly
self._CheckDevices(self.instance.primary_node, iv_names)
File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2300, in _CheckDevices
raise errors.OpExecError("DRBD device %s is degraded!" % name)
OpExecError: DRBD device disk/0 is degraded!
2014-01-29 18:20:13,831: ganeti-masterd pid=9336/Jq16/Job394618 INFO Finished
job 394618, status = error
Original comment by regna...@gmail.com
on 30 Jan 2014 at 2:25
OK, so it seems I've found the origin of the problem. For some reason, even
after doing rmmod / depmod -a / modprobe, the settings in /etc/modules
(including usermode_helper) hadn't been picked up.
Looking here
https://groups.google.com/forum/#!msg/ganeti/7h1i-yWcp4s/FHQzVY_6DB0J I was
reminded to check /sys/module/drbd/parameters/, and indeed the helper was
still /sbin/drbdadm. I rebooted the prod slave node, and now it's behaving as
expected.
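For anyone hitting the same symptoms, here is a quick sketch of that check. It assumes the /bin/true expectation from Ganeti's installation notes; check_drbd_helper is just my own function name, not a Ganeti command:

```shell
# Sketch: check which usermode helper the loaded drbd module is actually using.
# Ganeti's install docs set usermode_helper=/bin/true (via /etc/modules or
# /etc/modprobe.d); if the module was never truly reloaded, the old
# /sbin/drbdadm value lingers in sysfs until a reboot.

check_drbd_helper() {
    if [ -r "$1" ]; then
        helper=$(cat "$1")
        if [ "$helper" = "/bin/true" ]; then
            echo "OK: usermode_helper is /bin/true"
        else
            echo "MISMATCH: usermode_helper is $helper (expected /bin/true)"
        fi
    else
        echo "drbd module not loaded"
    fi
}

check_drbd_helper /sys/module/drbd/parameters/usermode_helper
```

A MISMATCH here on an otherwise "upgraded" node is exactly the condition described above, and a reboot (or a genuine module reload) clears it.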
Still something odd: modinfo drbd tells me I'm running 8.3.11 drbd in the
kernel, but the utils are 8.4:
ii drbd8-utils 2:8.4.4-1~bpo70+1 amd64
I'll ask about this on the discussion group.
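A minimal sketch for spotting that kind of branch mismatch between the in-kernel module and the userland tools; compare_drbd_versions is a hypothetical helper of mine, and on a real node you would feed it the output of `modinfo -F version drbd` plus the drbd8-utils version from dpkg:

```shell
# Sketch: warn when the kernel DRBD module and the userland utils come
# from different release branches (e.g. an 8.3 module with 8.4 utils).

compare_drbd_versions() {
    kmod_branch="${1%.*}"    # 8.3.11 -> 8.3
    utils_branch="${2%.*}"   # 8.4.4  -> 8.4
    if [ "$kmod_branch" = "$utils_branch" ]; then
        echo "OK: kernel $1 and utils $2 are the same branch"
    else
        echo "WARN: kernel module is $1 but utils are $2"
    fi
}

# On a node, something like:
#   compare_drbd_versions "$(modinfo -F version drbd)" 8.4.4
compare_drbd_versions 8.3.11 8.4.4
```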
Original comment by regna...@gmail.com
on 30 Jan 2014 at 4:43
Original issue reported on code.google.com by
eyelessf...@gmail.com
on 25 Sep 2012 at 7:51