
Automatically exported from code.google.com/p/ganeti

drbd sync fails on readded node #270


GoogleCodeExporter commented 9 years ago
What software version are you running? Please provide the output of "gnt-cluster --version" and "gnt-cluster version".
gnt-cluster (ganeti v2.4.5) 2.4.5

Software version: 2.4.5
Internode protocol: 2040000
Configuration format: 2040000
OS api version: 20
Export interface: 0

What distribution are you using?
Gentoo and Ubuntu 12.04

Both nodes run the same Ganeti and DRBD versions:
version: 8.3.11 (api:88/proto:86-96)
What steps will reproduce the problem?
1. Reinstalled node2
2. Re-added node2 (see the command sketch below)
3. gnt-instance replace-disks -s test1
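
For reference, step 2 corresponds roughly to the commands below. This is a sketch, not a transcript from this cluster: the report only says the node was re-added, so the exact gnt-node invocation is an assumption.

# Re-add the reinstalled node (assumed invocation; not shown in the report)
gnt-node add --readd node2
# Rebuild the secondary's DRBD data for the instance
gnt-instance replace-disks -s test1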

What is the expected output? What do you see instead?

Please provide any additional information below.

The sync fails after running for a while, with Ganeti reporting that the disks are degraded.

gnt-instance replace-disks -s test1
Tue Sep 25 21:36:25 2012 Replacing disk(s) 0 for test1.lan
Tue Sep 25 21:36:36 2012 STEP 1/6 Check device existence
Tue Sep 25 21:36:36 2012  - INFO: Checking disk/0 on node1.lan
Tue Sep 25 21:36:40 2012  - INFO: Checking disk/0 on bestefar.ss.uib.no
Tue Sep 25 21:36:40 2012  - INFO: Checking volume groups
Tue Sep 25 21:36:42 2012 STEP 2/6 Check peer consistency
Tue Sep 25 21:36:42 2012  - INFO: Checking disk/0 consistency on node node1.lan
Failure: command execution error:
Node node1.lan has degraded storage, unsafe to replace disks for instance 
test1.lan
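
The "degraded storage" message means Ganeti found the DRBD device on the primary not fully in sync before it would touch the secondary. The same state can be inspected by hand; a minimal check, assuming minor 1 (as in the kernel log below) and the instance name from this report:

# DRBD's own view: the ds: field for minor 1 shows local/peer disk state,
# e.g. ds:UpToDate/Inconsistent while a resync is still running
cat /proc/drbd
# Ganeti's view of the instance's disks on both nodes
gnt-instance info test1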

primary:
Sep 25 21:35:16 node1 ganeti-rapi[17004]: INFO 127.0.0.1:43227 GET /version 
HTTP/1.1 200
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq1) INFO Received config 
values query request for ['master_node']
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq5) INFO Received job 
autoarchive request for age 21600, timeout 29
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq5) INFO Archiving jobs 
with age more than 21600 seconds
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq15) INFO Received new job
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq11) INFO Received job poll 
request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq7) INFO Received job poll 
request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (JobQueue3/Job81612) INFO Op 1/2: 
opcode INSTANCE_QUERY waiting for locks
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq10) INFO Received job poll 
request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq13) INFO Received job poll 
request for 81612
Sep 25 21:35:17 node1 ganeti-noded[17015]: INFO 192.168.1.35:51808 PUT 
/all_instances_info HTTP/1.1 200
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq3) INFO Received job poll 
request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (JobQueue22/Job81612) INFO Op 2/2: 
opcode NODE_QUERY waiting for locks
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq4) INFO Received job poll 
request for 81612
Sep 25 21:35:18 node1 ganeti-masterd[676]: (ClientReq8) INFO Received job poll 
request for 81612
Sep 25 21:35:19 node1 ganeti-noded[17038]: INFO 192.168.1.35:51813 PUT 
/node_info HTTP/1.1 200
Sep 25 21:35:19 node1 ganeti-masterd[676]: (JobQueue22/Job81612) INFO Finished 
job 81612, status = success
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq2) INFO Received job query 
request for 81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq14) INFO Received job 
archive request for 81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq14) INFO Archiving job 
81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq12) INFO Received new job
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq16) INFO Received job poll 
request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq9) INFO Received job poll 
request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (JobQueue13/Job81613) INFO Op 1/1: 
opcode CLUSTER_VERIFY_DISKS waiting for locks
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq6) INFO Received job poll 
request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq1) INFO Received job poll 
request for 81613
Sep 25 21:35:21 node1 ganeti-noded[17055]: INFO 192.168.1.35:51821 PUT /lv_list 
HTTP/1.1 200
Sep 25 21:35:21 node1 ganeti-masterd[676]: (JobQueue13/Job81613) INFO Finished 
job 81613, status = success
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq5) INFO Received job query 
request for 81613
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq15) INFO Received job 
archive request for 81613
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq15) INFO Archiving job 
81613
Sep 25 21:36:24 node1 gnt-instance replace-disks[17172]: INFO run with 
arguments '-s test1'
Sep 25 21:36:24 node1 ganeti-masterd[676]: (MainThread) INFO Accepted 
connection from pid=17172, uid=0, gid=0
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq11) INFO Received new job
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq7) INFO Received job poll 
request for 81614
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq10) INFO Received job poll 
request for 81614
Sep 25 21:36:25 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Op 1/1: 
opcode INSTANCE_REPLACE_DISKS(test1) waiting for locks
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq13) INFO Received job poll 
request for 81614
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq3) INFO Received job poll 
request for 81614
Sep 25 21:36:25 node1 ganeti-noded[17183]: INFO 192.168.1.35:51829 PUT 
/hooks_runner HTTP/1.1 200
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq4) INFO Received job poll 
request for 81614
Sep 25 21:36:29 node1 kernel: [10648783.529804] block drbd1: Starting worker 
thread (from drbdsetup [17196])
Sep 25 21:36:29 node1 kernel: [10648783.529911] block drbd1: disk( Diskless -> 
Attaching ) 
Sep 25 21:36:29 node1 kernel: [10648783.531221] block drbd1: No usable activity 
log found.
Sep 25 21:36:29 node1 kernel: [10648783.531225] block drbd1: Method to ensure 
write ordering: flush
Sep 25 21:36:29 node1 kernel: [10648783.531232] block drbd1: drbd_bm_resize 
called with capacity == 41943040
Sep 25 21:36:29 node1 kernel: [10648783.531443] block drbd1: resync bitmap: 
bits=5242880 words=81920 pages=160
Sep 25 21:36:29 node1 kernel: [10648783.531447] block drbd1: size = 20 GB 
(20971520 KB)
Sep 25 21:36:29 node1 kernel: [10648783.536060] block drbd1: bitmap READ of 160 
pages took 5 jiffies
Sep 25 21:36:29 node1 kernel: [10648783.536797] block drbd1: recounting of set 
bits took additional 0 jiffies
Sep 25 21:36:29 node1 kernel: [10648783.536800] block drbd1: 17 GB (4388992 
bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:29 node1 kernel: [10648783.536807] block drbd1: disk( Attaching -> 
UpToDate ) pdsk( DUnknown -> Outdated ) 
Sep 25 21:36:29 node1 kernel: [10648783.536812] block drbd1: attached to UUIDs 
3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004
Sep 25 21:36:29 node1 kernel: [10648783.575210] block drbd1: conn( StandAlone 
-> Unconnected ) 
Sep 25 21:36:29 node1 kernel: [10648783.575239] block drbd1: Starting receiver 
thread (from drbd1_worker [17197])
Sep 25 21:36:29 node1 kernel: [10648783.575288] block drbd1: receiver 
(re)started
Sep 25 21:36:29 node1 kernel: [10648783.575297] block drbd1: conn( Unconnected 
-> WFConnection ) 
Sep 25 21:36:29 node1 ganeti-noded[17187]: INFO 192.168.1.35:51830 PUT 
/blockdev_assemble HTTP/1.1 200
Sep 25 21:36:30 node1 kernel: [10648784.383073] block drbd1: Handshake 
successful: Agreed network protocol version 96
Sep 25 21:36:30 node1 kernel: [10648784.383365] block drbd1: Peer authenticated 
using 16 bytes of 'md5' HMAC
Sep 25 21:36:30 node1 kernel: [10648784.383375] block drbd1: conn( WFConnection 
-> WFReportParams ) 
Sep 25 21:36:30 node1 kernel: [10648784.383396] block drbd1: Starting asender 
thread (from drbd1_receiver [17212])
Sep 25 21:36:30 node1 kernel: [10648784.383869] block drbd1: 
data-integrity-alg: <not-used>
Sep 25 21:36:30 node1 kernel: [10648784.383896] block drbd1: 
drbd_sync_handshake:
Sep 25 21:36:30 node1 kernel: [10648784.383900] block drbd1: self 
3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004 
bits:4388992 flags:0
Sep 25 21:36:30 node1 kernel: [10648784.383905] block drbd1: peer 
0005000000000004:0000000000000000:0000000000000000:0000000000000000 
bits:4388992 flags:0
Sep 25 21:36:30 node1 kernel: [10648784.383909] block drbd1: uuid_compare()=1 
by rule 70
Sep 25 21:36:30 node1 kernel: [10648784.383912] block drbd1: Becoming sync 
source due to disk states.
Sep 25 21:36:30 node1 kernel: [10648784.383922] block drbd1: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent 
) 
Sep 25 21:36:30 node1 kernel: [10648784.431059] block drbd1: helper command: 
/bin/true before-resync-source minor-1
Sep 25 21:36:30 node1 kernel: [10648784.431734] block drbd1: helper command: 
/bin/true before-resync-source minor-1 exit code 0 (0x0)
Sep 25 21:36:30 node1 kernel: [10648784.431741] block drbd1: conn( WFBitMapS -> 
SyncSource ) 
Sep 25 21:36:30 node1 kernel: [10648784.431751] block drbd1: Began resync as 
SyncSource (will sync 17555968 KB [4388992 bits set]).
Sep 25 21:36:30 node1 kernel: [10648784.431756] block drbd1: updated sync UUID 
3534D005D75C18E0:0006000000000004:0005000000000004:0004000000000004
Sep 25 21:36:36 node1 kernel: [10648790.672360] block drbd1: role( Secondary -> 
Primary ) 
Sep 25 21:36:36 node1 ganeti-noded[17222]: INFO 192.168.1.35:51832 PUT 
/blockdev_assemble HTTP/1.1 200
Sep 25 21:36:36 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking 
disk/0 on node1.lan
Sep 25 21:36:37 node1 ganeti-masterd[676]: (ClientReq8) INFO Received job poll 
request for 81614
Sep 25 21:36:37 node1 ganeti-masterd[676]: (ClientReq2) INFO Received job poll 
request for 81614
Sep 25 21:36:40 node1 ganeti-noded[17277]: INFO 192.168.1.35:51833 PUT 
/blockdev_find HTTP/1.1 200
Sep 25 21:36:40 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking 
disk/0 on bestefar.ss.uib.no
Sep 25 21:36:40 node1 ganeti-masterd[676]: (ClientReq14) INFO Received job poll 
request for 81614
Sep 25 21:36:40 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking 
volume groups
Sep 25 21:36:40 node1 ganeti-masterd[676]: (ClientReq12) INFO Received job poll 
request for 81614
Sep 25 21:36:42 node1 ganeti-noded[17288]: INFO 192.168.1.35:51835 PUT /vg_list 
HTTP/1.1 200
Sep 25 21:36:42 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking 
disk/0 consistency on node node1.lan
Sep 25 21:36:42 node1 ganeti-masterd[676]: (ClientReq16) INFO Received job poll 
request for 81614
Sep 25 21:36:42 node1 ganeti-masterd[676]: (ClientReq9) INFO Received job poll 
request for 81614
Sep 25 21:36:45 node1 ganeti-noded[17295]: INFO 192.168.1.35:51837 PUT 
/blockdev_find HTTP/1.1 200
Sep 25 21:36:45 node1 ganeti-noded[17303]: INFO 192.168.1.35:51838 PUT 
/instance_list HTTP/1.1 200
Sep 25 21:36:48 node1 kernel: [10648802.423076] block drbd1: role( Primary -> 
Secondary ) 
Sep 25 21:36:48 node1 kernel: [10648802.482407] block drbd1: peer( Secondary -> 
Unknown ) conn( SyncSource -> Disconnecting ) 
Sep 25 21:36:48 node1 kernel: [10648802.482514] block drbd1: asender terminated
Sep 25 21:36:48 node1 kernel: [10648802.482520] block drbd1: Terminating 
drbd1_asender
Sep 25 21:36:48 node1 kernel: [10648802.493169] block drbd1: bitmap WRITE of 
152 pages took 3 jiffies
Sep 25 21:36:48 node1 kernel: [10648802.516910] block drbd1: 16 GB (4112512 
bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:48 node1 kernel: [10648802.516922] block drbd1: Connection closed
Sep 25 21:36:48 node1 kernel: [10648802.516932] block drbd1: conn( 
Disconnecting -> StandAlone ) 
Sep 25 21:36:48 node1 kernel: [10648802.516954] block drbd1: receiver terminated
Sep 25 21:36:48 node1 kernel: [10648802.516957] block drbd1: Terminating 
drbd1_receiver
Sep 25 21:36:48 node1 kernel: [10648802.516997] block drbd1: disk( UpToDate -> 
Failed ) 
Sep 25 21:36:48 node1 kernel: [10648802.517023] block drbd1: Sending state for 
detaching disk failed
Sep 25 21:36:48 node1 kernel: [10648802.517037] block drbd1: disk( Failed -> 
Diskless ) 
Sep 25 21:36:48 node1 kernel: [10648802.517221] block drbd1: drbd_bm_resize 
called with capacity == 0
Sep 25 21:36:48 node1 kernel: [10648802.517270] block drbd1: worker terminated
Sep 25 21:36:48 node1 kernel: [10648802.517273] block drbd1: Terminating 
drbd1_worker
Sep 25 21:36:50 node1 ganeti-noded[17305]: INFO 192.168.1.35:51839 PUT 
/blockdev_shutdown HTTP/1.1 200
Sep 25 21:36:50 node1 ganeti-masterd[676]: (JobQueue25/Job81614) ERROR Op 1/1: 
Caught exception in INSTANCE_REPLACE_DISKS(test1)
Sep 25 21:36:50 node1 Traceback (most recent call last):
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/jqueue.py", line 936, in 
_ExecOpCodeUnlocked
Sep 25 21:36:50 node1 timeout=timeout, priority=op.priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 352, in ExecOpCode
Sep 25 21:36:50 node1 priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 296, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, 
priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 305, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, 
priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 296, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, 
priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 256, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._ExecLU(lu)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", 
line 231, in _ExecLU
Sep 25 21:36:50 node1 result = lu.Exec(self.Log)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 270, in Exec
Sep 25 21:36:50 node1 tl.Exec(feedback_fn)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8365, in Exec
Sep 25 21:36:50 node1 result = fn(feedback_fn)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8536, in 
_ExecDrbd8DiskOnly
Sep 25 21:36:50 node1 False)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8431, in 
_CheckDisksConsistency
Sep 25 21:36:50 node1 (node_name, self.instance.name))
Sep 25 21:36:50 node1 OpExecError: Node node1.lan has degraded storage, unsafe 
to replace disks for instance test1.lan
Sep 25 21:36:50 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Finished 
job 81614, status = error
Sep 25 21:36:50 node1 ganeti-masterd[676]: (ClientReq6) INFO Received job query 
request for 81614
Sep 25 21:36:50 node1 gnt-instance replace-disks[17172]: ERROR Error during 
command processing
Sep 25 21:36:50 node1 Traceback (most recent call last):
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", 
line 1939, in GenericMain
Sep 25 21:36:50 node1 result = func(options, args)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/client/gnt_instance.py", line 781, 
in ReplaceDisks
Sep 25 21:36:50 node1 SubmitOrSend(op, opts)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", 
line 1757, in SubmitOrSend
Sep 25 21:36:50 node1 return SubmitOpCode(op, cl=cl, feedback_fn=feedback_fn, 
opts=opts)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", 
line 1734, in SubmitOpCode
Sep 25 21:36:50 node1 reporter=reporter)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", 
line 1715, in PollJob
Sep 25 21:36:50 node1 return GenericPollJob(job_id, _LuxiJobPollCb(cl), 
reporter)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", 
line 1537, in GenericPollJob
Sep 25 21:36:50 node1 errors.MaybeRaise(msg)
Sep 25 21:36:50 node1 File 
"/usr/lib64/python2.7/site-packages/ganeti/errors.py", line 448, in MaybeRaise
Sep 25 21:36:50 node1 raise errcls, args
Sep 25 21:36:50 node1 OpExecError: Node node1.lan has degraded storage, unsafe 
to replace disks for instance test1.lan

secondary:
Sep 25 21:36:29 node2 kernel: [  597.351382] block drbd1: disk( Diskless -> 
Attaching ) 
Sep 25 21:36:29 node2 kernel: [  597.352712] block drbd1: No usable activity 
log found.
Sep 25 21:36:29 node2 kernel: [  597.352717] block drbd1: Method to ensure 
write ordering: flush
Sep 25 21:36:29 node2 kernel: [  597.352725] block drbd1: drbd_bm_resize called 
with capacity == 41943040
Sep 25 21:36:29 node2 kernel: [  597.352878] block drbd1: resync bitmap: 
bits=5242880 words=81920 pages=160
Sep 25 21:36:29 node2 kernel: [  597.352882] block drbd1: size = 20 GB 
(20971520 KB)
Sep 25 21:36:29 node2 kernel: [  597.355536] block drbd1: bitmap READ of 160 
pages took 1 jiffies
Sep 25 21:36:29 node2 kernel: [  597.355673] block drbd1: recounting of set 
bits took additional 0 jiffies
Sep 25 21:36:29 node2 kernel: [  597.355676] block drbd1: 17 GB (4388992 bits) 
marked out-of-sync by on disk bit-map.
Sep 25 21:36:29 node2 kernel: [  597.355682] block drbd1: disk( Attaching -> 
Inconsistent ) 
Sep 25 21:36:29 node2 kernel: [  597.355685] block drbd1: attached to UUIDs 
0005000000000004:0000000000000000:0000000000000000:0000000000000000
Sep 25 21:36:29 node2 kernel: [  597.395008] block drbd1: conn( StandAlone -> 
Unconnected ) 
Sep 25 21:36:29 node2 kernel: [  597.395058] block drbd1: Starting receiver 
thread (from drbd1_worker [3587])
Sep 25 21:36:29 node2 kernel: [  597.395300] block drbd1: receiver (re)started
Sep 25 21:36:29 node2 kernel: [  597.395308] block drbd1: conn( Unconnected -> 
WFConnection ) 
Sep 25 21:36:30 node2 kernel: [  597.895093] block drbd1: Handshake successful: 
Agreed network protocol version 96
Sep 25 21:36:30 node2 kernel: [  597.895335] block drbd1: Peer authenticated 
using 16 bytes of 'md5' HMAC
Sep 25 21:36:30 node2 kernel: [  597.895343] block drbd1: conn( WFConnection -> 
WFReportParams ) 
Sep 25 21:36:30 node2 kernel: [  597.895451] block drbd1: Starting asender 
thread (from drbd1_receiver [3601])
Sep 25 21:36:30 node2 kernel: [  597.895703] block drbd1: data-integrity-alg: 
<not-used>
Sep 25 21:36:30 node2 kernel: [  597.895720] block drbd1: drbd_sync_handshake:
Sep 25 21:36:30 node2 kernel: [  597.895724] block drbd1: self 
0005000000000004:0000000000000000:0000000000000000:0000000000000000 
bits:4388992 flags:0
Sep 25 21:36:30 node2 kernel: [  597.895729] block drbd1: peer 
3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004 
bits:4388992 flags:0
Sep 25 21:36:30 node2 kernel: [  597.895732] block drbd1: uuid_compare()=-1 by 
rule 50
Sep 25 21:36:30 node2 kernel: [  597.895735] block drbd1: Becoming sync target 
due to disk states.
Sep 25 21:36:30 node2 kernel: [  597.895742] block drbd1: peer( Unknown -> 
Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Sep 25 21:36:30 node2 kernel: [  597.941628] block drbd1: conn( WFBitMapT -> 
WFSyncUUID ) 
Sep 25 21:36:30 node2 kernel: [  597.985957] block drbd1: updated sync uuid 
0006000000000004:0000000000000000:0000000000000000:0000000000000000
Sep 25 21:36:30 node2 kernel: [  597.994754] block drbd1: helper command: 
/bin/true before-resync-target minor-1
Sep 25 21:36:30 node2 kernel: [  597.995590] block drbd1: helper command: 
/bin/true before-resync-target minor-1 exit code 0 (0x0)
Sep 25 21:36:30 node2 kernel: [  597.995597] block drbd1: conn( WFSyncUUID -> 
SyncTarget ) 
Sep 25 21:36:30 node2 kernel: [  597.995606] block drbd1: Began resync as 
SyncTarget (will sync 17555968 KB [4388992 bits set]).
Sep 25 21:36:36 node2 kernel: [  604.174503] block drbd1: peer( Secondary -> 
Primary ) 
Sep 25 21:36:48 node2 kernel: [  615.966592] block drbd1: peer( Primary -> 
Secondary ) 
Sep 25 21:36:48 node2 kernel: [  615.966815] block drbd1: peer( Secondary -> 
Unknown ) conn( SyncTarget -> TearDown ) pdsk( UpToDate -> DUnknown ) 
Sep 25 21:36:48 node2 kernel: [  615.967136] block drbd1: meta connection shut 
down by peer.
Sep 25 21:36:48 node2 kernel: [  615.967627] block drbd1: asender terminated
Sep 25 21:36:48 node2 kernel: [  615.967651] block drbd1: Terminating 
drbd1_asender
Sep 25 21:36:48 node2 kernel: [  615.977086] block drbd1: bitmap WRITE of 152 
pages took 2 jiffies
Sep 25 21:36:48 node2 kernel: [  615.997371] block drbd1: 16 GB (4112512 bits) 
marked out-of-sync by on disk bit-map.
Sep 25 21:36:48 node2 kernel: [  615.997392] block drbd1: Connection closed
Sep 25 21:36:48 node2 kernel: [  615.997398] block drbd1: conn( TearDown -> 
Unconnected ) 
Sep 25 21:36:48 node2 kernel: [  615.997404] block drbd1: receiver terminated
Sep 25 21:36:48 node2 kernel: [  615.997406] block drbd1: Restarting 
drbd1_receiver
Sep 25 21:36:48 node2 kernel: [  615.997409] block drbd1: receiver (re)started
Sep 25 21:36:48 node2 kernel: [  615.997414] block drbd1: conn( Unconnected -> 
WFConnection ) 
Sep 25 21:36:50 node2 kernel: [  618.163815] block drbd1: conn( WFConnection -> 
Disconnecting ) 
Sep 25 21:36:50 node2 kernel: [  618.163905] block drbd1: Discarding network 
configuration.
Sep 25 21:36:50 node2 kernel: [  618.164037] block drbd1: Connection closed
Sep 25 21:36:50 node2 kernel: [  618.164051] block drbd1: conn( Disconnecting 
-> StandAlone ) 
Sep 25 21:36:50 node2 kernel: [  618.164238] block drbd1: receiver terminated
Sep 25 21:36:50 node2 kernel: [  618.164243] block drbd1: Terminating 
drbd1_receiver
Sep 25 21:36:50 node2 kernel: [  618.164301] block drbd1: disk( Inconsistent -> 
Failed ) 
Sep 25 21:36:50 node2 kernel: [  618.164369] block drbd1: Sending state for 
detaching disk failed
Sep 25 21:36:50 node2 kernel: [  618.164848] block drbd1: disk( Failed -> 
Diskless ) 
Sep 25 21:36:50 node2 kernel: [  618.166200] block drbd1: drbd_bm_resize called 
with capacity == 0
Sep 25 21:36:50 node2 kernel: [  618.166243] block drbd1: worker terminated
Sep 25 21:36:50 node2 kernel: [  618.166246] block drbd1: Terminating 
drbd1_worker

Original issue reported on code.google.com by eyelessf...@gmail.com on 25 Sep 2012 at 7:51

GoogleCodeExporter commented 9 years ago
It seems to work with gnt-instance replace-disks -a test1

Original comment by eyelessf...@gmail.com on 25 Sep 2012 at 8:37

GoogleCodeExporter commented 9 years ago
Have you tried with --ignore-consistency as well? Are you sure the primary of 
the instance was correct?

Thanks,

Guido

Original comment by ultrot...@gmail.com on 9 Dec 2013 at 2:10

GoogleCodeExporter commented 9 years ago
--ignore-consistency is not valid for replace-disks, and -a fails as well.

I am hitting exactly the same problem on a production two-node cluster that I'm in the middle of upgrading. In both my test environment and in prod we have two Debian 6 (squeeze) nodes with Ganeti 2.1, which we in turn upgraded to Debian 7 and Ganeti 2.9 from wheezy-backports.

Due to hardware changes on the platform, we have to rebuild each node in turn. This works perfectly in our test environment (virtualized), with node A (master) running a dist-upgraded 7.3 with Ganeti 2.9.2, and node B (slave) running a freshly installed 7.3 with Ganeti 2.9.2. KVM was also upgraded from 1.1.2 to 1.7 on node A to match the KVM version on node B. DRBD is version 8.3.13 on A, 8.14 on B.

I simulated the entire rebuild with the above configuration yesterday, and now 
we are stuck as replace-disks fails in the manner described above.

On the test setup, replace-disks -s just worked.

On the prod setup, it failed:

OpExecError: Can't find disk/0 on node B: disk not found
Disks seem to be not properly activated. Try running activate-disks on the 
instance before using replace-disks.

So I did, it returned:
nodeA:disk/0:/dev/drbd3

Tried replace-disks again with -s:

Wed Jan 29 18:03:17 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:03:17 2014  - INFO: Checking disk/0 consistency on node A
Failure: command execution error:
Node A has degraded storage, unsafe to replace disks for instance instancename

Per the report above I tried with '-a':

Wed Jan 29 18:02:43 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:02:43 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:02:44 2014 No disks need replacement for instance 'instancename'

But clearly the disks are NOT in sync.
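
To see what DRBD itself reports while Ganeti claims nothing needs replacing, the state can be checked directly on both nodes. These are standard commands; /dev/drbd3 is simply the minor that activate-disks printed above:

# Connection state (cs:) and disk state (ds:) for every minor, on each node
cat /proc/drbd
# Ganeti's cluster-wide disk check; it reports instances with degraded disks
gnt-cluster verify-disks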

Any suggestions at this point?

Original comment by regna...@gmail.com on 30 Jan 2014 at 2:04

GoogleCodeExporter commented 9 years ago
Forgot to include the log messages when running replace-disks -s:

2014-01-29 18:11:11,812: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS 
INFO Checking volume groups
2014-01-29 18:11:12,006: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS 
INFO Checking disk/0 consistency on node A
2014-01-29 18:11:12,232: ganeti-masterd pid=9336/Jq9/Job394610 ERROR Op 1/1: 
Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
    timeout=timeout)
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
    return execop_fn(op, *args, **kwargs)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
    calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
    tl.Exec(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
    result = fn(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2347, in _ExecDrbd8DiskOnly
    False)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2235, in _CheckDisksConsistency
    self.instance.name))
OpExecError: Node A has degraded storage, unsafe to replace disks for instance 
instancename
2014-01-29 18:11:12,241: ganeti-masterd pid=9336/ClientReq3 INFO Received job 
poll request for 394610
2014-01-29 18:11:12,243: ganeti-masterd pid=9336/ClientReq4 INFO Received job 
poll request for 394610
2014-01-29 18:11:12,370: ganeti-masterd pid=9336/Jq9/Job394610 INFO Finished 
job 394610, status = error
2014-01-29 18:11:12,486: ganeti-masterd pid=9336/ClientReq5 INFO Received job 
query request for 394610

Original comment by regna...@gmail.com on 30 Jan 2014 at 2:12

GoogleCodeExporter commented 9 years ago
And some more traces. I tried reproducing this on node B by starting from 
scratch, and indeed:

# gnt-instance replace-disks -a instancename
Wed Jan 29 18:18:49 2014  - INFO: Checking disk/0 on B
Failure: prerequisites not met for this operation:
error type: wrong_state, error details:
Please run activate-disks on instance instancename first

# gnt-instance activate-disks instancename
nodeA:disk/0:/dev/drbd3

# gnt-instance replace-disks -a instancename
Wed Jan 29 18:19:55 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:19:55 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014 Replacing disk(s) 0 for instance 'instancename'
Wed Jan 29 18:19:56 2014 Current primary node: A
Wed Jan 29 18:19:56 2014 Current secondary node: B
Wed Jan 29 18:19:56 2014 STEP 1/6 Check device existence
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:19:56 2014  - INFO: Checking volume groups
Wed Jan 29 18:19:56 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 consistency on node A
Wed Jan 29 18:19:57 2014 STEP 3/6 Allocate new storage
Wed Jan 29 18:19:57 2014  - INFO: Adding storage on B for disk/0
Wed Jan 29 18:19:58 2014 STEP 4/6 Changing drbd configuration
Wed Jan 29 18:19:58 2014  - INFO: Detaching disk/0 drbd from local storage
Wed Jan 29 18:19:58 2014  - INFO: Renaming the old LVs on the target node
Wed Jan 29 18:19:58 2014  - INFO: Renaming the new LVs on the target node
Wed Jan 29 18:19:59 2014  - INFO: Adding new mirror component on B
Wed Jan 29 18:20:01 2014 STEP 5/6 Sync devices
Wed Jan 29 18:20:01 2014  - INFO: Waiting for instance instancename to sync 
disks
Wed Jan 29 18:20:13 2014  - INFO: Instance instancename's disks are in sync

[here it seems to be running for about 10 seconds]

Failure: command execution error:
DRBD device disk/0 is degraded!

In the log, during the 10 second pause:

2014-01-29 18:20:01,574: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Waiting for instance instancename to sync disks
2014-01-29 18:20:01,819: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 10 retries left
2014-01-29 18:20:02,190: ganeti-masterd pid=9336/ClientReq1 INFO Received job 
poll request for 394618
2014-01-29 18:20:02,990: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 9 retries left
2014-01-29 18:20:04,192: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 8 retries left
2014-01-29 18:20:05,355: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 7 retries left
2014-01-29 18:20:06,527: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 6 retries left
2014-01-29 18:20:07,690: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 5 retries left
2014-01-29 18:20:08,855: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 4 retries left
2014-01-29 18:20:10,022: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 3 retries left
2014-01-29 18:20:11,184: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 2 retries left
2014-01-29 18:20:12,348: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Degraded disks found, 1 retries left
2014-01-29 18:20:13,507: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS 
INFO Instance instancename's disks are in sync
2014-01-29 18:20:13,710: ganeti-masterd pid=9336/Jq16/Job394618 ERROR Op 1/1: 
Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
    timeout=timeout)
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
    return execop_fn(op, *args, **kwargs)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
    calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
    tl.Exec(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
    result = fn(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2453, in _ExecDrbd8DiskOnly
    self._CheckDevices(self.instance.primary_node, iv_names)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2300, in _CheckDevices
    raise errors.OpExecError("DRBD device %s is degraded!" % name)
OpExecError: DRBD device disk/0 is degraded!
2014-01-29 18:20:13,831: ganeti-masterd pid=9336/Jq16/Job394618 INFO Finished 
job 394618, status = error

Original comment by regna...@gmail.com on 30 Jan 2014 at 2:25

GoogleCodeExporter commented 9 years ago
OK, so it seems I've found the origin of the problem. For some reason, even after doing rmmod / depmod -a / modprobe, the settings in /etc/modules (including usermode_helper) hadn't been picked up.

Looking at https://groups.google.com/forum/#!msg/ganeti/7h1i-yWcp4s/FHQzVY_6DB0J I was reminded to check /sys/module/drbd/parameters/, and indeed the helper was still /sbin/drbdadm. I rebooted the prod slave node, and now it's behaving as expected.
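
Since the root cause was the drbd module still using /sbin/drbdadm as its usermode helper instead of /bin/true, this can be verified and, if the DRBD devices can be taken down, corrected without a reboot. A sketch; usermode_helper and minor_count are standard drbd module parameters, and 128 is the value usually suggested for Ganeti, not something taken from this report:

# Check which helper the loaded drbd module is actually using
cat /sys/module/drbd/parameters/usermode_helper   # Ganeti expects /bin/true
# Reload the module with the parameters set explicitly
# (only safe while no DRBD devices are configured on this node)
rmmod drbd
modprobe drbd usermode_helper=/bin/true minor_count=128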

Still something odd: modinfo drbd tells me the in-kernel drbd module is 8.3.11, but the utils are 8.4:

ii  drbd8-utils                        2:8.4.4-1~bpo70+1             amd64

I'll ask about this on the discussion group.

Original comment by regna...@gmail.com on 30 Jan 2014 at 4:43