The Ganeti error:
[17:48] 1 root@hp-vh-1 /root $ gnt-instance add -t drbd -o image+default -s 12g
-n hp-vh-1.zogi.local:hp-vh-2.zogi.local hp-sv-5.zogi.local
Tue Apr 1 17:48:42 2014 * creating instance disks...
Tue Apr 1 17:48:44 2014 adding instance hp-sv-5.zogi.local to cluster config
Tue Apr 1 17:48:44 2014 - INFO: Waiting for instance hp-sv-5.zogi.local to
sync disks
Tue Apr 1 17:48:55 2014 - INFO: Instance hp-sv-5.zogi.local's disks are in
sync
Failure: command execution error:
There are some degraded disks for this instance
and after that I get a resource 0: cs:Unconfigured in /proc/drbd
$ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
Original comment by daniel.c...@gmail.com
on 1 Apr 2014 at 3:50
Since I completely reset everything (vgremove, vgcreate, gnt-cluster init, etc.)
I'm not really sure if it has something to do with the downgrade. It's probably
a DRBD issue.
Original comment by daniel.c...@gmail.com
on 1 Apr 2014 at 3:56
It really looks like a DRBD problem. Did you try to set up a DRBD device
without Ganeti (see http://www.drbd.org/users-guide-8.3/)?
Also, it would be interesting to see what the DRBD output on the other node is
(hp-vh-2 in your case, I guess).
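For reference, a manual DRBD 8.3 bring-up along the lines of the users guide could be sketched roughly like this (this is only an illustration: the resource name r0, the device number, the backing LV name, and the port are assumptions, not values from this cluster):

```shell
# /etc/drbd.d/r0.res -- minimal test resource (all names/IPs are examples):
#   resource r0 {
#     protocol C;
#     on hp-vh-1 {
#       device    /dev/drbd9;
#       disk      /dev/xenvg/drbd-test;
#       address   192.168.200.1:7790;
#       meta-disk internal;
#     }
#     on hp-vh-2 {
#       device    /dev/drbd9;
#       disk      /dev/xenvg/drbd-test;
#       address   192.168.200.2:7790;
#       meta-disk internal;
#     }
#   }

# On both nodes: write the metadata and bring the resource up
drbdadm create-md r0
drbdadm up r0

# On one node only: force it primary to kick off the initial full sync
drbdadm -- --overwrite-data-of-peer primary r0

# Watch the sync progress
cat /proc/drbd
```

If this works outside Ganeti, the problem is more likely in how Ganeti configures the devices than in DRBD itself.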
Original comment by thoma...@google.com
on 2 Apr 2014 at 7:01
I haven't tried that (setting up DRBD without Ganeti) yet. I will do it later and
post the results. Here is the output on both nodes (dmesg) while I did a
gnt-instance add ... -n hp-vh-1:hp-vh-2 ...
hp-vh-1:
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564266] block drbd0: Starting worker
thread (from drbdsetup [11161])
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564391] block drbd0: disk( Diskless ->
Attaching )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564796] block drbd0: No usable activity
log found.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564799] block drbd0: Method to ensure
write ordering: flush
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564805] block drbd0: drbd_bm_resize
called with capacity == 25165824
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564889] block drbd0: resync bitmap:
bits=3145728 words=49152 pages=96
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564892] block drbd0: size = 12 GB
(12582912 KB)
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564895] block drbd0: Writing the whole
bitmap, size changed
Apr 2 10:07:50 hp-vh-1 kernel: [65190.565765] block drbd0: bitmap WRITE of 96
pages took 1 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.565792] block drbd0: 12 GB (3145728
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566300] block drbd0: bitmap READ of 96
pages took 0 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566374] block drbd0: recounting of set
bits took additional 0 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566376] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566380] block drbd0: disk( Attaching ->
Inconsistent )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566383] block drbd0: attached to UUIDs
0000000000000004:0000000000000000:0000000000000000:0000000000000000
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573624] block drbd0: conn( StandAlone ->
Unconnected )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573638] block drbd0: Starting receiver
thread (from drbd0_worker [11162])
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573674] block drbd0: receiver (re)started
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573679] block drbd0: conn( Unconnected
-> WFConnection )
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653391] block drbd0: role( Secondary ->
Primary ) disk( Inconsistent -> UpToDate )
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653517] block drbd0: Forced to consider
local data as UpToDate!
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653532] block drbd0: new current UUID
B3BAB3C2E93C1F63:0000000000000004:0000000000000000:0000000000000000
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100196] block drbd0: role( Primary ->
Secondary )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100218] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100278] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100524] block drbd0: conn( WFConnection
-> Disconnecting )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100546] block drbd0: Discarding network
configuration.
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100567] block drbd0: Connection closed
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100575] block drbd0: conn( Disconnecting
-> StandAlone )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100586] block drbd0: receiver terminated
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100588] block drbd0: Terminating
drbd0_receiver
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100600] block drbd0: disk( UpToDate ->
Failed )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100608] block drbd0: Sending state for
detaching disk failed
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100615] block drbd0: disk( Failed ->
Diskless )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100843] block drbd0: drbd_bm_resize
called with capacity == 0
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100872] block drbd0: worker terminated
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100874] block drbd0: Terminating
drbd0_worker
hp-vh-2:
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178332] block drbd0: Starting worker
thread (from drbdsetup [23555])
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178560] block drbd0: disk( Diskless ->
Attaching )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178887] block drbd0: No usable activity
log found.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178891] block drbd0: Method to ensure
write ordering: flush
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178898] block drbd0: drbd_bm_resize
called with capacity == 25165824
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179011] block drbd0: resync bitmap:
bits=3145728 words=49152 pages=96
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179014] block drbd0: size = 12 GB
(12582912 KB)
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179018] block drbd0: Writing the whole
bitmap, size changed
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179347] block drbd0: bitmap WRITE of 96
pages took 0 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179385] block drbd0: 12 GB (3145728
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180384] block drbd0: bitmap READ of 96
pages took 1 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180481] block drbd0: recounting of set
bits took additional 1 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180484] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180488] block drbd0: disk( Attaching ->
Inconsistent )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180490] block drbd0: attached to UUIDs
0000000000000004:0000000000000000:0000000000000000:0000000000000000
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186445] block drbd0: conn( StandAlone ->
Unconnected )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186458] block drbd0: Starting receiver
thread (from drbd0_worker [23556])
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186513] block drbd0: receiver (re)started
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186519] block drbd0: conn( Unconnected
-> WFConnection )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376377] block drbd0: conn( WFConnection
-> Disconnecting )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376420] block drbd0: Discarding network
configuration.
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376507] block drbd0: Connection closed
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376517] block drbd0: conn( Disconnecting
-> StandAlone )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376537] block drbd0: receiver terminated
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376539] block drbd0: Terminating
drbd0_receiver
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376568] block drbd0: disk( Inconsistent
-> Failed )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376581] block drbd0: Sending state for
detaching disk failed
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376593] block drbd0: disk( Failed ->
Diskless )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376784] block drbd0: drbd_bm_resize
called with capacity == 0
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376822] block drbd0: worker terminated
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376824] block drbd0: Terminating
drbd0_worker
The Ganeti error is again the same as mentioned above.
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 8:10
Okay, I did a manual DRBD installation now and noticed that one of the disks
had a corrupt GPT, which I have now removed. Apart from that everything seemed
normal and it's syncing now:
hp-vh-1:
[12:38] 0 root@hp-vh-1 /root $ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:304128 nr:0 dw:0 dr:304792 al:0 bm:18 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:23437799868
[>....................] sync'ed:  0.1% (22888476/22888772)M
finish: 4882:52:29 speed: 1,312 (1,280) K/sec
hp-vh-2:
[12:38] 0 root@hp-vh-2 /root $ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:304128 dw:304128 dr:0 al:0 bm:18 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:23437799868
[>....................] sync'ed:  0.1% (22888476/22888772)M
finish: 4882:52:29 speed: 1,312 (1,280) want: 250 K/sec
The speed is pretty bad. I'm using 2 bonded (LACP) Gigabit Ethernet ports.
Before (on the 8.4 setup) they completely saturated one link while transferring
data; now I'm at roughly 10 Mbit/s:
[12:37] 1 root@hp-vh-1 /root $ vnstat -i bond1 -l
Monitoring bond1... (press CTRL-C to stop)
rx: 52 kbit/s 90 p/s tx: 10.72 Mbit/s 920 p/s^C
But at least it works.
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 10:40
Back on Ganeti, the same thing happens. It doesn't look like a network problem.
Test with netserver/netperf:
[13:03] 0 root@hp-vh-1 /root $ netperf -H 192.168.200.2 -l 256M -w 256M -p
22125 &
[1] 1069
Packet rate control is not compiled in.
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.200.2
(192.168.200.2) port 0 AF_INET
[13:04] 1 root@hp-vh-1 /root $ vnstat -i bond1 -l
Monitoring bond1... (press CTRL-C to stop)
rx: 3.09 Mbit/s 5996 p/s tx: 962.30 Mbit/s 81373 p/s^C
bond1 / traffic statistics
rx | tx
--------------------------------------+------------------
bytes 21.28 MiB | 6.42 GiB
--------------------------------------+------------------
max 3.14 Mbit/s | 962.36 Mbit/s
average 3.06 Mbit/s | 945.34 Mbit/s
min 3.07 Mbit/s | 962.02 Mbit/s
--------------------------------------+------------------
packets 338002 | 4556538
--------------------------------------+------------------
max 6089 p/s | 81378 p/s
average 5929 p/s | 79939 p/s
min 5947 p/s | 81349 p/s
--------------------------------------+------------------
time 57 seconds
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 11:06
The 250 K/sec seems to be the DRBD syncer default value, so it looks like plain
DRBD without Ganeti works fine.
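For a manually configured DRBD 8.3 resource, the sync rate can be raised either at runtime or in the resource definition (a sketch; the device /dev/drbd1, the resource name r0, and the 110M figure are examples, not values from this setup):

```shell
# Temporary override for an already-configured device
# (lost again when the device is reconfigured)
drbdsetup /dev/drbd1 syncer -r 110M

# Or permanently, by adding to the resource definition:
#   syncer {
#     rate 110M;
#   }
# and then re-applying the configuration on both nodes:
drbdadm adjust r0
```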
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 12:08
Where does Ganeti store its DRBD metadata? Is it possible that old metadata
(from the first, reset cluster) is still there and causing these problems? As
far as I understand it, Ganeti uses external metadata for DRBD?
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 12:34
And some more information:
While gnt-instance add -t drbd -o image+default -s 12g hp-sv-5.zogi.local ...
# on node 1 (master)
$ tail -f /var/log/ganeti/*.log /var/log/messages > gnt-instance-add-node1.log
# on node 2
$ tail -f /var/log/ganeti/*.log /var/log/messages > gnt-instance-add-node2.log
Results in the attached files since it's a lot of data.
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 1:31
Attachments:
Now tested everything with:
- hardened-sources-3.2.54 (drbd 8.3.11) and drbd-utils 8.3.11: error described in this ticket
- hardened-sources-3.7.9 (drbd 8.3.13) and drbd-utils 8.3.13: error described in this ticket
- hardened-sources-3.11.7 (drbd 8.4.0) and drbd-utils 8.4.0: works (but burnin fails with the degraded disk error reported here [1])
[1]: https://groups.google.com/forum/#!topic/ganeti/rWgEUfwYOe8
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 5:06
Regarding sync speed:
Did you modify the DRBD related disk parameters?
Please run a `gnt-cluster info` and look at 'resync-rate' and/or
'dynamic-resync' and related parameters (see `man gnt-cluster` for more
details). Changing those parameters might only affect instances as soon as they
are restarted.
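For example (a hedged sketch: the value below is arbitrary, and the exact parameter names and units should be checked against `man gnt-cluster` for your Ganeti version):

```shell
# Inspect the currently configured DRBD disk parameters
gnt-cluster info | grep -A 10 drbd

# Raise the static resync rate (value in KiB/s here, i.e. ~100 MiB/s);
# existing instances only pick this up after a restart
gnt-cluster modify -D drbd:resync-rate=102400
```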
Ganeti configures DRBD devices with external metadata (flexible external
metadata for 8.4, IIRC). A separate metadata LV is created alongside the data
LV by Ganeti when the DRBD device is created. This metadata disk is
initialized anew as well, so there shouldn't be a problem with stale data. You
can, however, try to zero out the metadata disk generated by Ganeti and see if
this changes anything (the disks can be found under
/dev/xenvg/<instance_uuid>_meta.disk).
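Zeroing such a metadata LV could look like this (a sketch only: <instance_uuid> is a placeholder, and this destroys the DRBD metadata on that LV, so only do it while the corresponding DRBD device is down/removed):

```shell
# List the LVs Ganeti created, to find the _meta one
lvs xenvg

# Overwrite the metadata LV with zeros
dd if=/dev/zero of=/dev/xenvg/<instance_uuid>_meta.disk bs=1M
```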
Original comment by thoma...@google.com
on 4 Apr 2014 at 6:55
Hi,
no, I didn't modify them. The report above was from when I tried manually
creating DRBD resources (which worked, although they were slow due to the slow
default sync rates).
Regarding DRBD/metadata:
I found that out by reading the logs I posted above. It looks like the creation
and even the syncing of the disks go fine, but then this happens and Ganeti
deletes all disks and removes everything again:
==> /var/log/ganeti/master-daemon.log <==
2014-04-03 15:28:48,671: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Waiting for instance hp-sv-5.zogi.local to sync disks
2014-04-03 15:28:48,732: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 10 retries left
2014-04-03 15:28:48,748: ganeti-masterd pid=32729/ClientReq15 INFO Received job
poll request for 669
2014-04-03 15:28:49,782: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 9 retries left
2014-04-03 15:28:50,884: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 8 retries left
2014-04-03 15:28:51,933: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 7 retries left
2014-04-03 15:28:52,982: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 6 retries left
2014-04-03 15:28:54,032: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 5 retries left
2014-04-03 15:28:55,081: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 4 retries left
2014-04-03 15:28:56,130: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 3 retries left
2014-04-03 15:28:57,178: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 2 retries left
2014-04-03 15:28:58,227: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 1 retries left
2014-04-03 15:28:59,275: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Instance hp-sv-5.zogi.local's disks are in sync
2014-04-03 15:28:59,289: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Removing block devices for instance hp-sv-5.zogi.local
2014-04-03 15:28:59,427: ganeti-masterd pid=32729/ClientReq16 INFO Received job
poll request for 669
As I understand the logs above (especially lines 151-230), everything goes fine
until the part posted here.
Another question: Kernels >3.8 (DRBD 8.4) aren't supported at the moment
(Ganeti 2.10.1), are they?
Original comment by daniel.c...@gmail.com
on 4 Apr 2014 at 9:08
It looks to me as if the two DRBD devices fail to communicate with each other
properly.
Could you verify that the two nodes can communicate freely on this IP/port
combination:
192.168.200.1:11001 <-> 192.168.200.2:11001
Note that Ganeti uses the secondary network for DRBD connections, maybe you
used the primary network during your manual tests? Also, Ganeti uses port 11001
in your case, maybe you tested with another port.
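One way to check that combination (a sketch; nc flags vary between netcat variants, and the DRBD device for this port must be down so the port is actually free):

```shell
# On hp-vh-2: listen on the DRBD port on the secondary network
nc -l -p 11001

# On hp-vh-1: try to connect and send a test line;
# it should appear in the listener's output on hp-vh-2
echo ping | nc 192.168.200.2 11001

# If the connection fails, inspect firewall rules on both nodes, e.g.:
iptables -L -n
```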
Another thing you could try is to toggle the prealloc-wipe-disk parameter of
your cluster (gnt-cluster modify --prealloc-wipe-disk (True|False)). There is
some special logic in Ganeti which optimizes the preallocation, so I'd like to
make sure this does not interfere with your setup.
Original comment by thoma...@google.com
on 4 Apr 2014 at 9:33
It actually was a networking problem. Just for testing, I deactivated grsecurity
in the kernel and it's working now. I haven't yet figured out what exactly it
was. Sorry for the hassle. I'll update the ticket when I understand completely
what went wrong.
Original comment by daniel.c...@gmail.com
on 4 Apr 2014 at 12:55
Original issue reported on code.google.com by
daniel.c...@gmail.com
on 1 Apr 2014 at 3:14