The Ganeti error:
[17:48] 1 root@hp-vh-1 /root $ gnt-instance add -t drbd -o image+default -s 12g
-n hp-vh-1.zogi.local:hp-vh-2.zogi.local hp-sv-5.zogi.local
Tue Apr 1 17:48:42 2014 * creating instance disks...
Tue Apr 1 17:48:44 2014 adding instance hp-sv-5.zogi.local to cluster config
Tue Apr 1 17:48:44 2014 - INFO: Waiting for instance hp-sv-5.zogi.local to
sync disks
Tue Apr 1 17:48:55 2014 - INFO: Instance hp-sv-5.zogi.local's disks are in
sync
Failure: command execution error:
There are some degraded disks for this instance
and after that I get a resource 0: cs:Unconfigured in /proc/drbd
$ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
Original comment by daniel.c...@gmail.com
on 1 Apr 2014 at 3:50
Since I completely reset everything (vgremove, vgcreate, gnt-cluster init, etc.)
I'm not really sure if it has something to do with the downgrade. It's probably
a DRBD issue.
Original comment by daniel.c...@gmail.com
on 1 Apr 2014 at 3:56
It really looks like a DRBD problem. Did you try to set up a DRBD device
without Ganeti (see http://www.drbd.org/users-guide-8.3/)?
Also, it would be interesting to see what the DRBD output on the other node is
(hp-vh-2 in your case, I guess).
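For reference, a manual DRBD 8.3 bring-up along the lines of the users guide could be sketched roughly like this (this is only an illustration: the resource name r0, the device number, the backing LV name, and the port are assumptions, not values from this cluster):

```shell
# /etc/drbd.d/r0.res -- minimal test resource (all names/IPs are examples):
#   resource r0 {
#     protocol C;
#     on hp-vh-1 {
#       device    /dev/drbd9;
#       disk      /dev/xenvg/drbd-test;
#       address   192.168.200.1:7790;
#       meta-disk internal;
#     }
#     on hp-vh-2 {
#       device    /dev/drbd9;
#       disk      /dev/xenvg/drbd-test;
#       address   192.168.200.2:7790;
#       meta-disk internal;
#     }
#   }

# On both nodes: write the metadata and bring the resource up
drbdadm create-md r0
drbdadm up r0

# On one node only: force it primary to kick off the initial full sync
drbdadm -- --overwrite-data-of-peer primary r0

# Watch the sync progress
cat /proc/drbd
```

If this works outside Ganeti, the problem is more likely in how Ganeti configures the devices than in DRBD itself.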
Original comment by thoma...@google.com
on 2 Apr 2014 at 7:01
I haven't tried that (setting up DRBD without Ganeti) yet. I will do it later and
post the results. Here is the output on both nodes (dmesg) while I did a
gnt-instance add ... -n hp-vh-1:hp-vh-2 ...
hp-vh-1:
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564266] block drbd0: Starting worker
thread (from drbdsetup [11161])
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564391] block drbd0: disk( Diskless ->
Attaching )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564796] block drbd0: No usable activity
log found.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564799] block drbd0: Method to ensure
write ordering: flush
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564805] block drbd0: drbd_bm_resize
called with capacity == 25165824
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564889] block drbd0: resync bitmap:
bits=3145728 words=49152 pages=96
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564892] block drbd0: size = 12 GB
(12582912 KB)
Apr 2 10:07:50 hp-vh-1 kernel: [65190.564895] block drbd0: Writing the whole
bitmap, size changed
Apr 2 10:07:50 hp-vh-1 kernel: [65190.565765] block drbd0: bitmap WRITE of 96
pages took 1 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.565792] block drbd0: 12 GB (3145728
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566300] block drbd0: bitmap READ of 96
pages took 0 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566374] block drbd0: recounting of set
bits took additional 0 jiffies
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566376] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566380] block drbd0: disk( Attaching ->
Inconsistent )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.566383] block drbd0: attached to UUIDs
0000000000000004:0000000000000000:0000000000000000:0000000000000000
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573624] block drbd0: conn( StandAlone ->
Unconnected )
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573638] block drbd0: Starting receiver
thread (from drbd0_worker [11162])
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573674] block drbd0: receiver (re)started
Apr 2 10:07:50 hp-vh-1 kernel: [65190.573679] block drbd0: conn( Unconnected
-> WFConnection )
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653391] block drbd0: role( Secondary ->
Primary ) disk( Inconsistent -> UpToDate )
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653517] block drbd0: Forced to consider
local data as UpToDate!
Apr 2 10:07:51 hp-vh-1 kernel: [65190.653532] block drbd0: new current UUID
B3BAB3C2E93C1F63:0000000000000004:0000000000000000:0000000000000000
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100196] block drbd0: role( Primary ->
Secondary )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100218] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100278] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100524] block drbd0: conn( WFConnection
-> Disconnecting )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100546] block drbd0: Discarding network
configuration.
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100567] block drbd0: Connection closed
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100575] block drbd0: conn( Disconnecting
-> StandAlone )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100586] block drbd0: receiver terminated
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100588] block drbd0: Terminating
drbd0_receiver
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100600] block drbd0: disk( UpToDate ->
Failed )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100608] block drbd0: Sending state for
detaching disk failed
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100615] block drbd0: disk( Failed ->
Diskless )
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100843] block drbd0: drbd_bm_resize
called with capacity == 0
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100872] block drbd0: worker terminated
Apr 2 10:08:04 hp-vh-1 kernel: [65204.100874] block drbd0: Terminating
drbd0_worker
hp-vh-2:
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178332] block drbd0: Starting worker
thread (from drbdsetup [23555])
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178560] block drbd0: disk( Diskless ->
Attaching )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178887] block drbd0: No usable activity
log found.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178891] block drbd0: Method to ensure
write ordering: flush
Apr 2 10:07:52 hp-vh-2 kernel: [65194.178898] block drbd0: drbd_bm_resize
called with capacity == 25165824
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179011] block drbd0: resync bitmap:
bits=3145728 words=49152 pages=96
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179014] block drbd0: size = 12 GB
(12582912 KB)
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179018] block drbd0: Writing the whole
bitmap, size changed
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179347] block drbd0: bitmap WRITE of 96
pages took 0 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.179385] block drbd0: 12 GB (3145728
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180384] block drbd0: bitmap READ of 96
pages took 1 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180481] block drbd0: recounting of set
bits took additional 1 jiffies
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180484] block drbd0: 12 GB (3145664
bits) marked out-of-sync by on disk bit-map.
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180488] block drbd0: disk( Attaching ->
Inconsistent )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.180490] block drbd0: attached to UUIDs
0000000000000004:0000000000000000:0000000000000000:0000000000000000
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186445] block drbd0: conn( StandAlone ->
Unconnected )
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186458] block drbd0: Starting receiver
thread (from drbd0_worker [23556])
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186513] block drbd0: receiver (re)started
Apr 2 10:07:52 hp-vh-2 kernel: [65194.186519] block drbd0: conn( Unconnected
-> WFConnection )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376377] block drbd0: conn( WFConnection
-> Disconnecting )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376420] block drbd0: Discarding network
configuration.
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376507] block drbd0: Connection closed
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376517] block drbd0: conn( Disconnecting
-> StandAlone )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376537] block drbd0: receiver terminated
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376539] block drbd0: Terminating
drbd0_receiver
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376568] block drbd0: disk( Inconsistent
-> Failed )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376581] block drbd0: Sending state for
detaching disk failed
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376593] block drbd0: disk( Failed ->
Diskless )
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376784] block drbd0: drbd_bm_resize
called with capacity == 0
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376822] block drbd0: worker terminated
Apr 2 10:08:04 hp-vh-2 kernel: [65206.376824] block drbd0: Terminating
drbd0_worker
The Ganeti error is again the same as mentioned above.
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 8:10
Okay, I did a manual DRBD installation now and noticed that one of the disks
had a corrupt GPT, which I have now removed. Apart from that everything seemed
normal and it's syncing now:
hp-vh-1:
[12:38] 0 root@hp-vh-1 /root $ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:304128 nr:0 dw:0 dr:304792 al:0 bm:18 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:23437799868
[>....................] sync'ed:  0.1% (22888476/22888772)M
finish: 4882:52:29 speed: 1,312 (1,280) K/sec
hp-vh-2:
[12:38] 0 root@hp-vh-2 /root $ cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: C0678379139A852E8263B22
0: cs:Unconfigured
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:304128 dw:304128 dr:0 al:0 bm:18 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:23437799868
[>....................] sync'ed:  0.1% (22888476/22888772)M
finish: 4882:52:29 speed: 1,312 (1,280) want: 250 K/sec
The speed is pretty bad. I'm using 2 bonded (LACP) Gigabit Ethernet ports.
Before (on the 8.4 setup) they completely saturated one link while transferring
data; now I'm at roughly 10 Mbit/s:
[12:37] 1 root@hp-vh-1 /root $ vnstat -i bond1 -l
Monitoring bond1... (press CTRL-C to stop)
rx: 52 kbit/s 90 p/s tx: 10.72 Mbit/s 920 p/s^C
But at least it works.
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 10:40
Back on Ganeti, the same thing happens. It doesn't look like a network problem.
Test with netserver/netperf:
[13:03] 0 root@hp-vh-1 /root $ netperf -H 192.168.200.2 -l 256M -w 256M -p
22125 &
[1] 1069
Packet rate control is not compiled in.
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.200.2
(192.168.200.2) port 0 AF_INET
[13:04] 1 root@hp-vh-1 /root $ vnstat -i bond1 -l
Monitoring bond1... (press CTRL-C to stop)
rx: 3.09 Mbit/s 5996 p/s tx: 962.30 Mbit/s 81373 p/s^C
bond1 / traffic statistics
rx | tx
--------------------------------------+------------------
bytes 21.28 MiB | 6.42 GiB
--------------------------------------+------------------
max 3.14 Mbit/s | 962.36 Mbit/s
average 3.06 Mbit/s | 945.34 Mbit/s
min 3.07 Mbit/s | 962.02 Mbit/s
--------------------------------------+------------------
packets 338002 | 4556538
--------------------------------------+------------------
max 6089 p/s | 81378 p/s
average 5929 p/s | 79939 p/s
min 5947 p/s | 81349 p/s
--------------------------------------+------------------
time 57 seconds
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 11:06
The 250 K/sec seems to be the DRBD syncer default value, so it looks like plain
DRBD without Ganeti works fine.
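For a manually configured DRBD 8.3 resource, the sync rate can be raised either at runtime or in the resource definition (a sketch; the device /dev/drbd1, the resource name r0, and the 110M figure are examples, not values from this setup):

```shell
# Temporary override for an already-configured device
# (lost again when the device is reconfigured)
drbdsetup /dev/drbd1 syncer -r 110M

# Or permanently, by adding to the resource definition:
#   syncer {
#     rate 110M;
#   }
# and then re-applying the configuration on both nodes:
drbdadm adjust r0
```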
Original comment by daniel.c...@gmail.com
on 2 Apr 2014 at 12:08
Where does Ganeti store its DRBD metadata? Is it possible that old metadata
(from the first, reset cluster) is still there and causing these problems? As
far as I understand it, Ganeti uses external metadata for DRBD?
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 12:34
And some more information:
While gnt-instance add -t drbd -o image+default -s 12g hp-sv-5.zogi.local ...
# on node 1 (master)
$ tail -f /var/log/ganeti/*.log /var/log/messages > gnt-instance-add-node1.log
# on node 2
$ tail -f /var/log/ganeti/*.log /var/log/messages > gnt-instance-add-node2.log
Results in the attached files since it's a lot of data.
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 1:31
Attachments:
Now tested everything with:
- hardened-sources-3.2.54 (drbd 8.3.11) and drbd-utils 8.3.11: error described in this ticket
- hardened-sources-3.7.9 (drbd 8.3.13) and drbd-utils 8.3.13: error described in this ticket
- hardened-sources-3.11.7 (drbd 8.4.0) and drbd-utils 8.4.0: works (but burnin fails with the degraded disk error reported here [1])
[1]: https://groups.google.com/forum/#!topic/ganeti/rWgEUfwYOe8
Original comment by daniel.c...@gmail.com
on 3 Apr 2014 at 5:06
Regarding sync speed:
Did you modify the DRBD related disk parameters?
Please run a `gnt-cluster info` and look at 'resync-rate' and/or
'dynamic-resync' and related parameters (see `man gnt-cluster` for more
details). Changing those parameters might only affect instances as soon as they
are restarted.
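For example (a hedged sketch: the value below is arbitrary, and the exact parameter names and units should be checked against `man gnt-cluster` for your Ganeti version):

```shell
# Inspect the currently configured DRBD disk parameters
gnt-cluster info | grep -A 10 drbd

# Raise the static resync rate (value in KiB/s here, i.e. ~100 MiB/s);
# existing instances only pick this up after a restart
gnt-cluster modify -D drbd:resync-rate=102400
```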
Ganeti configures DRBD devices with external metadata (flexible external
metadata for 8.4, IIRC). A separate metadata LV is created alongside the data
LV by Ganeti when the DRBD device is created. This metadata disk is
initialized anew as well, so there shouldn't be a problem with stale data. You
can, however, try to zero out the metadata disk generated by Ganeti and see if
this changes anything (the disks can be found under
/dev/xenvg/<instance_uuid>_meta.disk).
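Zeroing such a metadata LV could look like this (a sketch only: <instance_uuid> is a placeholder, and this destroys the DRBD metadata on that LV, so only do it while the corresponding DRBD device is down/removed):

```shell
# List the LVs Ganeti created, to find the _meta one
lvs xenvg

# Overwrite the metadata LV with zeros
dd if=/dev/zero of=/dev/xenvg/<instance_uuid>_meta.disk bs=1M
```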
Original comment by thoma...@google.com
on 4 Apr 2014 at 6:55
Hi,
no, I didn't modify them. The report above was from when I tried manually
creating DRBD resources (which worked, although they were slow due to the slow
default sync rates).
Regarding DRBD/metadata:
I found that out by reading the logs I posted above. It looks like the creation
and even the syncing of the disks go fine, but then this happens and Ganeti
deletes all disks and removes everything again:
==> /var/log/ganeti/master-daemon.log <==
2014-04-03 15:28:48,671: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Waiting for instance hp-sv-5.zogi.local to sync disks
2014-04-03 15:28:48,732: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 10 retries left
2014-04-03 15:28:48,748: ganeti-masterd pid=32729/ClientReq15 INFO Received job
poll request for 669
2014-04-03 15:28:49,782: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 9 retries left
2014-04-03 15:28:50,884: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 8 retries left
2014-04-03 15:28:51,933: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 7 retries left
2014-04-03 15:28:52,982: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 6 retries left
2014-04-03 15:28:54,032: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 5 retries left
2014-04-03 15:28:55,081: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 4 retries left
2014-04-03 15:28:56,130: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 3 retries left
2014-04-03 15:28:57,178: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 2 retries left
2014-04-03 15:28:58,227: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Degraded disks found, 1 retries left
2014-04-03 15:28:59,275: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Instance hp-sv-5.zogi.local's disks are in sync
2014-04-03 15:28:59,289: ganeti-masterd pid=32729/Jq7/Job669/I_CREATE INFO
Removing block devices for instance hp-sv-5.zogi.local
2014-04-03 15:28:59,427: ganeti-masterd pid=32729/ClientReq16 INFO Received job
poll request for 669
As I understand the logs above (especially lines 151-230), everything goes fine
until the part posted here.
Another question: Kernels >3.8 (DRBD 8.4) aren't supported at the moment
(Ganeti 2.10.1), are they?
Original comment by daniel.c...@gmail.com
on 4 Apr 2014 at 9:08
It looks to me as if the two DRBD devices fail to communicate with each other
properly.
Could you verify that the two nodes can communicate freely on this IP/port
combination:
192.168.200.1:11001 <-> 192.168.200.2:11001
Note that Ganeti uses the secondary network for DRBD connections, maybe you
used the primary network during your manual tests? Also, Ganeti uses port 11001
in your case, maybe you tested with another port.
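One way to check that combination (a sketch; nc flags vary between netcat variants, and the DRBD device for this port must be down so the port is actually free):

```shell
# On hp-vh-2: listen on the DRBD port on the secondary network
nc -l -p 11001

# On hp-vh-1: try to connect and send a test line;
# it should appear in the listener's output on hp-vh-2
echo ping | nc 192.168.200.2 11001

# If the connection fails, inspect firewall rules on both nodes, e.g.:
iptables -L -n
```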
Another thing you could try is to toggle the prealloc-wipe-disk parameter of
your cluster (gnt-cluster modify --prealloc-wipe-disk (True|False)). There is
some special logic in Ganeti which optimizes the preallocation, so I'd like to
make sure this does not interfere with your setup.
Original comment by thoma...@google.com
on 4 Apr 2014 at 9:33
It actually was a networking problem. Just for testing, I deactivated grsecurity
in the kernel and it's working now. I haven't yet figured out what exactly it
was. Sorry for the hassle. I'll update the ticket when I understand completely
what went wrong.
Original comment by daniel.c...@gmail.com
on 4 Apr 2014 at 12:55
Original issue reported on code.google.com by
daniel.c...@gmail.com
on 1 Apr 2014 at 3:14