olopez32 / ganeti

Automatically exported from code.google.com/p/ganeti

Does not wait for DRBD connect and uses automatic split-brain resolving #790


GoogleCodeExporter commented 9 years ago
Ganeti's use of DRBD is completely unsafe.
It does not wait for the nodes to connect before promoting a node to primary, 
it uses automatic split-brain resolution and it is unaware of errors during 
operation.

Environment: Ganeti 2.10.1 on Debian Wheezy

| gnt-cluster (ganeti v2.10.1) 2.10.1

| Software version: 2.10.1
| Internode protocol: 2100000
| Configuration format: 2100000
| OS api version: 20
| Export interface: 0
| VCS version: (ganeti) version v2.10.1

| hspace (ganeti) version v2.10.1
| compiled with ghc 7.4
| running on linux x86_64

Ganeti's DRBD settings:
| disk {
|         on-io-error             detach;
| }
| net {
|         after-sb-0pri           discard-zero-changes;
|         after-sb-1pri           consensus;
| }

Log for a "failed" attach:
| block drbd14: Starting worker thread (from drbdsetup [16187])
| block drbd14: disk( Diskless -> Attaching ) 
| block drbd14: Found 6 transactions (7 active extents) in activity log.
| block drbd14: Method to ensure write ordering: flush
| block drbd14: drbd_bm_resize called with capacity == 33554432
| block drbd14: resync bitmap: bits=4194304 words=65536 pages=128
| block drbd14: size = 16 GB (16777216 KB)
| block drbd14: bitmap READ of 128 pages took 1 jiffies
| block drbd14: recounting of set bits took additional 0 jiffies
| block drbd14: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
| block drbd14: disk( Attaching -> UpToDate ) 
| block drbd14: attached to UUIDs E262D6FE3DC2D38C:0000000000000000:0001000000000004:0000000000000004
| block drbd14: conn( StandAlone -> Unconnected ) 
| block drbd14: Starting receiver thread (from drbd14_worker [16188])
| block drbd14: receiver (re)started
| block drbd14: conn( Unconnected -> WFConnection ) 
| block drbd14: role( Secondary -> Primary ) 
| block drbd14: new current UUID 7F0194F8DCDF78C5:E262D6FE3DC2D38C:0001000000000004:0000000000000004

Because of this we lost data in a way DRBD usually protects against.
This happened in a two-node DRBD cluster (NodeA and NodeB).

- Initial state: systems run on NodeA, with a DRBD state of Primary/Secondary, 
UpToDate/UpToDate.
- NodeA lost its storage, which means
  - Ganeti no longer reacted,
  - systems are still running,
  - DRBD runs with Primary/Secondary, Diskless/UpToDate, so the current data is on NodeB while the systems are still running on NodeA!
- NodeA gets rebooted and starts systems:
  - DRBD on NodeB detects disconnect and generates new generation.[1]
  - DRBD is activated on NodeA and promoted to primary before it got a chance to check which side is current, generating a new generation and leading to a split-brain.
  - Split-brain detection kicks in and drops the connection, as the situation does not allow for automatic resolving (both sides were written to).
  - Systems run on old(!) data.
  - DRBD device on NodeB is restarted for some reason. This clears the path for automatic split-brain resolution and overwrites the (newer) data on NodeB.

The main problems are the missing wait for the connection and the complete 
unawareness of an unclean DRBD device.

Ganeti does not wait for the connection. However, it can only find out the real 
state of the device if both nodes are connected; under no circumstances can it 
infer the DRBD state from its own state. The init script waits for the 
connection and is pretty clear that you have to check the state manually if it 
can't connect. So Ganeti _must_ wait for the connection unless explicitly 
forced by the admin.
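
As a rough illustration of the kind of check this requires, here is a minimal sketch (not Ganeti's actual code; the resource name, minor number, timeout and the use of /proc/drbd and drbdadm are assumptions) of waiting for the connection before promoting:

```python
# Minimal sketch, not Ganeti code: refuse to promote a DRBD minor to Primary
# until /proc/drbd reports it as Connected (or the operator forces it).
import re
import subprocess
import time


def wait_for_connection(minor, timeout=60):
    """Poll /proc/drbd until the given minor reports cs:Connected."""
    deadline = time.time() + timeout
    pattern = re.compile(r"^\s*%d:\s+cs:(\S+)" % minor)
    while time.time() < deadline:
        with open("/proc/drbd") as proc:
            for line in proc:
                match = pattern.match(line)
                if match:
                    if match.group(1) == "Connected":
                        return True
                    break  # minor found, but peer not reached yet
        time.sleep(1)
    return False


def promote(resource, minor, force=False):
    # Without an explicit force by the admin, bail out instead of creating a
    # new data generation on a possibly outdated side.
    if not force and not wait_for_connection(minor):
        raise RuntimeError("drbd%d did not connect; refusing to promote %s"
                           % (minor, resource))
    subprocess.check_call(["drbdadm", "primary", resource])
```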

Ganeti allows state changes even if the DRBD device is not clean, i.e. not with 
both sides in the UpToDate state. If it is not clean, every change, for example 
a deactivation, can trash data.
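
To make "clean" concrete, here is a sketch of such a guard (illustrative only, not Ganeti code; reading the state from /proc/drbd is an assumption):

```python
# Illustrative guard, not Ganeti code: a device is only "clean" if it is
# Connected and both disks are UpToDate; anything else should block
# deactivation and similar state changes unless the admin forces it.
import re


def is_clean(proc_drbd_line):
    cs = re.search(r"cs:(\S+)", proc_drbd_line)
    ds = re.search(r"ds:(\S+)/(\S+)", proc_drbd_line)
    return bool(cs and ds
                and cs.group(1) == "Connected"
                and ds.group(1) == "UpToDate"
                and ds.group(2) == "UpToDate")


# The situation from the log above: promoted while still waiting for the peer.
line = "14: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----"
assert not is_clean(line)  # deactivating now could trash data
```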

Ganeti sets automatic split-brain resolution. The DRBD documentation clearly 
states that automatic resolution _must_ only be activated if the circumstances 
allow for it.[2]
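
For reference, a simplified model of the after-sb-0pri policies set above (a sketch of the documented semantics, not DRBD's implementation; how the "changes" are tracked turns out to matter, see the discussion below):

```python
# Simplified model of DRBD's after-sb-0pri policies as documented, not the
# actual implementation.  "changes" stands for modifications made on a node
# since the split; how DRBD determines this is discussed later in this issue.
def resolve_split_brain_0pri(policy, changes_a, changes_b):
    if policy == "disconnect":
        return "stay disconnected, operator must resolve manually"
    if policy == "discard-zero-changes":
        if changes_a == 0 and changes_b == 0:
            return "resolve automatically (no changes on either side)"
        if changes_a == 0:
            return "discard NodeA, sync from NodeB"
        if changes_b == 0:
            return "discard NodeB, sync from NodeA"
        return "stay disconnected, operator must resolve manually"
    raise ValueError("unhandled policy: %s" % policy)


# In the incident above, NodeB held the newer data but appeared to have zero
# changes after its device was restarted, so its side was discarded:
print(resolve_split_brain_0pri("discard-zero-changes", changes_a=3, changes_b=0))
```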

[1]: http://www.drbd.org/users-guide/s-gi.html
[2]: http://www.drbd.org/users-guide/s-split-brain-notification-and-recovery.html

Original issue reported on code.google.com by bastian....@credativ.de on 3 Apr 2014 at 11:17

GoogleCodeExporter commented 9 years ago
Hi,

I'm not sure if I can follow you completely, please see my questions inline:

> Ganeti's use of DRBD is completely unsafe.
> It does not wait for the nodes to connect before promoting a node to
> primary, it uses automatic split-brain resolution and it is unaware of
> errors during operation.
>
> Environment: Ganeti 2.10.1 on Debian Wheezy
>
> [... version information, DRBD settings and kernel log quoted above ...]
>
> Because of this we lost data in a way DRBD usually protects against.
> This happened in a cluster with DRBD with two nodes (NodeA and NodeB).
>
> - Initial state: systems run on NodeA, with a DRBD state of
> Primary/Secondary, UpToDate/UpToDate.
> - NodeA lost its storage, which means
>

How did the node lose its storage? Was DRBD manually detached there?

>   - Ganeti no longer reacted,
>

What do you mean by "Ganeti no longer reacted"?
The usual procedure in such a case is to migrate the instance off to the
secondary node (so that it becomes the primary), offline the failed node and
recreate a new secondary disk on another node.
Once the original node is repaired, you have to make sure that no
instances are running on it anymore. Ideally, it should be
cleared/reinstalled completely before being re-integrated into the cluster.

>   - systems are still running,
>   - DRBD runs with Primary/Secondary, Diskless/UpToDate, so the current
> data is on NodeB while the systems are still running on NodeA!
> - NodeA gets rebooted and starts systems:
>

This means that the instance never ran on NodeB? I.e. it was never
migrated/failed over to NodeB?

>   - DRBD on NodeB detects disconnect and generates new generation.[1]
>

According to the link you provided, the secondary node does not generate a
new generation. See section 17.2.3.1 ... "On the secondary node, the GI
tuple remains unchanged." ...

>   - DRBD is activated on NodeA and promoted to primary before it got a
> chance to check which side is current, generating a new generation and
> leading to a split-brain.
>   - Split-brain detection kicks in and drops the connection, as the
> situation does not allow for automatic resolving (both sides were written
> to).
>

Why was there data written on the secondary disk? Did you mount the disk on
the secondary? If you didn't fail over the instance to the secondary, how
was data changed there?

>   - Systems run on old(!) data.
>

As I pointed out above, bringing back a node after it has been repaired
requires some precautions. In particular, you should isolate the
node and clean all instances and Ganeti configuration from it before you
bring it back into the cluster (or even better, re-image it). Otherwise it's
possible that instances which already run on another node in the cluster
are automatically started again, or other undesired side effects can happen
(usually, but not exclusively, triggered by the Ganeti watcher).

>   - DRBD device on NodeB is restarted for some reason. This clears the
> path for automatic split-brain resolution and overwrites the (newer) data on
> NodeB.
>
See my comment below.

>
> The main problems are the missing wait for the connection and the complete
> unawareness of an unclean DRBD device.
>

Ganeti does, in fact, wait for DRBD connections to be established. A DRBD
device is only considered "good" if both nodes are Connected/UpToDate.

>
> Ganeti does not wait for the connection. However, it can only find out the
> real state of the device if both nodes are connected; under no circumstances
> can it infer the DRBD state from its own state. The init script waits for the
> connection and is pretty clear that you have to check the state manually if
> it can't connect. So Ganeti _must_ wait for the connection unless explicitly
> forced by the admin.
>

I think you may have misunderstood a fundamental aspect of Ganeti here.
Ganeti has the notion of a primary node, which is the node on which the
instance (the VM) is running. The secondary node(s) are those nodes where
the instance data is mirrored to, and to which Ganeti can migrate/failover
instances.
This concept of primary/secondary is not dependent on DRBD's
primary/secondary concept. Ganeti will always generate DRBD devices based
on Ganeti's primary/secondary scheme for you, not the other way round. So
if you want to put the primary copy of the data on another node, you have
to go through Ganeti's failover/migrate procedure in order to not screw up
your cluster.

>
> Ganeti allows state changes even if the DRBD device is not clean, i.e. not
> with both sides in the UpToDate state. If it is not clean, every change, for
> example a deactivation, can trash data.
>

Could you elaborate on this?
Ganeti is required to make state changes in such circumstances. If, for
example, a node goes down, the DRBD devices on the "partner" nodes are
broken. Ganeti allows recreating another replica of the data on another
node, which necessarily requires a state change of the DRBD connections.

>
> Ganeti sets automatic split-brain resolution. The DRBD documentation
> clearly states that automatic resolution _must_ only be activated if the
> circumstances allow for it.[2]
>

Ganeti sets up the most conservative split-brain resolution available in
DRBD (as you pointed out above). Split-brains are only resolved
automatically if there are no modifications to the disk on one side of the
connection, so no data loss is possible (due to split-brain resolution). I
don't know exactly how data was lost in your case, but DRBD's split-brain
resolution wasn't the cause for it.

As pointed out above, I suspect that the problem was two-fold:
 * The repaired node came up without adequate preparation
 * The data was changed on the secondary node (which implies that somehow
somebody must have promoted the secondary node to the primary behind
Ganeti's back)

I'm happy to debug this further in order to work out how Ganeti could
protect better against data loss.

Original comment by thoma...@google.com on 4 Apr 2014 at 11:33

GoogleCodeExporter commented 9 years ago
The RAID controller firmware panicked.
No, DRBD automatically removes disks with I/O errors.

Ganeti was no longer able to reach the daemons on this zombie node.

Exactly. They only ran on NodeA.

At the beginning: "DRBD marks the start of a new data generation at each
of the following occurrences: _a resource in the primary role
disconnecting._".

Diskless is a weird state that is not separately documented there.

DRBD moves resources to Diskless in case of I/O errors; this is properly
documented.  No, it was never mounted anywhere else.

So a crashed node must not be allowed to boot under any circumstances?
How about a normal reboot?

Can you please specify the code location?  My quick test with Ganeti
2.10.1 shows that it sets DRBD resources to Primary even if they are
not Connected.  And a quick code review does not show any check in this
code path.

This is not accurate.  The write counter is volatile.  So if Ganeti ever
shuts down the DRBD device, which it does for fun, the counter is reset.

Bastian

Original comment by bastian....@credativ.de on 4 Apr 2014 at 12:30

GoogleCodeExporter commented 9 years ago
So I guess Ganeti was running on the same RAID as your instance data was 
located on, and that's why the node stopped responding.

Note that the DRBD doc states that the GI tuple only changes on the primary 
node (so NodeA in your case), not on the secondary. But I think this doesn't 
matter too much, as we agree that the GI tuples are different from each other 
once the DRBD device is reconnected.

So, here's what I've understood:
 * Disk on primary node fails, the instance keeps running purely on the disk on the secondary node.
 * (Inaccessible) data on primary node gets older than the data on the secondary node.
 * Primary node gets rebooted. The instance is (probably forcefully) stopped.
 * Ganeti does not know at this point that the secondary has newer data than the primary.
 * Primary node comes up. Ganeti re-establishes the DRBD device and connects it with the secondary node.
  - According to my understanding, this should have either copied the data from the secondary (because it uses `--after-sb-0pri discard-zero-changes`) or should have failed. Code in ganeti/storage/drbd.py (in _CheckNetworkConfig()) checks that a successful link is established. Only after this check succeeds is the disk promoted to primary.
  - It sounds like either Ganeti failed to wait for and check the successful link here (or the detection is wrong), or DRBD resolved the split-brain wrongly, as it _did_ discard changes (see the sketch after this list).
 * The instance is restarted on the primary node.
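
To illustrate the distinction that the following comments argue about (names and values are illustrative, not the actual code in ganeti/storage/drbd.py): checking that the network endpoints are configured is not the same as checking that the connection is established.

```python
# Illustrative only -- not the actual functions in ganeti/storage/drbd.py.
# The point is the difference between a configuration check and a
# connection check.
def network_is_configured(status):
    """Config check: local and remote endpoints are set on the device."""
    return bool(status.get("local_addr")) and bool(status.get("remote_addr"))


def peer_is_connected(status):
    """Connection check: the DRBD connection state is actually Connected."""
    return status.get("cstate") == "Connected"


# The state from the kernel log in the report: endpoints configured,
# peer never reached.
status = {"local_addr": "192.0.2.1:11014",
          "remote_addr": "192.0.2.2:11014",
          "cstate": "WFConnection"}
assert network_is_configured(status)   # a pure config check passes
assert not peer_is_connected(status)   # but the link is not established
```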

Does this sound like what happened? I'll try to reproduce this on a test 
cluster to see if I can get more details.

Original comment by thoma...@google.com on 7 Apr 2014 at 12:10

GoogleCodeExporter commented 9 years ago
Yes.

Exactly.

No. It
- creates the DRBD device and
- sets it to Primary.
It does _not_ wait for the connection.

My copy only checks the config, i.e. whether the remote/local IP and port are
set correctly.  I don't see a check for the connection.

Yes.

This too.  DRBD ran without a connection for some time.  The cluster
verify correctly showed this, so Ganeti is able to detect this problem.

To clean this error up, a colleague decided to shut down the instance.
This makes Ganeti deconfigure the DRBD device.  After this the write
counter is reset and the discard-zero-changes policy applies.
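
To spell that out, a small model of the sequence (the counters are illustrative; the "volatile write counter" is my reading of the behaviour observed here, not verified DRBD code):

```python
# Small model of the failure sequence described above; counters illustrative.
def auto_resolve(changes_a, changes_b):
    """Simplified discard-zero-changes decision for a two-node split-brain."""
    if changes_a and changes_b:
        return "refuse, keep nodes disconnected"
    if changes_b == 0:
        return "discard NodeB, sync from NodeA"
    return "discard NodeA, sync from NodeB"


# While both sides had been written to, auto-resolution refused to act:
print(auto_resolve(changes_a=3, changes_b=7))  # refuse, keep nodes disconnected

# After the instance shutdown deconfigured the device, NodeB's volatile
# counter read zero again, and its newer data was discarded:
print(auto_resolve(changes_a=3, changes_b=0))  # discard NodeB, sync from NodeA
```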

Yep.

Almost.  I think it is most easily reproduced by using iptables to reject any
connections to DRBD on the other node.

Bastian Blank

Original comment by bastian....@credativ.de on 7 Apr 2014 at 1:39