hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . Pull requests are not accepted.
https://gerrit.hyperledger.org/
Apache License 2.0

No disaster recovery possible in a security/privacy-enabled PBFT consensus group #880

Open RicHernandez2 opened 8 years ago

RicHernandez2 commented 8 years ago

Description: If you enable security and privacy, a username/password pair cannot be used more than once, and all peers have to log in with a username and password. As a result, after a failure a peer may never be able to rejoin the network.

Steps:

  1. Set up a group of 4 consensus peers and 1 CA server with security and privacy enabled (see the configuration sketch after these steps)
  2. Bring down one of the 4 servers
  3. Bring the server back up
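
For reference, step 1 typically looks something like the following with the 0.5/0.6-era images, driving core.yaml through environment variables. The exact variable names, secrets, and addresses below are illustrative assumptions, not a verified recipe.

```bash
# Illustrative sketch of one validating peer (repeat for vp1..vp3, each with its own
# one-time enrollId/enrollSecret pair). Values are placeholders, not a tested config.
docker run -d --name vp0 \
  -e CORE_PEER_ID=vp0 \
  -e CORE_SECURITY_ENABLED=true \
  -e CORE_SECURITY_PRIVACY=true \
  -e CORE_SECURITY_ENROLLID=test_vp0 \
  -e CORE_SECURITY_ENROLLSECRET=<one-time-secret> \
  -e CORE_PEER_VALIDATOR_CONSENSUS_PLUGIN=pbft \
  -e CORE_PEER_PKI_ECA_PADDR=ca:7054 \
  hyperledger/fabric-peer peer node start
```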

Results:

You can never bring a downed server back up in this configuration, due to a combination of security policy (a username:secret pair cannot be used more than once) and a strict naming convention (all servers must follow a vp0..vpN naming convention with no number skipped).

Expected Results:

Either the security rules or the naming-convention rules should be relaxed to allow for failure scenarios in which a server goes down and is able to rejoin by some method.

bmos299 commented 8 years ago

@RicHernandez2 this seems to be WAD (working as designed). However, let's see what the component owners offer as a way to gracefully have a peer re-enter the network after it has bounced.

corecode commented 8 years ago

shouldn't the VPs persist their certificates in some way, so that when they are brought up again, they already have a certificate?

tuand27613 commented 8 years ago

@adecaro , your thoughts ?

mastersingh24 commented 8 years ago

I do not think the CA is the problem here. If the peer has successfully logged in previously and connected to the network, you can definitely restart it (there are/were errors in the log that say the login fails, but the code actually checks for the existence of the enrollment material after that failure). We have been doing this in Bluemix for a while. Maybe something changed in the code?

We don't use PBFT so maybe that is the difference here? But with security enabled we can definitely restart. I do this all the time on my laptop as well (but no PBFT)

mastersingh24 commented 8 years ago

@RicHernandez2 - I might have misunderstood. In this scenario, are you assuming we lose the physical / virtual server hosting the peer and then try to bring it back up on another physical / virtual host? In that case, I can see the issue with the enrollId / enrollSecret only being valid for one-time use.

bmos299 commented 8 years ago

@mastersingh24, not sure if @RicHernandez2 is online. I do think he is trying to describe the problem where a) the peer is not recoverable when using security, because the enroll ID is not reusable and thus the peer cannot get back into the consensus flow, and b) the naming convention is a limiting factor (e.g. vp0, etc.).

RicHernandez2 commented 8 years ago

Right, imagine a catastrophic failure on one of the host machines: motherboard goes out, drive failure, etc. To recover from this disaster, one could just spin up a new machine, but it would lack the certificates; it would not be able to use the former login name, and it can't use the next login name due to the vp0..vpN-1 convention.

Let's say, in a set of 4 peers, vp2 fails and the hardware can't come back up. Could I run a command line on another server, naming the new one vp2 on the command line but giving it vp4's enrollId and secret?
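
For concreteness, the invocation being asked about would look roughly like this (same illustrative variables as in the earlier sketch; it is hypothetical, and the following comments explain why it is not supported today):

```bash
# Hypothetical replacement peer: keep the vp2 name, but spend a fresh, unused
# enrollment identity (written here as test_vp4) because vp2's secret was already consumed.
docker run -d --name vp2 \
  -e CORE_PEER_ID=vp2 \
  -e CORE_SECURITY_ENABLED=true \
  -e CORE_SECURITY_PRIVACY=true \
  -e CORE_SECURITY_ENROLLID=test_vp4 \
  -e CORE_SECURITY_ENROLLSECRET=<unused-secret> \
  hyperledger/fabric-peer peer node start
```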

corecode commented 8 years ago

That's called membership change and is planned for later. Is it reasonable to assume that an institution would keep their certificate safe, so that they could re-deploy their machine?

RicHernandez2 commented 8 years ago

Right on, but it's good to know what disaster recovery plan, if any, is available; we can then communicate that to our customers, who can mitigate the potential issue any number of ways. So shall I close this as a punt till the next version?

RicHernandez2 commented 8 years ago

Just talked to @bmos299. Is there any chance we could get a hack we can document for recovering from such a disaster, keeping security intact but allowing a disaster recovery scenario?

chetmurthy commented 8 years ago

@RicHernandez2 Rick, I don't follow why you're calling this "disaster recovery". Your test-scenario description is standard-issue "machine crashes; machine restarts; machine unable to rejoin", isn't it? This isn't disaster-recovery -- it's just plain old crash-fault-tolerance, no?

RicHernandez2 commented 8 years ago

Hey Chet, I think you're thinking of #925.

chetmurthy commented 8 years ago

Rick, I'm still confused. The description in the issue sure seems like an absence of crash-fault-tolerance, in a security-enabled setting, no?

RicHernandez2 commented 8 years ago

Well, #925 is more the crash-fault-tolerance bug; this one is more about disaster recovery, involving the replacement of one physical machine by another when the old box's drive contents are corrupt. How does one recover from such a disaster?

RicHernandez2 commented 8 years ago

But yes, this test had security and privacy enabled, which, combined with the restrictions on sign-ins and vp names, precludes replacing a machine that has failed catastrophically with its drive contents irretrievable.

corecode commented 8 years ago

Is "make a copy of the keys" not a sufficiently good advice?

RicHernandez2 commented 8 years ago

Let's make sure we call out the files and locations of the certs that need to be copied and saved. I'll try to test this recovery scenario.

adecaro commented 8 years ago

@RicHernandez2, I agree with what corecode has written so far. Restarting from the same state is currently allowed.

Regarding the location where the crypto material is stored, here it is:

'core.yaml#peer.fileSystemPath'/crypto/'role'/'name'/ks/raw

where role can be either client, peer or validator, and name is usually the enrollment ID.
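
To make that concrete, here is a minimal backup/restore sketch. It assumes the default peer.fileSystemPath of /var/hyperledger/production and a validator whose enrollment ID is test_vp2; both are assumptions for illustration, so substitute the values from your own core.yaml.

```bash
# Back up the validator's crypto material (path layout per the comment above).
tar czf vp2-crypto-backup.tar.gz \
  -C /var/hyperledger/production crypto/validator/test_vp2

# Restore on the replacement host before starting the peer with the same CORE_PEER_ID.
mkdir -p /var/hyperledger/production
tar xzf vp2-crypto-backup.tar.gz -C /var/hyperledger/production
```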

I guess the labels bug and sev3 can be removed, right?

corecode commented 8 years ago

Maybe we should have a command to back up the crypto material, and a command to inject it again later, together with behave tests ensuring that it works.

RicHernandez2 commented 8 years ago

I've been looking at ways I can test this other than shutting down one image and spinning up another, and I've come to realize that I can do a docker pause, then a docker unpause, to persist the state of the machine.

But in the wild, the chances that an administrator will encounter a situation where he can do a controlled pause seem less likely than someone tripping over a power cable or a power outage of some kind... so yes, a script to back up the certs would be a big help, and some documentation to help the admin of the system run the backup/rebuild scripts and rebuild with certs would be awesome.

@muralisrini mentioned that if you do a docker -v you can point to a location on the host that will persist beyond the lifespan of a docker image, and calling the container in the same manner would have it pick up where it left off as far as certs are concerned. But since you don't know which docker image is going to fail or when, you'd essentially have to define a naming convention to isolate the certs on whatever host backup directory you designate, if you chose to go that route.
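
A sketch of that volume-mount approach, with a per-peer host directory so each container's credentials stay isolated; the host path, mount target, and image name are assumptions for illustration.

```bash
# Keep the peer's fileSystemPath on the host so the crypto material outlives the container.
mkdir -p /data/fabric/vp2
docker run -d --name vp2 \
  -v /data/fabric/vp2:/var/hyperledger/production \
  -e CORE_PEER_ID=vp2 \
  -e CORE_SECURITY_ENABLED=true \
  -e CORE_SECURITY_PRIVACY=true \
  hyperledger/fabric-peer peer node start
```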

chetmurthy commented 8 years ago

Ric, if the story here is about copying the keys to a new machine, that is -strictly- forbidden in this part of distributed-systems land. Once a machine is partially corrupt, it is to be discarded -completely-.

Seriously, don't think about somehow recovering it from just the keys. That way lies a world of pain.

RicHernandez2 commented 8 years ago

OK, so no recovery through backup of certs. Do we still agree that recovering from this state is a problem? Should there not be some way by which a crashed container can be replaced by an admin, or does our poor admin just keep losing servers until consensus cannot be reached?

corecode commented 8 years ago

I think it should be perfectly fine to back up the certs, or to have some other way of re-acquiring an equivalent set of certificates. Why would we have to change the consensus member list (not implemented yet) when a machine lost a hard drive?


RicHernandez2 commented 8 years ago

I think Chet was saying that's not a viable solution, perhaps referring to the security vulnerabilities it comes with, or the load on the remaining VPs to catch up a fresh VP that has no data but the right certs.

We used to have a similar failover plan at MSN Groups: 1 read/write DB and 3 read-only DBs in replication mode. The process was that we'd disconnect or power off one of the replicas until it was so far behind it could never hope to catch up through reconnection. We'd back up one of the current DBs, restore it on the failover one, start up the DB, and let replication catch up the difference.

Worked for Groups.

ibmmark commented 8 years ago

@corecode ... Do you think this should remain open? I don't think it's really valid, or at a minimum it should be a new feature request.

corecode commented 8 years ago

@ibmmark I think this is something to at least put into the release notes. I think it should go something like this, but @binh needs to vet it: "You need to maintain a copy of the cryptographic credentials of a peer [fill in details of what to back up]. If you lose these credentials in a drive crash, the peer can no longer join the network. At the moment the network cannot accept new peers after creation, so without a backup of the security credentials you will permanently lose one peer."