Disaster Recovery in BAF

sivaramsk commented 3 years ago

Describe the bug As a part of DR testing, I was trying to recover the BAF deployments but could not recover BAF(fabric) from a kubernetes level failure.

To Reproduce Steps to reproduce the behavior:

Install fabric network using BAF with one orderer and two organizations(1 peer)
After successful deployment of BAF, change the kubernetes configuration for all the organizations in the network.yaml to another working kubernetes cluster.
Run the deploy-network.yaml

deploy-network.yaml fails with the below error


An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find or access './build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem'\nSearched in:\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

PLAY RECAP *** localhost : ok=309 changed=99 unreachable=0 failed=1 skipped=435 rescued=0 ignored=0


**Expected behavior**
The deployment of the network should go through. 

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Environment (please complete the following information):**
 - OS: From docker container
 - Version [e.g. 22]
 - Cloud environment: AKS
 - K8S Version: 1.17.11

**Additional context**
I tried the above test on the same kubernetes cluster the network was deployed on by deleting namespace of an organization (ca, ca-tools, peer, pvc, services..etc are deleted). When I run the deploy-network.yaml after deleting the namespace, I get the beow error

Getting secrets from Vault Server: http://vault-test.eastus.azurecontainer.io:8200 { "errors": [ "permission denied" ] } ERROR: unable to retrieve vault login token: { "errors": [ "permission denied" ] }

sivaramsk commented 3 years ago

I did some testing with DR scenarios trying to leverage the BAF's design philosophy of considering K8S cluster as ephemeral. During my earlier testing I noticed that any of the deployed components when it get deleted is automatically deployed by the flux. So, I deleted a namespace org2-net in my network and flux automatically triggered the deployment of all the components in the deleted namesapce, and it got the above mentioned "permission denied" error. I deleted the vault-auth token and re-run the deploy-network.yaml and the error got cleared and CA and peer deployments were successfully completed. Few questions on CA needing the vault-auth

If a CA server restarts for some reason, would it fail to read the vault-auth token just like above?
Is CA server is the only one that needs vault-auth token?

For the actual testing itself, once I cleared the vault-auth token and re-run deploy-network.yaml, CA server and peer got successfully deployed. But the deployment failed with the below error

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find or access './build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem'\nSearched in:\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

PLAY RECAP *******************************************************************************************************************************************************************************************************************
localhost                  : ok=309  changed=99   unreachable=0    failed=1    skipped=435  rescued=0    ignored=0

How can I overcome this error?
This peer is already part of the channel, what will happen when the joinchannel or anchorupdatepeer jobs gets triggered again by the deploy-network.yaml?
Again, am I in the right path to test a DR scenario?

jagpreetsinghsasan commented 3 years ago

I think the DR scenario is wrong. Because if you change the cluster itself, the volumes will be gone, and then setting up a network is like a new network itself. If we want to take into consideration the volumes as well, then we need to modify the BAF architecture itself to have their backups as well. Just by saving the certificates, wont recover the network. @sownak your views on this?

sivaramsk commented 3 years ago

Thanks for your views @jagpreetsinghsasan . Let me just add my experience in recovering a Fabric 1.4.2 network. I had a cluster down scenario when the orderers disk became full. I took a backup of orderer data following this document - https://docs.google.com/document/d/1dEhUpMcqOYfOngDvSlyXL6NIBuTKmhP5s0fr_--U_OA/edit# and I was able to recover the entire cluster although there were multiple problems due to different block heights - the discussion about the same is here - https://chat.hyperledger.org/channel/fabric-questions?msg=QScfyHbu98q9oJsxd.

Assume that I have the orderers data's backup and I think we can do the DR in two steps

Re-run the deploy-network.yaml with the same configuration
Restore the orderer data from a data backup I have so, orderers and peers can sync from the restored data.

Another option is to do a velero backup and restore, which I think would work, but have to test it.

jagpreetsinghsasan commented 3 years ago

Yes, the orderer backup should work. That's what I was referring to when I said about volume backups. If that is done, then we should be able to achieve DR in BAF as well. As you have mentioned that it is failing in this case, this looks like a valid bug. Apart from resolving this bug, we should also have a check for CA's, as we wont need them when we do the redeployment (currently we wont have new certificates in vault, but the CA and its CLI will still come up)

abevers commented 3 years ago

Hi @sivaramsk, just to let you know: I'll be taking up this bug.

As per the discussion above, I'll be tackling it as follows:

Deploy the network on a cluster
Upon deployment completion, scale down the cluster to 0 nodes
After some time, scale it back up and see if the cluster recreates the network properly.

Within the team, we suspect that the bug is caused due to some disks/folders not being mounted to volumes and thus being lost upon downscale of the cluster. I'll keep you up to date as we go, and please feel free to add any thoughts/experiences you have on this ticket!

sivaramsk commented 3 years ago

Awesome. On a related note, I was also testing BAF with velero backup and restore. I hit a bug with velero restore which got fixed recently - https://github.com/vmware-tanzu/velero/issues/3027. I would test velero approach as well, and will document it.

abevers commented 3 years ago

So far, with the current scope of the story, I have found the following things:

The CouchDB of the Peer is not mounted to a volume, and thus is lost upon downscale of the cluster. I have bound it to a PVC, which in turn is bound to the storageclass.
I have found issues with our Vault in the containers. It does not check if the Vault is reachable or sealed, but still continues starting up. It will try to fetch certificates/secrets from the Vault, can't reach the Vault but continues the deployment leading to missing/empty files. This may cause issues when you upscale the cluster again and the Vault is part of the same cluster. I have created a bug for this (#1179) and this will be picked up in a future sprint.

Together with the backups that you have been researching, we should have a complete DR scenario. I'll create a PR for this soon. If you have any additional thoughts @sivaramsk, please let us know!

sivaramsk commented 3 years ago

@abevers - quick question on your steps to reproduce. Can you explain what you mean by down-scaling the cluster? Scaling down the deployments to 0? If you take a real world scenario, say you lose access to a whole cluster where one of the org is deployed, user would follow the below steps to recover

Create a new K8S cluster.
Adjust the network.yaml to point to the new K8S cluster.
Re-run the deploy-network.yaml

Can you clarify whether your steps do the same or different? As for the vault issue you noticed, I think that is true even for a simple restart of CA pod, correct?

I still have not had the time to test the velero backup and restore yet, will update once I had that done.

abevers commented 3 years ago

Hi @sivaramsk, my scenario is not exactly the same. The scenario I have researched is a complete shutdown of the cluster (scaling deployments to 0) and then the same cluster restarting - without re-running the network.yaml. I think your suggestion of pointing to a new cluster for 1 or more organizations is also very valid. I'll discuss this with the team.

hyperledger / bevel

Disaster Recovery in BAF #1132