Closed sivaramsk closed 3 years ago
I did some testing with DR scenarios, trying to leverage BAF's design philosophy of treating the K8s cluster as ephemeral. During my earlier testing I noticed that when any of the deployed components gets deleted, it is automatically redeployed by Flux. So I deleted the namespace org2-net in my network, Flux automatically triggered the deployment of all the components in the deleted namespace, and it hit the above-mentioned "permission denied" error. I deleted the vault-auth token and re-ran deploy-network.yaml, after which the error cleared and the CA and peer deployments completed successfully. A few questions on the CA needing the vault-auth
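For reference, the manual recovery described above can be sketched roughly as follows. The secret name `vault-auth`, the namespace, and the playbook path are assumptions based on a typical BAF setup, so adjust them to your deployment:

```shell
# Hypothetical sketch of the manual recovery described above.
# Assumes the stale Vault auth token lives in a secret named "vault-auth"
# in the affected namespace (adjust names/paths to your setup).

# 1. Remove the stale vault-auth token so it gets recreated on the next run
kubectl delete secret vault-auth -n org2-net

# 2. Re-run the deployment playbook with the existing network configuration
ansible-playbook platforms/hyperledger-fabric/configuration/deploy-network.yaml \
  -e "@./build/network.yaml"
```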
For the actual testing itself, once I cleared the vault-auth token and re-ran deploy-network.yaml, the CA server and peer were successfully deployed. But the deployment then failed with the below error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find or access './build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem'\nSearched in:\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
PLAY RECAP *******************************************************************************************************************************************************************************************************************
localhost : ok=309 changed=99 unreachable=0 failed=1 skipped=435 rescued=0 ignored=0
I think the DR scenario is wrong, because if you change the cluster itself, the volumes will be gone, and then setting up a network is like setting up a new network. If we want to take the volumes into consideration as well, then we need to modify the BAF architecture itself to back them up too. Just saving the certificates won't recover the network. @sownak, your views on this?
Thanks for your views @jagpreetsinghsasan. Let me just add my experience in recovering a Fabric 1.4.2 network. I had a cluster-down scenario when the orderers' disks became full. I took a backup of the orderer data following this document - https://docs.google.com/document/d/1dEhUpMcqOYfOngDvSlyXL6NIBuTKmhP5s0fr_--U_OA/edit# and I was able to recover the entire cluster, although there were multiple problems due to different block heights - the discussion about the same is here - https://chat.hyperledger.org/channel/fabric-questions?msg=QScfyHbu98q9oJsxd.
Assume that I have a backup of the orderers' data; I think we can then do the DR in two steps
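Assuming the orderer keeps its ledger under Fabric's default data directory (`/var/hyperledger/production`), the backup and restore steps might look like the following sketch; the pod name and namespace are illustrative, not taken from an actual BAF deployment:

```shell
# Hypothetical sketch: back up and restore the orderer's ledger data.
# Pod name, namespace, and data path are assumptions (Fabric's default
# data directory is /var/hyperledger/production); adjust to your deployment.

# 1. Copy the orderer's data directory out of the running pod
kubectl cp -n orderer-net orderer1-0:/var/hyperledger/production \
  ./backup/orderer1-data

# 2. After the new cluster/pod is up, copy the data back in before the
#    orderer starts serving traffic
kubectl cp -n orderer-net ./backup/orderer1-data \
  orderer1-0:/var/hyperledger/production
```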
Another option is to do a Velero backup and restore, which I think would work, but I still have to test it.
Yes, the orderer backup should work; that's what I was referring to when I mentioned volume backups. If that is done, then we should be able to achieve DR in BAF as well. As you have mentioned that it is failing in this case, this looks like a valid bug. Apart from resolving this bug, we should also have a check for the CAs, as we won't need them when we do the redeployment (currently we won't have new certificates in Vault, but the CA and its CLI will still come up).
Hi @sivaramsk, just to let you know: I'll be taking up this bug.
As per the discussion above, I'll be tackling it as follows:
Within the team, we suspect that the bug is caused due to some disks/folders not being mounted to volumes and thus being lost upon downscale of the cluster. I'll keep you up to date as we go, and please feel free to add any thoughts/experiences you have on this ticket!
Awesome. On a related note, I was also testing BAF with velero backup and restore. I hit a bug with velero restore which got fixed recently - https://github.com/vmware-tanzu/velero/issues/3027. I would test velero approach as well, and will document it.
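For anyone testing the Velero approach, a minimal backup/restore flow for this scenario might look like the sketch below; the backup name and namespaces are examples, and Velero must already be installed with a backup storage location configured:

```shell
# Illustrative Velero backup/restore flow (names and namespaces are examples).

# Back up the BAF namespaces, including persistent volume snapshots
velero backup create baf-dr-backup \
  --include-namespaces org1-net,org2-net \
  --snapshot-volumes

# Once the cluster (or a new cluster with Velero pointed at the same
# backup storage location) is available, restore from that backup
velero restore create baf-dr-restore --from-backup baf-dr-backup
```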
So far, with the current scope of the story, I have found the following things:
Together with the backups that you have been researching, we should have a complete DR scenario. I'll create a PR for this soon. If you have any additional thoughts @sivaramsk, please let us know!
@abevers - quick question on your steps to reproduce. Can you explain what you mean by down-scaling the cluster? Scaling down the deployments to 0? In a real-world scenario, say you lose access to a whole cluster where one of the orgs is deployed, a user would follow the below steps to recover
Can you clarify whether your steps do the same or something different? As for the Vault issue you noticed, I think that is true even for a simple restart of the CA pod, correct?
I still have not had the time to test the Velero backup and restore yet; I will update once I have done that.
Hi @sivaramsk, my scenario is not exactly the same. The scenario I have researched is a complete shutdown of the cluster (scaling deployments to 0) and then the same cluster restarting, without re-running the network.yaml. I think your suggestion of pointing to a new cluster for one or more organizations is also very valid. I'll discuss this with the team.
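The shutdown/restart scenario described here (scaling deployments to 0 and back up, without re-running network.yaml) can be approximated with kubectl alone; the namespace and replica counts below are illustrative, and StatefulSets are included on the assumption that some components run as StatefulSets:

```shell
# Sketch of the shutdown/restart test (namespace and replica counts are examples).

# Scale everything in the org's namespace down to zero
kubectl scale deployment --all --replicas=0 -n org1-net
kubectl scale statefulset --all --replicas=0 -n org1-net

# ...later, bring the components back up without re-running network.yaml
kubectl scale deployment --all --replicas=1 -n org1-net
kubectl scale statefulset --all --replicas=1 -n org1-net
```

Note that if Flux is reconciling these workloads, it may revert the scaled-down replica counts on its next sync, so the reconciler may need to be suspended for the duration of the test.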
Describe the bug As a part of DR testing, I was trying to recover the BAF deployments but could not recover BAF (Fabric) from a Kubernetes-level failure.
To Reproduce Steps to reproduce the behavior:
PLAY RECAP *** localhost : ok=309 changed=99 unreachable=0 failed=1 skipped=435 rescued=0 ignored=0
Getting secrets from Vault Server: http://vault-test.eastus.azurecontainer.io:8200 { "errors": [ "permission denied" ] } ERROR: unable to retrieve vault login token: { "errors": [ "permission denied" ] }