hyperledger / bevel

An automation framework for rapidly and consistently deploying production-ready DLT platforms
https://hyperledger-bevel.readthedocs.io/en/latest/
Apache License 2.0

CA does not come up when redeploying an org as part of a DR testing #1117

Closed sivaramsk closed 3 years ago

sivaramsk commented 3 years ago

Describe the bug: As part of DR testing, I tried to simulate a lost organization and then deploy the organization again, but the CA did not come up.

To Reproduce Steps to reproduce the behavior:

  1. Deploy a fabric network with an orderer org, org1-net (1 peer), org2-net (1 peer).
  2. Delete the namespace org2-net.
  3. Re-run the deploy-network.yaml playbook so the org2-net gets re-deployed on the network again.
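
Steps 2 and 3 above can be sketched as a small shell helper. This is only an illustration: the playbook and network-file paths are passed in as arguments because the thread does not state them, and `redeploy_org` and the `ANSIBLE_PLAYBOOK_BIN` override are hypothetical names introduced here, not part of the framework.

```shell
#!/bin/sh
# Sketch of reproduction steps 2 and 3. All paths are supplied by the caller;
# nothing here is taken from the framework's own scripts.

redeploy_org() {
  ns="$1"            # namespace to drop, e.g. org2-net
  playbook="$2"      # path to deploy-network.yaml in your checkout
  network_file="$3"  # path to your network configuration file

  if [ -z "$ns" ] || [ -z "$playbook" ] || [ -z "$network_file" ]; then
    echo "usage: redeploy_org <namespace> <playbook> <network-file>" >&2
    return 2
  fi

  # Step 2: delete the organization's namespace (simulates losing the org).
  kubectl delete namespace "$ns" || return 1

  # Step 3: re-run the playbook; "-e @file" loads extra variables from a file.
  # ANSIBLE_PLAYBOOK_BIN is a hypothetical override, handy for dry runs.
  "${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}" "$playbook" -e "@${network_file}"
}
```

A call would look like `redeploy_org org2-net path/to/deploy-network.yaml path/to/network.yaml`, with both paths replaced by the real locations in your checkout.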

Expected behavior: The CA and the peer node of org2 are expected to come up and join the network.

Screenshots: The CA pod fails to come up with the error below.

siva@MacBook in ~/projects/go/src/blockchain-automation-framework on fabric220 via ⬒ v14.14.0
🕙 [19-Oct-2020 09:51:38 PM ] ❯ k logs -n org2-net ca-d59bdcf45-trtcn ca-certs-init
Getting secrets from Vault Server: http://vault-test.eastus.azurecontainer.io:8200
{ "errors": [ "permission denied" ] }
ERROR: unable to retrieve vault login token: {
  "errors": [
    "permission denied"
  ]
}

Environment (please complete the following information):

Additional context: This test follows from my conversation on Rocket.Chat: https://chat.hyperledger.org/channel/blockchain-automation-framework?msg=epvnAJNwXR7YNEqgv

sownak commented 3 years ago

@sivaramsk are all the resources, such as vault-reviewer and the namespace, recreated? Most likely vault-auth needs to be recreated, which is not happening because the items already exist in Vault.

sivaramsk commented 3 years ago

I tested whether the secrets are getting created. I do see the Vault-related secrets getting created after I run deploy-network.yaml:

🕙 [23-Oct-2020 08:06:49 AM ] ❯ kg secrets -n org2-net
NAME                         TYPE                                  DATA   AGE
default-token-4qk6q          kubernetes.io/service-account-token   3      6m6s
regcred                      kubernetes.io/dockerconfigjson        1      3m4s
vault-auth-token-spcmh       kubernetes.io/service-account-token   3      6m6s
vault-reviewer-token-rqwg6   kubernetes.io/service-account-token   3      6m5s

sivaramsk commented 3 years ago

I did another test:

  1. Deploy Fabric on a k8s cluster (1st-cluster).
  2. Create a new k8s cluster (2nd-cluster).
  3. Re-run deploy-network.yaml against the new k8s cluster.

It is hard for me to explain the problem, but I will try. The secrets get created as I described above, and the ca-server has the same issue as before.

siva@MacBook in ~/projects/go/src/andromeda-2 on master via 💠 default took 6s
🕙 [23-Oct-2020 11:22:17 AM ] ❯ kg secrets -n org1-net
NAME                                                   TYPE                                  DATA   AGE
azure-storage-account-f0f56c83d25474823ae035c-secret   Opaque                                2      4m9s
azure-storage-account-f1eb3c9678bdb40448e4631-secret   Opaque                                2      4m3s
default-token-b5jnd                                    kubernetes.io/service-account-token   3      4m39s
regcred                                                kubernetes.io/dockerconfigjson        1      36s
vault-auth-token-rn4zt                                 kubernetes.io/service-account-token   3      4m39s
vault-reviewer-token-tpnms                             kubernetes.io/service-account-token   3      4m39s

Every pod that was running in the 1st-cluster gets started in the 2nd-cluster at the same time, at some point during deploy-network, each throwing an error. It is very hard to explain what I see there.

[Screenshot 2020-10-23 at 11 18 21 AM]

sownak commented 3 years ago

I think we will have to delete the auth-path from Vault before running deploy-network again. As the following shows, the REVIEWER_TOKEN is regenerated along with the secret, but the command that writes it to Vault is not run if the auth-path already exists. [image]
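
The mismatch described above (a regenerated reviewer token in Kubernetes, but a stale one in Vault) could in principle also be addressed by rewriting the auth config with the new token. The following is a rough sketch of that idea, not the framework's actual task: `reviewer_token` and `refresh_reviewer_jwt` are hypothetical helper names, the secret-name pattern is assumed from the `kubectl` listing earlier in the thread, and the host and CA certificate values must come from your own cluster. `kubernetes_host` and `kubernetes_ca_cert` are included because the Kubernetes auth config is written as a whole.

```shell
#!/bin/sh
# Sketch only: refreshes the Kubernetes auth config with the regenerated JWT
# even when the auth path already exists. Helper names are hypothetical.

# Print the reviewer JWT stored in the namespace's service-account token secret.
# Assumes the secret name contains "vault-reviewer-token".
reviewer_token() {
  ns="$1"
  secret=$(kubectl get secrets -n "$ns" -o name | grep 'vault-reviewer-token' | head -n 1)
  [ -n "$secret" ] || return 1
  kubectl get "$secret" -n "$ns" -o jsonpath='{.data.token}' | base64 --decode
}

# Write the current reviewer JWT to auth/<auth_path>/config in Vault.
refresh_reviewer_jwt() {
  auth_path="$1"     # Vault auth mount path for the organization
  ns="$2"            # Kubernetes namespace, e.g. org2-net
  k8s_host="$3"      # Kubernetes API server URL (placeholder)
  ca_cert_file="$4"  # file containing the cluster CA certificate (placeholder)

  REVIEWER_TOKEN=$(reviewer_token "$ns") || return 1
  vault write "auth/${auth_path}/config" \
    token_reviewer_jwt="$REVIEWER_TOKEN" \
    kubernetes_host="$k8s_host" \
    kubernetes_ca_cert="@${ca_cert_file}"
}
```

Whether the playbook should do this instead of skipping the step is a design question for the maintainers; the sketch only shows what the refresh would involve.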

sivaramsk commented 3 years ago

@sownak - I don't understand the path below:

"vault write auth/{{ auth_path }}/config token_reviewer_jwt="$REVIEWER_TOKEN""

Where in Vault is auth/? The "vault secrets list" command gives me the output below, and I don't see an auth entry in that list:

🕙 [23-Oct-2020 03:38:43 PM ] ❯ vault secrets list
Path          Type         Accessor              Description
----          ----         --------              -----------
cubbyhole/    cubbyhole    cubbyhole_11a9cedc    per-token private secret storage
identity/     identity     identity_1fcdca0b     identity store
secret/       kv           kv_d1ec59c3           n/a
sys/          system       system_cb16fbf3       system endpoints used for control, policy and debugging

Can you clarify how to delete this token?
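
For context on the question above: auth methods are mounted under `auth/` and are not secrets engines, which is why `vault secrets list` does not show them; `vault auth list` does. A minimal sketch, assuming that "deleting the auth-path" means disabling the auth method mounted at that path (the function names are illustrative only):

```shell
#!/bin/sh
# Sketch only: inspect and remove a Vault auth mount.
# Requires VAULT_ADDR and a token with sufficient privileges in the environment.

# Auth methods are listed separately from secrets engines.
list_auth_paths() {
  vault auth list
}

# Disable the auth method mounted at the given path.
remove_auth_path() {
  auth_path="$1"
  if [ -z "$auth_path" ]; then
    echo "usage: remove_auth_path <auth-path>" >&2
    return 2
  fi
  vault auth disable "$auth_path"
}
```

Disabling an auth method also removes its configuration and roles, so anything that logs in through that path stops working until the playbook recreates it.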

sivaramsk commented 3 years ago

@sownak - I can confirm that once I deleted the auth-path in Vault, the CA pods came up, and the orderers and peers came up as well.

A few observations:

  1. As soon as Flux was deployed in the cluster, it deployed all of the pods; I think that explains the screenshot I pasted above. Does Flux automatically recreate a deployment if it is missing from the cluster?
  2. The StorageClass for the organizations was not created automatically, so I created it manually to let the script proceed.
  3. I still could not make it work in the same cluster. For example, if I delete the namespace "org2-net", Flux redeploys those pods after some time, but the pods still fail with the authentication error shown above. I don't understand how to debug this issue.
  4. deploy-network.yaml still failed with the error below, although the CAs, orderers and peers are all up.
    
    TASK [create/crypto/peer : Copy msp cacerts from auto-generated path to given path] ******************************************************************************************************************************************
    An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
    fatal: [localhost]: FAILED! => {"changed": false, "msg": "Could not find or access './build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem'\nSearched in:\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/hyperledger-fabric/configuration/roles/create/crypto/peer/tasks/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/files/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem\n\t/home/blockchain-automation-framework/platforms/shared/configuration/../../hyperledger-fabric/configuration/./build/crypto-config/peerOrganizations/org1-net/peers/peer0.org1-net/msp/cacerts/ca-org1-net-7054.pem on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}

    PLAY RECAP ***
    localhost : ok=309 changed=99 unreachable=0 failed=1 skipped=435 rescued=0 ignored=0

I am not sure whether I am testing the right thing here. What I am trying to confirm is this: if I lose the Kubernetes cluster that runs one or two organizations, or everything, how do I recover a BAF cluster?

With the method we currently use to deploy Fabric on Kubernetes, I was able to successfully recover the network using a Velero backup with a bit of manual wrangling.

sivaramsk commented 3 years ago

Closing this ticket as the CA has actually come up. I am going to open a separate ticket to discuss BAF DR.