Description

Create section in EKS docs on how to clone an instance

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

These steps currently don't work. Instead of a copy of the snapshotted -> new volume data showing up in the new CHT instance, there is a clean install of the CHT instead.

@henokgetachew suggests:

The volume you created is in the wrong availability zone. For the development EKS cluster - use eu-west-2b and For the prod EKS cluster - use eu-west-2a . You are trying to attach the volume in eu-west-2a to the dev cluster. That won't work. Can you change that and test?

So I'll delete the volume (and snapshot if it's a dev instance), update the steps in this PR and try again!

@henokgetachew - can you take another look at what I might be doing wrong? I deleted the volume I created before and then created a new one, being sure to specify the AZ:

$ aws ec2 create-volume --region eu-west-2 --snapshot-id snap-0d0840a657afe84e7 --availability-zone eu-west-2b

Here's the description from $ aws ec2 describe-volumes --region eu-west-2 --volume-id vol-0fee7609aa7757984 | jq:

{
  "Volumes": [
    {
      "Attachments": [],
      "AvailabilityZone": "eu-west-2b",
      "CreateTime": "2024-08-28T19:42:35.650000+00:00",
      "Encrypted": false,
      "Size": 900,
      "SnapshotId": "snap-0d0840a657afe84e7",
      "State": "available",
      "VolumeId": "vol-0fee7609aa7757984",
      "Iops": 2700,
      "Tags": [
        {
          "Key": "owner",
          "Value": "mrjones"
        },
        {
          "Key": "kubernetes.io/cluster/dev-cht-eks",
          "Value": "owned"
        },
        {
          "Key": "KubernetesCluster",
          "Value": "dev-cht-eks"
        },
        {
          "Key": "use",
          "Value": "allies-hosting-tco-testing"
        },
        {
          "Key": "snapshot-from",
          "Value": "moh-zanzibar-Aug-26-2024"
        }
      ],
      "VolumeType": "gp2",
      "MultiAttachEnabled": false
    }
  ]
}

I set the volume ID in my values file:

# tail -n4 mrjones.yml
remote:
  existingEBS: "true"
  existingEBSVolumeID: "vol-0fee7609aa7757984"
  existingEBSVolumeSize: "900Gi"

And then run deploy:

$ ./cht-deploy -f mrjones.yml     

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Wed Aug 28 13:17:27 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.

However I get a 503 in the browser, despite all pods being up:

$ ./troubleshooting/list-all-resources mrjones-dev
NAME                                           READY   STATUS    RESTARTS   AGE
pod/cht-api-8554fc5b4c-sgqgt                   1/1     Running   0          20m
pod/cht-couchdb-f86c9cf47-jcsxl                1/1     Running   0          20m
pod/cht-haproxy-756f896d6d-s54ns               1/1     Running   0          20m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-wtzsx   1/1     Running   0          20m
pod/cht-sentinel-7d8987d4db-m8tr2              1/1     Running   0          20m
pod/upgrade-service-67f48c5fc4-fs7fx           1/1     Running   0          20m

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/api               ClusterIP   172.20.25.243    <none>        5988/TCP                     20m
service/couchdb           ClusterIP   172.20.65.81     <none>        5984/TCP,4369/TCP,9100/TCP   20m
service/haproxy           ClusterIP   172.20.249.24    <none>        5984/TCP                     20m
service/healthcheck       ClusterIP   172.20.176.77    <none>        5555/TCP                     20m
service/upgrade-service   ClusterIP   172.20.125.132   <none>        5008/TCP                     20m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cht-api                   1/1     1            1           20m
deployment.apps/cht-couchdb               1/1     1            1           20m
deployment.apps/cht-haproxy               1/1     1            1           20m
deployment.apps/cht-haproxy-healthcheck   1/1     1            1           20m
deployment.apps/cht-sentinel              1/1     1            1           20m
deployment.apps/upgrade-service           1/1     1            1           20m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/cht-api-8554fc5b4c                   1         1         1       20m
replicaset.apps/cht-couchdb-f86c9cf47                1         1         1       20m
replicaset.apps/cht-haproxy-756f896d6d               1         1         1       20m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4   1         1         1       20m
replicaset.apps/cht-sentinel-7d8987d4db              1         1         1       20m
replicaset.apps/upgrade-service-67f48c5fc4           1         1         1       20m

Here's my values file - password and secret changed to protect the inocent:

project_name: mrjones-dev 
namespace: "mrjones-dev"
chtversion: 4.5.2
#cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
couchdb:
  password: hunter2
  secret: Correct-Horse-Battery-Staple 
  user: medic
  uuid: 1c9b420e-1847-49e9-9cdf-5350b32f6c85
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 20Gi
clusteredCouch:
  noOfCouchDBNodes: 1
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  host: "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"

environment: "remote"  # "local" or "remote"

remote:
  existingEBS: "true"
  existingEBSVolumeID: "vol-0fee7609aa7757984"
  existingEBSVolumeSize: "900Gi"

@mrjones-plip Okay I have finally figured out why this didn't work for you. values.yaml.

Your settings file:

You basically missed the main flag that tells helm to look for pre-existing volumes within the next sections. It should be configured like this:

I have tested this one and it has worked for me.

{
    "Volumes": [
        {
            "Attachments": [
                {
                    "AttachTime": "2024-08-30T16:49:08+00:00",
                    "Device": "/dev/xvdbh",
                    "InstanceId": "i-0ad3b6f9c8c82a5c9",
                    "State": "attached",
                    "VolumeId": "vol-0fee7609aa7757984",
                    "DeleteOnTermination": false
                }
            ],
            "AvailabilityZone": "eu-west-2b",
            "CreateTime": "2024-08-28T19:42:35.650000+00:00",
            "Encrypted": false,
            "Size": 900,
            "SnapshotId": "snap-0d0840a657afe84e7",
            "State": "in-use",
            "VolumeId": "vol-0fee7609aa7757984",
            "Iops": 2700,
            "Tags": [
                {
                    "Key": "owner",
                    "Value": "mrjones"
                },
                {
                    "Key": "kubernetes.io/cluster/dev-cht-eks",
                    "Value": "owned"
                },
                {
                    "Key": "KubernetesCluster",
                    "Value": "dev-cht-eks"
                },
                {
                    "Key": "use",
                    "Value": "allies-hosting-tco-testing"
                },
                {
                    "Key": "snapshot-from",
                    "Value": "moh-zanzibar-Aug-26-2024"
                }
            ],
            "VolumeType": "gp2",
            "MultiAttachEnabled": false
        }
    ]
}

Thanks @henokgetachew !

However, this is still not working :(

I've updated this PR with the exact steps I did. I'm wondering if all the IDs in my cloned instance need to match the production instance maybe?

Anyway, here's my values file with password changed:

project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
# cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.

# Don't change upstream-servers unless you know what you're doing.
upstream_servers:
  docker_registry: "public.ecr.aws/medic"
  builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
  tag: 0.32

# CouchDB Settings
couchdb:
  password: "changme" # Avoid using non-url-safe characters in password
  secret: "0b0802c7-f6e5-4b21-850a-3c43fed2f885" # Any value, e.g. a UUID.
  user: "medic"
  uuid: "d586f89b-e849-4327-a6a8-0def2161b501" # Any UUID
  clusteredCouch_enabled: false
  couchdb_node_storage_size: 900Mi
clusteredCouch:
  noOfCouchDBNodes: 3
toleration:   # This is for the couchdb pods. Don't change this unless you know what you're doing.
  key: "dev-couchdb-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
ingress:
  annotations:
    groupname: "dev-cht-alb"
    tags: "Environment=dev,Team=QA"
    certificate: "arn:aws:iam::<account-id>:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
  # Ensure the host is not already taken. Valid characters for a subdomain are:
  #   a-z, 0-9, and - (but not as first or last character).
  host: "mrjones.dev.medicmobile.org"
  hosted_zone_id: "Z3304WUAJTCM7P"
  load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"

environment: "remote"  # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"

nodes:
  # If using clustered couchdb, add the nodes here: node-1: name-of-first-node, node-2: name-of-second-node, etc.
  # Add equal number of nodes as specified in clusteredCouch.noOfCouchDBNodes
  node-1: "" # This is the name of the first node where couchdb will be deployed
  node-2: "" # This is the name of the second node where couchdb will be deployed
  node-3: "" # This is the name of the third node where couchdb will be deployed
  # For single couchdb node, use the following:
  # Leave it commented out if you don't know what it means.
  # Leave it commented out if you want to let kubernetes deploy this on any available node. (Recommended)
  # single_node_deploy: "gamma-cht-node" # This is the name of the node where all components will be deployed - for non-clustered configuration. 

# Applicable only if using k3s
k3s_use_vSphere_storage_class: "false" # "true" or "false"
# vSphere specific configurations. If you set "true" for k3s_use_vSphere_storage_class, fill in the details below.
vSphere:
  datastoreName: "DatastoreName"  # Replace with your datastore name
  diskPath: "path/to/disk"         # Replace with your disk path

# -----------------------------------------
#       Pre-existing data section
# -----------------------------------------
couchdb_data:
  preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.

# If preExistingDataAvailable is true, fill in the details below.
# For local_storage, fill in the details if you are using k3s-k3d cluster type.
local_storage:  #If using k3s-k3d cluster type and you already have existing data.
  preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
  preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
  preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
# For ebs storage when using eks cluster type, fill in the details below.
ebs:
  preExistingEBSVolumeID: "vol-0fee7609aa7757984" # If you have already created the EBS volume, put the ID here.
  preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.

And the deploy goes well:

  deploy git:(master) ✗ ./cht-deploy -f mrjones-muso.yml 
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Fri Aug 30 21:57:00 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.

And all the resources show as started:

NAME                                           READY   STATUS    RESTARTS   AGE
pod/cht-api-8554fc5b4c-xr79j                   1/1     Running   0          14m
pod/cht-couchdb-f86c9cf47-dvqdv                1/1     Running   0          14m
pod/cht-haproxy-756f896d6d-p58h6               1/1     Running   0          14m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-z4wd5   1/1     Running   0          14m
pod/cht-sentinel-7d8987d4db-j44tz              1/1     Running   0          14m
pod/upgrade-service-67f48c5fc4-r9q7h           1/1     Running   0          14m

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/api               ClusterIP   172.20.0.14      <none>        5988/TCP                     14m
service/couchdb           ClusterIP   172.20.192.240   <none>        5984/TCP,4369/TCP,9100/TCP   14m
service/haproxy           ClusterIP   172.20.8.14      <none>        5984/TCP                     14m
service/healthcheck       ClusterIP   172.20.92.132    <none>        5555/TCP                     14m
service/upgrade-service   ClusterIP   172.20.233.206   <none>        5008/TCP                     14m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cht-api                   1/1     1            1           14m
deployment.apps/cht-couchdb               1/1     1            1           14m
deployment.apps/cht-haproxy               1/1     1            1           14m
deployment.apps/cht-haproxy-healthcheck   1/1     1            1           14m
deployment.apps/cht-sentinel              1/1     1            1           14m
deployment.apps/upgrade-service           1/1     1            1           14m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/cht-api-8554fc5b4c                   1         1         1       14m
replicaset.apps/cht-couchdb-f86c9cf47                1         1         1       14m
replicaset.apps/cht-haproxy-756f896d6d               1         1         1       14m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4   1         1         1       14m
replicaset.apps/cht-sentinel-7d8987d4db              1         1         1       14m
replicaset.apps/upgrade-service-67f48c5fc4           1         1         1       14m

But I get a 502 - bad gateway in the browser.

Couch seems in a bad way, which is likely the main problem:

[warning] 2024-08-31T04:49:55.337953Z couchdb@127.0.0.1 <0.1449.0> e669322402 couch_httpd_auth: Authentication failed for user medic from 100.64.213.104
[notice] 2024-08-31T04:49:55.338171Z couchdb@127.0.0.1 <0.1449.0> e669322402 couchdb.mrjones-dev.svc.cluster.local:5984 100.64.213.104 undefined GET /_membership 401 ok 1
[notice] 2024-08-31T04:49:55.703891Z couchdb@127.0.0.1 <0.394.0> -------- chttpd_auth_cache changes listener died because the _users database does not exist. Create the database to silence this notice.
[error] 2024-08-31T04:49:55.704074Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}

[error] 2024-08-31T04:49:55.704137Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}

With couch down, it's not worth checking, but API and sentinel are unhappy - they both have near identical 503 errors:

StatusCodeError: 503 - {"error":"503 Service Unavailable","reason":"No server is available to handle this request","server":"haproxy"}
    at new StatusCodeError (/service/api/node_modules/request-promise-core/lib/errors.js:32:15)
    at Request.plumbing.callback (/service/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/service/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at Request.self.callback (/service/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.<anonymous> (/service/api/node_modules/request/request.js:1154:10)
    at Request.emit (node:events:513:28)
    at IncomingMessage.<anonymous> (/service/api/node_modules/request/request.js:1076:12)
    at Object.onceWrapper (node:events:627:28)
    at IncomingMessage.emit (node:events:525:35) {
  statusCode: 503,
  error: {
    error: '503 Service Unavailable',
    reason: 'No server is available to handle this request',
    server: 'haproxy'
  }
}

HA Proxy is unsurprisingly 503ing:

<150>Aug 31 04:58:22 haproxy[12]: 100.64.213.102,<NOSRV>,503,0,0,0,GET,/,-,medic,'-',241,-1,-,'-'

@dianabarsan and I did deep dive into this today and my test instance now starts up instead of 502ing! However, it's on a clean install of CHT core instead of showing the cloned prod data.

At this point we suspect it might be a permissions error maybe? Per below, the volume mounts but we can't see any of the data, so that's our guess.

We found out that:

It was important to comment out the four lines starting with nodes: in the local storage section.

the volume is indeed being mounted in the cht-couchdb pod

$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- df -h             
Filesystem      Size  Used Avail Use% Mounted on
overlay         485G   40G  445G   9% /
/dev/nvme2n1    886G   71G  816G   8% /opt/couchdb/data

but there's simply no data in it:

$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- du --max-depth=1 -h /opt/couchdb/data
1.8M    /opt/couchdb/data/.shards
5.0M    /opt/couchdb/data/shards
4.0K    /opt/couchdb/data/.delete
6.9M    /opt/couchdb/data

looking at the prod instance where it was cloned from, it's mounted in the same path AND there's data in it:


kubectl config set-context arn:aws:eks:eu-west-2:720541322708:cluster/prod-cht-eks

kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5  -- df -h     
Filesystem      Size  Used Avail Use% Mounted on
overlay         485G   14G  472G   3% /
/dev/nvme1n1    886G   67G  820G   8% /opt/couchdb/data

kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5  -- du --max-depth=1 -h /opt/couchdb/data
12K /opt/couchdb/data/._users_design
12K /opt/couchdb/data/._replicator_design
33G /opt/couchdb/data/.shards
31G /opt/couchdb/data/shards
4.0K    /opt/couchdb/data/.delete
63G /opt/couchdb/data

Just to be safe, we set the password: and secret: and user: in the values file to be identical to the prod instances we cloned, and this did not fix things.
To be extra sure the data on the volume was still "valid" (in quotes because I don't know why it wouldn't be valid?!?) - I made a new clone of the volume (see vol-0cf04a56d8d59f74b) off the most recent snapshot (see snap-01a976f6a4e51684c), being sure to set all the tags correctly. This failed in the same way as above (mounted correctly, but clean install)

I have some downtime today. I will try to have a look if it's a quick thing.

Pushed a PR here. Let me know if that solves it.

Thanks so much for coming off your holiday to do some work!

Per my slack comment, I don't know how to test this branch in the cht-conf script

It doesn't release beta builds for now. If the code looks good for you then only way to test right now is to approve and merge the PR which should release a patch version of the helm charts which cht-deploy will pick when deploying

Despite it only being 7 lines of change, I'm not really in a position to know if these changes look good. I don't know helm, I don't know EKS and I believe these charts are used for every production CHT Core instance we run - which gives me pause.

I would very much like to be able to test this or defer to someone else who knows what these changes actually do.

I'll pursue the idea of running the changes manually via helm install... per this slack thread and see how far I can get.

I tried this just now and got the same result:

fully remove the current deployment: helm delete mrjones-dev --namespace mrjones-dev

make sure I'm on the correct branch:

git status
On branch user-root-for-couchdb-container
Your branch is up to date with 'origin/user-root-for-couchdb-container'.

run the extracted helm upgrade command, passing in the full path to the branch of helm charts with the changes to test: helm upgrade mrjones-dev /home/mrjones/Documents/MedicMobile/helm-charts/charts/cht-chart-4x --install --version 1.0.* --namespace mrjones-dev --values mrjones-muso.yml --set cht_image_tag=4.5.2

note that it runs successfully:

mrjones-muso.yml --set cht_image_tag=4.5.2                                                
Release "mrjones-dev" does not exist. Installing it now.                
NAME: mrjones-dev
LAST DEPLOYED: Thu Sep  5 15:22:10 2024
NAMESPACE: mrjones-dev                                                                   
STATUS: deployed            
REVISION: 1                     
TEST SUITE: None

check that the volume is mounted: kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- df -h:

Filesystem      Size  Used Avail Use% Mounted on                        
overlay         485G   42G  444G   9% /
/dev/nvme2n1    886G   67G  820G   8% /opt/couchdb/data

check that there's actually a lot of data in the volume: kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- du --max-depth=1 -h /opt/couchdb/data:
```
1.9M    /opt/couchdb/data/.shards
5.0M    /opt/couchdb/data/shards
4.0K    /opt/couchdb/data/.delete
6.9M    /opt/couchdb/data
```

medic / cht-docs

Create section in EKS docs on how to clone an instance #1502

Description

License