mrjones-plip opened this issue 3 weeks ago
These steps currently don't work: instead of the snapshotted data showing up in the new CHT instance via the new volume, there is a clean install of the CHT.
@henokgetachew suggests:
The volume you created is in the wrong availability zone. For the development EKS cluster, use eu-west-2b; for the prod EKS cluster, use eu-west-2a. You are trying to attach a volume in eu-west-2a to the dev cluster. That won't work. Can you change that and test?
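A quick way to confirm which AZ a cluster's nodes are in (a sketch, assuming your kubectl context points at the cluster in question):
$ kubectl get nodes -L topology.kubernetes.io/zone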
So I'll delete the volume (and snapshot if it's a dev instance), update the steps in this PR and try again!
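For reference, the cleanup commands look roughly like this (a sketch - the mis-placed volume's ID isn't captured in this thread, so both IDs here are placeholders):
$ aws ec2 delete-volume --region eu-west-2 --volume-id vol-OLD-VOLUME-ID
$ aws ec2 delete-snapshot --region eu-west-2 --snapshot-id snap-OLD-SNAPSHOT-ID   # only if it's a dev instance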
@henokgetachew - can you take another look at what I might be doing wrong? I deleted the volume I created before and then created a new one, being sure to specify the AZ:
$ aws ec2 create-volume --region eu-west-2 --snapshot-id snap-0d0840a657afe84e7 --availability-zone eu-west-2b
Here's the description from $ aws ec2 describe-volumes --region eu-west-2 --volume-id vol-0fee7609aa7757984 | jq:
{
"Volumes": [
{
"Attachments": [],
"AvailabilityZone": "eu-west-2b",
"CreateTime": "2024-08-28T19:42:35.650000+00:00",
"Encrypted": false,
"Size": 900,
"SnapshotId": "snap-0d0840a657afe84e7",
"State": "available",
"VolumeId": "vol-0fee7609aa7757984",
"Iops": 2700,
"Tags": [
{
"Key": "owner",
"Value": "mrjones"
},
{
"Key": "kubernetes.io/cluster/dev-cht-eks",
"Value": "owned"
},
{
"Key": "KubernetesCluster",
"Value": "dev-cht-eks"
},
{
"Key": "use",
"Value": "allies-hosting-tco-testing"
},
{
"Key": "snapshot-from",
"Value": "moh-zanzibar-Aug-26-2024"
}
],
"VolumeType": "gp2",
"MultiAttachEnabled": false
}
]
}
I set the volume ID in my values file:
# tail -n4 mrjones.yml
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
And then ran the deploy:
$ ./cht-deploy -f mrjones.yml
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Wed Aug 28 13:17:27 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.
However, I get a 503 in the browser, despite all pods being up:
$ ./troubleshooting/list-all-resources mrjones-dev
NAME READY STATUS RESTARTS AGE
pod/cht-api-8554fc5b4c-sgqgt 1/1 Running 0 20m
pod/cht-couchdb-f86c9cf47-jcsxl 1/1 Running 0 20m
pod/cht-haproxy-756f896d6d-s54ns 1/1 Running 0 20m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-wtzsx 1/1 Running 0 20m
pod/cht-sentinel-7d8987d4db-m8tr2 1/1 Running 0 20m
pod/upgrade-service-67f48c5fc4-fs7fx 1/1 Running 0 20m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/api ClusterIP 172.20.25.243 <none> 5988/TCP 20m
service/couchdb ClusterIP 172.20.65.81 <none> 5984/TCP,4369/TCP,9100/TCP 20m
service/haproxy ClusterIP 172.20.249.24 <none> 5984/TCP 20m
service/healthcheck ClusterIP 172.20.176.77 <none> 5555/TCP 20m
service/upgrade-service ClusterIP 172.20.125.132 <none> 5008/TCP 20m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cht-api 1/1 1 1 20m
deployment.apps/cht-couchdb 1/1 1 1 20m
deployment.apps/cht-haproxy 1/1 1 1 20m
deployment.apps/cht-haproxy-healthcheck 1/1 1 1 20m
deployment.apps/cht-sentinel 1/1 1 1 20m
deployment.apps/upgrade-service 1/1 1 1 20m
NAME DESIRED CURRENT READY AGE
replicaset.apps/cht-api-8554fc5b4c 1 1 1 20m
replicaset.apps/cht-couchdb-f86c9cf47 1 1 1 20m
replicaset.apps/cht-haproxy-756f896d6d 1 1 1 20m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4 1 1 1 20m
replicaset.apps/cht-sentinel-7d8987d4db 1 1 1 20m
replicaset.apps/upgrade-service-67f48c5fc4 1 1 1 20m
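The same check can be done from the command line (a sketch; output not captured here):
$ curl -sI https://mrjones.dev.medicmobile.org | head -n1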
Here's my values file - password and secret changed to protect the innocent:
project_name: mrjones-dev
namespace: "mrjones-dev"
chtversion: 4.5.2
#cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
couchdb:
password: hunter2
secret: Correct-Horse-Battery-Staple
user: medic
uuid: 1c9b420e-1847-49e9-9cdf-5350b32f6c85
clusteredCouch_enabled: false
couchdb_node_storage_size: 20Gi
clusteredCouch:
noOfCouchDBNodes: 1
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local" or "remote"
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
@mrjones-plip Okay, I have finally figured out why this didn't work for you: it's your values.yaml. Your settings file is missing the main flag that tells helm to look for pre-existing volumes in the sections that follow it - couchdb_data.preExistingDataAvailable. It should be configured like this (the full corrected values file appears later in this thread):
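couchdb_data:
  preExistingDataAvailable: "true"
  ebs:
    preExistingEBSVolumeID: "vol-0fee7609aa7757984"
    preExistingEBSVolumeSize: "900Gi"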
I have tested this one and it has worked for me. Here's describe-volumes for the volume, now attached to an instance:
{
"Volumes": [
{
"Attachments": [
{
"AttachTime": "2024-08-30T16:49:08+00:00",
"Device": "/dev/xvdbh",
"InstanceId": "i-0ad3b6f9c8c82a5c9",
"State": "attached",
"VolumeId": "vol-0fee7609aa7757984",
"DeleteOnTermination": false
}
],
"AvailabilityZone": "eu-west-2b",
"CreateTime": "2024-08-28T19:42:35.650000+00:00",
"Encrypted": false,
"Size": 900,
"SnapshotId": "snap-0d0840a657afe84e7",
"State": "in-use",
"VolumeId": "vol-0fee7609aa7757984",
"Iops": 2700,
"Tags": [
{
"Key": "owner",
"Value": "mrjones"
},
{
"Key": "kubernetes.io/cluster/dev-cht-eks",
"Value": "owned"
},
{
"Key": "KubernetesCluster",
"Value": "dev-cht-eks"
},
{
"Key": "use",
"Value": "allies-hosting-tco-testing"
},
{
"Key": "snapshot-from",
"Value": "moh-zanzibar-Aug-26-2024"
}
],
"VolumeType": "gp2",
"MultiAttachEnabled": false
}
]
}
Thanks @henokgetachew!
However, this is still not working :(
I've updated this PR with the exact steps I took. I'm wondering if maybe all the IDs in my cloned instance need to match the production instance? Anyway, here's my values file with the password changed:
project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
# cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
# Don't change upstream-servers unless you know what you're doing.
upstream_servers:
docker_registry: "public.ecr.aws/medic"
builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
tag: 0.32
# CouchDB Settings
couchdb:
password: "changme" # Avoid using non-url-safe characters in password
secret: "0b0802c7-f6e5-4b21-850a-3c43fed2f885" # Any value, e.g. a UUID.
user: "medic"
uuid: "d586f89b-e849-4327-a6a8-0def2161b501" # Any UUID
clusteredCouch_enabled: false
couchdb_node_storage_size: 900Mi
clusteredCouch:
noOfCouchDBNodes: 3
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::<account-id>:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
# Ensure the host is not already taken. Valid characters for a subdomain are:
# a-z, 0-9, and - (but not as first or last character).
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"
nodes:
# If using clustered couchdb, add the nodes here: node-1: name-of-first-node, node-2: name-of-second-node, etc.
# Add equal number of nodes as specified in clusteredCouch.noOfCouchDBNodes
node-1: "" # This is the name of the first node where couchdb will be deployed
node-2: "" # This is the name of the second node where couchdb will be deployed
node-3: "" # This is the name of the third node where couchdb will be deployed
# For single couchdb node, use the following:
# Leave it commented out if you don't know what it means.
# Leave it commented out if you want to let kubernetes deploy this on any available node. (Recommended)
# single_node_deploy: "gamma-cht-node" # This is the name of the node where all components will be deployed - for non-clustered configuration.
# Applicable only if using k3s
k3s_use_vSphere_storage_class: "false" # "true" or "false"
# vSphere specific configurations. If you set "true" for k3s_use_vSphere_storage_class, fill in the details below.
vSphere:
datastoreName: "DatastoreName" # Replace with your datastore name
diskPath: "path/to/disk" # Replace with your disk path
# -----------------------------------------
# Pre-existing data section
# -----------------------------------------
couchdb_data:
preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.
# If preExistingDataAvailable is true, fill in the details below.
# For local_storage, fill in the details if you are using k3s-k3d cluster type.
local_storage: #If using k3s-k3d cluster type and you already have existing data.
preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
# For ebs storage when using eks cluster type, fill in the details below.
ebs:
preExistingEBSVolumeID: "vol-0fee7609aa7757984" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.
And the deploy goes well:
deploy git:(master) ✗ ./cht-deploy -f mrjones-muso.yml
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "medic" chart repository
Update Complete. ⎈Happy Helming!⎈
Release exists. Performing upgrade.
Release "mrjones-dev" has been upgraded. Happy Helming!
NAME: mrjones-dev
LAST DEPLOYED: Fri Aug 30 21:57:00 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 2
TEST SUITE: None
Instance at https://mrjones.dev.medicmobile.org upgraded successfully.
And all the resources show as started:
NAME READY STATUS RESTARTS AGE
pod/cht-api-8554fc5b4c-xr79j 1/1 Running 0 14m
pod/cht-couchdb-f86c9cf47-dvqdv 1/1 Running 0 14m
pod/cht-haproxy-756f896d6d-p58h6 1/1 Running 0 14m
pod/cht-haproxy-healthcheck-7c8d4dbfb4-z4wd5 1/1 Running 0 14m
pod/cht-sentinel-7d8987d4db-j44tz 1/1 Running 0 14m
pod/upgrade-service-67f48c5fc4-r9q7h 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/api ClusterIP 172.20.0.14 <none> 5988/TCP 14m
service/couchdb ClusterIP 172.20.192.240 <none> 5984/TCP,4369/TCP,9100/TCP 14m
service/haproxy ClusterIP 172.20.8.14 <none> 5984/TCP 14m
service/healthcheck ClusterIP 172.20.92.132 <none> 5555/TCP 14m
service/upgrade-service ClusterIP 172.20.233.206 <none> 5008/TCP 14m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cht-api 1/1 1 1 14m
deployment.apps/cht-couchdb 1/1 1 1 14m
deployment.apps/cht-haproxy 1/1 1 1 14m
deployment.apps/cht-haproxy-healthcheck 1/1 1 1 14m
deployment.apps/cht-sentinel 1/1 1 1 14m
deployment.apps/upgrade-service 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/cht-api-8554fc5b4c 1 1 1 14m
replicaset.apps/cht-couchdb-f86c9cf47 1 1 1 14m
replicaset.apps/cht-haproxy-756f896d6d 1 1 1 14m
replicaset.apps/cht-haproxy-healthcheck-7c8d4dbfb4 1 1 1 14m
replicaset.apps/cht-sentinel-7d8987d4db 1 1 1 14m
replicaset.apps/upgrade-service-67f48c5fc4 1 1 1 14m
But I get a 502 Bad Gateway in the browser.
Couch seems in a bad way, which is likely the main problem:
[warning] 2024-08-31T04:49:55.337953Z couchdb@127.0.0.1 <0.1449.0> e669322402 couch_httpd_auth: Authentication failed for user medic from 100.64.213.104
[notice] 2024-08-31T04:49:55.338171Z couchdb@127.0.0.1 <0.1449.0> e669322402 couchdb.mrjones-dev.svc.cluster.local:5984 100.64.213.104 undefined GET /_membership 401 ok 1
[notice] 2024-08-31T04:49:55.703891Z couchdb@127.0.0.1 <0.394.0> -------- chttpd_auth_cache changes listener died because the _users database does not exist. Create the database to silence this notice.
[error] 2024-08-31T04:49:55.704074Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}
[error] 2024-08-31T04:49:55.704137Z couchdb@127.0.0.1 emulator -------- Error in process <0.1467.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,430}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,405}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,434}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,100}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,39}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,214}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,160}]}]}
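A quick way to confirm which databases this couch actually sees (a sketch, assuming curl is available in the container; substitute the real password):
$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-jcsxl -- curl -s http://medic:PASSWORD@localhost:5984/_all_dbs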
With couch down, it's not worth checking much further, but API and sentinel are unhappy - they both have near-identical 503 errors:
StatusCodeError: 503 - {"error":"503 Service Unavailable","reason":"No server is available to handle this request","server":"haproxy"}
at new StatusCodeError (/service/api/node_modules/request-promise-core/lib/errors.js:32:15)
at Request.plumbing.callback (/service/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
at Request.RP$callback [as _callback] (/service/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
at Request.self.callback (/service/api/node_modules/request/request.js:185:22)
at Request.emit (node:events:513:28)
at Request.<anonymous> (/service/api/node_modules/request/request.js:1154:10)
at Request.emit (node:events:513:28)
at IncomingMessage.<anonymous> (/service/api/node_modules/request/request.js:1076:12)
at Object.onceWrapper (node:events:627:28)
at IncomingMessage.emit (node:events:525:35) {
statusCode: 503,
error: {
error: '503 Service Unavailable',
reason: 'No server is available to handle this request',
server: 'haproxy'
}
}
HAProxy is unsurprisingly 503ing (the <NOSRV> in the log below means haproxy had no backend server to send the request to):
<150>Aug 31 04:58:22 haproxy[12]: 100.64.213.102,<NOSRV>,503,0,0,0,GET,/,-,medic,'-',241,-1,-,'-'
@dianabarsan and I did a deep dive into this today and my test instance now starts up instead of 502ing! However, it's a clean install of CHT Core instead of showing the cloned prod data.
At this point we suspect a permissions error: per below, the volume mounts, but we can't see any of the data.
We found out that the nodes: settings in the local storage section weren't the problem - the volume does mount in the cht-couchdb pod:
$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 485G 40G 445G 9% /
/dev/nvme2n1 886G 71G 816G 8% /opt/couchdb/data
$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- du --max-depth=1 -h /opt/couchdb/data
1.8M /opt/couchdb/data/.shards
5.0M /opt/couchdb/data/shards
4.0K /opt/couchdb/data/.delete
6.9M /opt/couchdb/data
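Given the permissions guess, ownership of the data directory is worth checking too (a sketch - we didn't capture this output):
$ kubectl -n mrjones-dev exec -it cht-couchdb-f86c9cf47-5msts -- ls -ln /opt/couchdb/data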
Looking at the prod instance it was cloned from, the volume is mounted at the same path AND there's data in it:
kubectl config set-context arn:aws:eks:eu-west-2:720541322708:cluster/prod-cht-eks
kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5 -- df -h
Filesystem Size Used Avail Use% Mounted on
overlay 485G 14G 472G 3% /
/dev/nvme1n1 886G 67G 820G 8% /opt/couchdb/data
kubectl -n moh-zanzibar-prod exec -it cht-couchdb-1-cb788fc65-vjsn5 -- du --max-depth=1 -h /opt/couchdb/data
12K /opt/couchdb/data/._users_design
12K /opt/couchdb/data/._replicator_design
33G /opt/couchdb/data/.shards
31G /opt/couchdb/data/shards
4.0K /opt/couchdb/data/.delete
63G /opt/couchdb/data
Other things we tried:
- We set password: and secret: and user: in the values file to be identical to the prod instance we cloned, and this did not fix things.
- We created a brand new volume (vol-0cf04a56d8d59f74b) off the most recent snapshot (see snap-01a976f6a4e51684c), being sure to set all the tags correctly (see the sketch below). This failed in the same way as above (mounted correctly, but clean install).
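For reference, creating a volume with the tags set in one shot looks roughly like this (a sketch; tag values taken from the describe-volumes output above):
$ aws ec2 create-volume --region eu-west-2 --availability-zone eu-west-2b --snapshot-id snap-01a976f6a4e51684c --tag-specifications 'ResourceType=volume,Tags=[{Key=kubernetes.io/cluster/dev-cht-eks,Value=owned},{Key=KubernetesCluster,Value=dev-cht-eks},{Key=owner,Value=mrjones}]'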
I have some downtime today. I will try to have a look if it's a quick thing.
Pushed a PR here. Let me know if that solves it.
Thanks so much for coming off your holiday to do some work!
Per my Slack comment, I don't know how to test this branch with the cht-deploy script.
It doesn't release beta builds for now. If the code looks good to you, the only way to test right now is to approve and merge the PR, which will release a patch version of the helm charts that cht-deploy will pick up when deploying.
Despite it only being 7 lines of change, I'm not really in a position to know if these changes look good. I don't know helm, I don't know EKS, and I believe these charts are used for every production CHT Core instance we run - which gives me pause.
I would very much like to be able to test this, or defer to someone else who knows what these changes actually do.
I'll pursue the idea of running the changes manually via helm install per this slack thread and see how far I can get.
I tried this just now and got the same result. First I deleted the existing release:
$ helm delete mrjones-dev --namespace mrjones-dev
Then, from a checkout of the helm-charts branch:
$ git status
On branch user-root-for-couchdb-container
Your branch is up to date with 'origin/user-root-for-couchdb-container'.
I ran a helm upgrade command, passing in the full path to the branch of helm charts with the changes to test:
$ helm upgrade mrjones-dev /home/mrjones/Documents/MedicMobile/helm-charts/charts/cht-chart-4x --install --version 1.0.* --namespace mrjones-dev --values mrjones-muso.yml --set cht_image_tag=4.5.2
Release "mrjones-dev" does not exist. Installing it now.
NAME: mrjones-dev
LAST DEPLOYED: Thu Sep 5 15:22:10 2024
NAMESPACE: mrjones-dev
STATUS: deployed
REVISION: 1
TEST SUITE: None
Here's $ kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- df -h:
Filesystem Size Used Avail Use% Mounted on
overlay 485G 42G 444G 9% /
/dev/nvme2n1 886G 67G 820G 8% /opt/couchdb/data
And $ kubectl -n mrjones-dev exec -it cht-couchdb-57c74f9fc-qtrx5 -- du --max-depth=1 -h /opt/couchdb/data:
1.9M /opt/couchdb/data/.shards
5.0M /opt/couchdb/data/shards
4.0K /opt/couchdb/data/.delete
6.9M /opt/couchdb/data
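One more check that would be worth doing here: confirming the PV actually points at the pre-existing EBS volume ID rather than a freshly provisioned disk (a sketch; we didn't capture this output):
$ kubectl -n mrjones-dev get pvc
$ kubectl get pv -o yaml | grep -iE 'volumeHandle|volumeID'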
Description
Create section in EKS docs on how to clone an instance
License
The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.