I'm seeing this get fixed if I add a DNS resolver in the haproxy config - specifically, the DNS resolver for the docker network, the same one we added for nginx. However, I worry about embedding this in the image, since we use haproxy in k8s as well.
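For reference, a minimal sketch of the kind of resolver section that makes haproxy re-resolve the server name at runtime, assuming Docker's embedded DNS at 127.0.0.11 (the backend and server names here are illustrative, not the actual CHT config):

resolvers dockerdns
    # Docker's embedded DNS, available inside user-defined docker networks
    nameserver docker 127.0.0.11:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry 1s
    hold valid 10s

backend couchdb_servers
    # resolve the service name via dockerdns instead of pinning the container IP
    server couchdb couchdb:5984 check resolvers dockerdns init-addr libc,none

With init-addr libc,none haproxy also starts even if the name does not resolve yet, instead of failing at boot.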
I believe this is now affecting one of our prod instances, which still runs on Docker and has a very flaky CouchDb due to upgrade efforts. https://github.com/medic/cht-core/issues/9286
It seems that adding the DNS resolver breaks the deployment for k8s (as expected). I'm considering passing this as an environment variable.
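One shape this could take is a small entrypoint that only emits the resolver section when an env var is set; everything below (variable name, paths) is a hypothetical sketch, and the server lines referencing the resolver would need the same conditional treatment:

#!/bin/sh
# Hypothetical: HAPROXY_DNS_RESOLVER is set only for docker deployments
# (e.g. 127.0.0.11) and left unset on k8s, where kube-dns handles resolution.
if [ -n "$HAPROXY_DNS_RESOLVER" ]; then
  cat >> /usr/local/etc/haproxy/haproxy.cfg <<EOF
resolvers dockerdns
    nameserver docker ${HAPROXY_DNS_RESOLVER}:53
EOF
fi
exec haproxy -f /usr/local/etc/haproxy/haproxy.cfg "$@"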
I've tested a local k3d deployment with a single CouchDb, and scaled that CouchDb. Services recovered automatically.
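For the record, one way to force that restart under k3d (the resource name and kind are assumptions, not necessarily the actual CHT chart objects):

kubectl scale deployment cht-couchdb --replicas=0
kubectl rollout status deployment cht-couchdb
kubectl scale deployment cht-couchdb --replicas=1
kubectl rollout status deployment cht-couchdb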
Having had a conversation with @Hareet, it seems that the test over k3d might be sufficient, but we could test on k3s as well.
Describe the bug
A couchdb restart in single-node docker takes down the whole instance. I believe this is due to CouchDb receiving a new IP in the docker network after restart - the new IP is a fact, but I have not definitively proven it is the cause of the failure.
To Reproduce
Steps to reproduce the behavior:
docker stop cht_couchdb_1
docker start cht_couchdb_1
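To confirm the suspected IP change itself, the container's address can be compared before and after the restart, e.g. with docker inspect:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cht_couchdb_1
docker stop cht_couchdb_1 && docker start cht_couchdb_1
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cht_couchdb_1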
Expected behavior
Services should come back online after one service fails.
Logs
Haproxy continuously reports NOSRV errors like:
CouchDb continuously reports successful calls to membership (presumably coming from the healthcheck) but no other incoming requests:
Healthcheck logs are silent.
Api logs report:
nginx reports:
Environment
Additional context
I wrote an e2e test for this, where I restart all couchdb services in the docker cluster, and the e2e test passes. I believe this is because at least one couchdb server ends up back on an IP that haproxy tries to access. We've had a similar problem with nginx DNS resolution: https://github.com/medic/cht-core/issues/8205 .
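For comparison, the usual nginx-side pattern for that class of fix also re-resolves via Docker's embedded DNS; a sketch with illustrative names, not necessarily what #8205 actually shipped:

location / {
    # re-resolve the upstream name periodically instead of caching its IP forever
    resolver 127.0.0.11 valid=10s;
    set $upstream http://api:5988;
    proxy_pass $upstream;
}

Using a variable in proxy_pass is what forces nginx to consult the resolver at request time rather than once at startup.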