medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

CouchDb restart causes all services to go down #9284

Closed dianabarsan closed 1 month ago

dianabarsan commented 2 months ago

Describe the bug A CouchDB restart in single-node Docker takes down the whole instance. I believe this is because CouchDB receives a new IP on the Docker network after the restart - the IP change is a fact, but I have not definitively proven it is the cause of the failure.

To Reproduce Steps to reproduce the behavior:

  1. Launch a single-node CouchDB 4.9.0 instance in Docker
  2. Run docker stop cht_couchdb_1
  3. Run docker start cht_couchdb_1
  4. Observe that the instance never comes back up automatically.

Expected behavior Services should come back online automatically after a failed service restarts.

Logs Haproxy continuously reports NOSRV errors like:

<150>Jul 25 18:11:03 haproxy[12]: 172.18.0.9,<NOSRV>,503,0,1001,0,GET,/,-,admin,'-',241,-1,-,'-'
<150>Jul 25 18:11:03 haproxy[12]: 172.18.0.6,<NOSRV>,503,0,1001,0,GET,/,-,admin,'-',241,-1,-,'-'
<150>Jul 25 18:11:04 haproxy[12]: 172.18.0.9,<NOSRV>,503,0,1002,0,GET,/,-,admin,'-',241,-1,-,'-'
<150>Jul 25 18:11:04 haproxy[12]: 172.18.0.6,<NOSRV>,503,0,1002,0,GET,/,-,admin,'-',241,-1,-,'-'
<150>Jul 25 18:11:05 haproxy[12]: 172.18.0.9,<NOSRV>,503,0,1002,0,GET,/,-,admin,'-',241,-1,-,'-'

CouchDB continuously reports successful calls to /_membership (presumably coming from the healthcheck) but no other incoming requests:

[notice] 2024-07-25T18:11:52.724582Z couchdb@127.0.0.1 <0.8989.1> fb1e784e50 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:11:57.512394Z couchdb@127.0.0.1 <0.9027.1> 8c2f2b1382 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:12:02.518797Z couchdb@127.0.0.1 <0.9084.1> 9d37d60d92 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:12:07.552834Z couchdb@127.0.0.1 <0.9125.1> fd140170b9 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:12:12.401881Z couchdb@127.0.0.1 <0.9163.1> 5f8b37fb3b couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:12:17.502015Z couchdb@127.0.0.1 <0.9204.1> 20052022c7 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0
[notice] 2024-07-25T18:12:22.733579Z couchdb@127.0.0.1 <0.9246.1> 7585e2e002 couchdb:5984 172.18.0.7 admin GET /_membership 200 ok 0

Healthcheck logs are silent.

API logs report the same StatusCodeError repeatedly:

StatusCodeError: 503 - {"error":"503 Service Unavailable","reason":"No server is available to handle this request","server":"haproxy"}
    at new StatusCodeError (/service/api/node_modules/request-promise-core/lib/errors.js:32:15)
    at plumbing.callback (/service/api/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/service/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/service/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:518:28)
    at Request.<anonymous> (/service/api/node_modules/request/request.js:1154:10)
    at Request.emit (node:events:518:28)
    at IncomingMessage.<anonymous> (/service/api/node_modules/request/request.js:1076:12)
    at Object.onceWrapper (node:events:632:28)
    at IncomingMessage.emit (node:events:530:35) {
  statusCode: 503,
  error: {
    error: '503 Service Unavailable',
    reason: 'No server is available to handle this request',
    server: 'haproxy'
  }
}

nginx reports:

2024/07/25 18:40:28 [error] 43#43: *5757 connect() failed (111: Connection refused) while connecting to upstream, client: 172.18.0.1, server: , request: "GET /medic-user-admin-meta/_all_docs?startkey=%22feedback%22&endkey=%22feedback%EF%BF%B0%22&limit=1000 HTTP/2.0", upstream: "http://172.18.0.6:5988/medic-user-admin-meta/_all_docs?startkey=%22feedback%22&endkey=%22feedback%EF%BF%B0%22&limit=1000", host: "127.0.0.1", referrer: "https://127.0.0.1/"
172.18.0.1 - - [25/Jul/2024:18:40:28 +0000] "GET /medic-user-admin-meta/_all_docs?startkey=%22feedback%22&endkey=%22feedback%EF%BF%B0%22&limit=1000 HTTP/2.0" 502 72 "https://127.0.0.1/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"


Additional context I wrote an e2e test for this, restarting all CouchDB services of a clustered Docker setup, and the e2e test passes. I believe that is because at least one CouchDB node ends up back on an IP that haproxy is already trying to reach. We've had a similar problem with nginx DNS resolution: https://github.com/medic/cht-core/issues/8205 .
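For reference, the nginx-side mitigation for this class of problem (not necessarily the exact change made in #8205) is to point nginx at Docker's embedded DNS server and use a variable upstream, which forces re-resolution at request time instead of pinning the IP resolved at startup. A minimal sketch, with the upstream name `api` purely illustrative:

```nginx
# 127.0.0.11 is Docker's embedded DNS server; re-check answers every 10s
resolver 127.0.0.11 valid=10s ipv6=off;

location / {
    # Assigning the upstream to a variable makes nginx resolve the name
    # per request rather than caching the IP from startup.
    set $upstream_api http://api:5988;
    proxy_pass $upstream_api;
}
```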

dianabarsan commented 2 months ago

I'm seeing this get fixed if I add a DNS resolver to the haproxy config - specifically the Docker network's embedded resolver, the same one we added for nginx. However, I'm wary of embedding this in the image, since we use haproxy in k8s as well.
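A minimal sketch of what that haproxy change looks like (backend and server names here are illustrative, not taken from the actual CHT config; 127.0.0.11 is Docker's embedded DNS server):

```
resolvers docker
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold valid      10s

backend couchdb-servers
    # "resolvers docker" makes haproxy re-resolve the hostname at runtime,
    # so a restarted container's new IP is picked up instead of the stale one.
    server couchdb couchdb:5984 check resolvers docker init-addr libc,none
```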

dianabarsan commented 2 months ago

I believe this is now affecting one of our prod instances, which still runs on Docker and has a very flaky CouchDB due to upgrade efforts: https://github.com/medic/cht-core/issues/9286

It seems that adding the DNS resolver fails the deployment on k8s (as expected). I'm considering passing this as an environment variable.
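One way the environment-variable approach could work is for the image's entrypoint to emit the resolvers section only when the deployment opts in, so the same image serves both Docker (pass the embedded DNS address) and k8s (leave the variable unset). This is a hypothetical sketch: `HAPROXY_DNS_RESOLVER` is an assumed variable name, not an existing one.

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: render a haproxy "resolvers" section
# only when HAPROXY_DNS_RESOLVER is set (e.g. 127.0.0.11 for Docker).
# On k8s the variable stays unset and no resolver section is emitted.

render_resolvers() {
  if [ -n "${HAPROXY_DNS_RESOLVER:-}" ]; then
    cat <<EOF
resolvers docker
    nameserver dns1 ${HAPROXY_DNS_RESOLVER}:53
    hold valid 10s
EOF
  fi
}

# With the variable unset, this prints nothing.
render_resolvers
```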

dianabarsan commented 2 months ago

I've tested a k3d local deployment with a single CouchDB, and scaled that CouchDB down and back up. Services recovered automatically.

dianabarsan commented 2 months ago

Having a conversation with @Hareet , it seems the test over k3d might be sufficient, but we could test on k3s as well.