medic / cht-core


CHT 4.x Docker services fail to communicate in multi-nodes setup #8050

Closed: mrjones-plip closed this issue 1 year ago

mrjones-plip commented 1 year ago

Describe the bug

Often, when I start my multi-node clustered CHT 4.x instance, I get errors in the API and nginx containers and the CHT fails to start.

To Reproduce

  1. Set up 4 computers as a multi-node clustered CHT 4.x instance (a sample .env for the CHT Core node is sketched after this list)
  2. Run docker compose up -d on the CHT Core node
  3. Open the CHT in a browser and try to load it
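
For reference, a minimal .env sketch for the CHT Core node, using only the variable names that appear later in this thread; every value here is a placeholder, not a real credential:

# Hypothetical .env for the CHT Core node; values are placeholders.
COUCHDB_USER=admin
COUCHDB_PASSWORD=replace-with-a-strong-password
COUCHDB_SECRET=replace-with-a-shared-secret
COUCHDB_UUID=replace-with-a-cluster-uuid
COUCHDB_SERVERS=couchdb-1.local,couchdb-2.local,couchdb-3.local
CHT_NETWORK=cht-net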

Expected behavior

The CHT shows a login page and loads correctly.

Logs

The API container has this error repeatedly:

RequestError: Error: connect ECONNREFUSED 10.0.1.7:5984
    at new RequestError (/api/node_modules/request-promise-core/lib/errors.js:14:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:87:29)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.onRequestError (/api/node_modules/request/request.js:877:8)
    at ClientRequest.emit (node:events:513:28)
    at Socket.socketErrorListener (node:_http_client:481:9)
    at Socket.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:157:8) {
  cause: Error: connect ECONNREFUSED 10.0.1.7:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1247:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.7',
    port: 5984
  },
  error: Error: connect ECONNREFUSED 10.0.1.7:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1247:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.7',
    port: 5984
  }
}

This, in turn, causes the webserver (nginx) to fail to talk to API, so loading my instance in a browser gives a 502 error:

502 Bad Gateway
nginx/1.19.6

The nginx container of course has errors too: because API can't talk to HAProxy, it never comes up, and nginx's connection to its upstream is refused:

CERTIFICATE MODE = SELF_SIGNED
Running SSL certificate checks
self signed SSL cert already exists.
Launching Nginx
/docker-entrypoint.sh: Configuration complete; ready for start up
10.131.161.1 - - [26/Jan/2023:23:33:51 +0000] "GET /medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F HTTP/1.1" 502 157 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0"
2023/01/26 23:33:51 [error] 29#29: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.131.161.1, server: , request: "GET /medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F HTTP/1.1", upstream: "http://10.0.1.4:5988/medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F", host: "10-131-161-147.my.local-ip.co"

Screenshots

(screenshot: the 502 Bad Gateway page shown in the browser)

Environment

Additional context

When testing this, everything may work on the first try. If that's the case, run the following 5 or 6 times on the CHT Core node and check in a browser whether it succeeds each time:

docker kill $(docker ps --quiet)
cd /home/ubuntu/cht/upgrade-service/
docker compose up -d
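
A rough shell loop for that repetition, assuming curl is available and reusing the instance hostname from the nginx logs above; the sleep duration is a guess and may need adjusting:

for i in 1 2 3 4 5 6; do
  # kill everything; stderr suppressed in case no containers are running
  docker kill $(docker ps --quiet) 2>/dev/null
  (cd /home/ubuntu/cht/upgrade-service/ && docker compose up -d)
  sleep 90  # give the containers time to settle
  # -k skips verification of the self-signed certificate
  curl -sk -o /dev/null -w "run $i: HTTP %{http_code}\n" \
    https://10-131-161-147.my.local-ip.co/medic/login
done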

The semi-functional workaround is to aggressively restart containers, without changing anything, and then somehow it starts working again.

There was some speculation (private Slack thread) that the fix was to reference the CouchDB nodes by their correct service names. This is not the case: I still get high occurrences of this error when I set the CouchDB servers in my CHT Core .env file to the correct names, COUCHDB_SERVERS=couchdb-1.local,couchdb-2.local,couchdb-3.local, and make each CouchDB compose file look like node 1 in this example (below).

version: '3.9'
services:
  couchdb-1.local:
    container_name: couchdb-1.local
    image: public.ecr.aws/s5s3h4s7/cht-couchdb:4.0.1-4.0.1
    volumes:
      - ./srv:/opt/couchdb/data
      - cht-credentials:/opt/couchdb/etc/local.d/
    environment:
      - "COUCHDB_USER=${COUCHDB_USER:-admin}"
      - "COUCHDB_PASSWORD=${COUCHDB_PASSWORD:?COUCHDB_PASSWORD must be set}"
      - "COUCHDB_SECRET=${COUCHDB_SECRET}"
      - "COUCHDB_UUID=${COUCHDB_UUID}"
      - "SVC_NAME=${SVC1_NAME:-couchdb-1.local}"
      - "CLUSTER_PEER_IPS=couchdb-2.local,couchdb-3.local"
      - "COUCHDB_LOG_LEVEL=${COUCHDB_LOG_LEVEL:-error}"
    logging:
      driver: "local"
      options:
        max-size: "${LOG_MAX_SIZE:-50m}"
        max-file: "${LOG_MAX_FILES:-20}"
    restart: always
    networks:
      - cht-net
      - cht-overlay
volumes:
  cht-credentials:
networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}
  cht-overlay:
    driver: overlay
    external: true
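
Since cht-overlay is declared external, it has to exist before this file can come up; presumably it was created beforehand with something like the following, assuming the hosts are already joined in a swarm (--attachable lets standalone compose containers join a swarm overlay network):

docker network create --driver overlay --attachable cht-overlay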
henokgetachew commented 1 year ago

Removing cht-net from the networks property should fix this issue.

Only the overlay network is required when running the CHT over a distributed cluster. It should be:

    networks:
      - cht-overlay

This should be the case for all other services.
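
Applied to the couchdb-1.local compose file above, that change would look roughly like this sketch (unchanged keys elided):

services:
  couchdb-1.local:
    # image, volumes, environment, logging, restart: unchanged
    networks:
      - cht-overlay
volumes:
  cht-credentials:
networks:
  # the cht-net definition is no longer needed here
  cht-overlay:
    driver: overlay
    external: true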

mrjones-plip commented 1 year ago

Yes! Thanks for the confirmation. I discovered this in my own testing as well.

mrjones-plip commented 1 year ago

Thinking on this some more, @henokgetachew: what about declaring cht-net as the overlay network before we launch any services? We'll have to change the docker network create instructions to use the new name, but that's easy.
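
The adjusted create step would then presumably mirror the cht-overlay command above, just with the new name:

docker network create --driver overlay --attachable cht-net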

Then the instructions for editing the compose files can be much simpler, something like:

Find the networks: section at the very bottom of the compose file and add two lines so it looks like this:

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}
    driver: overlay
    external: true

While we should still release standalone multi-node compose files that are ready for production, fewer manual edits are better, I think!

mrjones-plip commented 1 year ago

The latest docs PR fixes this and is now live. As there was no inherent code bug, I removed the "affects" labels for all 4.x versions.