medic / cht-core


CHT 4.x Docker services fail to communicate in multi-nodes setup #8050

Closed: mrjones-plip closed this issue 1 year ago

mrjones-plip commented 1 year ago

Describe the bug

Often, when I start my multi-node clustered CHT 4.x instance, I get errors in the API and nginx containers and the CHT fails to start.

To Reproduce

  1. Set up 4 computers as a multi-node clustered CHT 4.x instance (a sample .env for the CHT Core node is sketched after this list)
  2. Run docker compose up -d on the CHT Core node
  3. Open the CHT in a browser and try to load it
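
For reference, a minimal .env sketch for the CHT Core node, using only the variable names that appear later in this thread; every value here is a placeholder, not a real credential:

# Hypothetical .env for the CHT Core node; values are placeholders.
COUCHDB_USER=admin
COUCHDB_PASSWORD=replace-with-a-strong-password
COUCHDB_SECRET=replace-with-a-shared-secret
COUCHDB_UUID=replace-with-a-cluster-uuid
COUCHDB_SERVERS=couchdb-1.local,couchdb-2.local,couchdb-3.local
CHT_NETWORK=cht-net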

Expected behavior

The CHT shows a login page and loads correctly.

Logs

The API container has this error repeatedly:

RequestError: Error: connect ECONNREFUSED 10.0.1.7:5984
    at new RequestError (/api/node_modules/request-promise-core/lib/errors.js:14:15)
    at Request.plumbing.callback (/api/node_modules/request-promise-core/lib/plumbing.js:87:29)
    at Request.RP$callback [as _callback] (/api/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at self.callback (/api/node_modules/request/request.js:185:22)
    at Request.emit (node:events:513:28)
    at Request.onRequestError (/api/node_modules/request/request.js:877:8)
    at ClientRequest.emit (node:events:513:28)
    at Socket.socketErrorListener (node:_http_client:481:9)
    at Socket.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:157:8) {
  cause: Error: connect ECONNREFUSED 10.0.1.7:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1247:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.7',
    port: 5984
  },
  error: Error: connect ECONNREFUSED 10.0.1.7:5984
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1247:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '10.0.1.7',
    port: 5984
  }
}

This, in turn, causes the webserver (nginx) to fail to talk to API, so loading my instance in a browser gives a 502 error:

502 Bad Gateway
nginx/1.19.6

The nginx container of course has errors too: because API can't talk to HAProxy, it never comes up, and nginx's connection to its upstream is refused:

CERTIFICATE MODE = SELF_SIGNED
Running SSL certificate checks
self signed SSL cert already exists.
Launching Nginx
/docker-entrypoint.sh: Configuration complete; ready for start up
10.131.161.1 - - [26/Jan/2023:23:33:51 +0000] "GET /medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F HTTP/1.1" 502 157 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0"
2023/01/26 23:33:51 [error] 29#29: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 10.131.161.1, server: , request: "GET /medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F HTTP/1.1", upstream: "http://10.0.1.4:5988/medic/login?redirect=https%3A%2F%2F10-131-161-147.my.local-ip.co%2F", host: "10-131-161-147.my.local-ip.co"

Screenshots

(screenshot: the 502 Bad Gateway page shown in the browser)

Environment

Additional context

When testing this, everything may work on the first try. If that's the case, run the following 5 or 6 times on the CHT Core node and check in a browser whether it succeeds each time:

docker kill $(docker ps --quiet)
cd /home/ubuntu/cht/upgrade-service/
docker compose up -d
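
A rough shell loop for that repetition, assuming curl is available and reusing the instance hostname from the nginx logs above; the sleep duration is a guess and may need adjusting:

for i in 1 2 3 4 5 6; do
  # kill everything; stderr suppressed in case no containers are running
  docker kill $(docker ps --quiet) 2>/dev/null
  (cd /home/ubuntu/cht/upgrade-service/ && docker compose up -d)
  sleep 90  # give the containers time to settle
  # -k skips verification of the self-signed certificate
  curl -sk -o /dev/null -w "run $i: HTTP %{http_code}\n" \
    https://10-131-161-147.my.local-ip.co/medic/login
done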

The semi-functional workaround is to aggressively restart containers, without changing anything, and then somehow it starts working again.

There was some speculation (private Slack thread) that the fix was to reference the CouchDB nodes by their correct service names. This is not the case: I still get high occurrences of this error when I set the CouchDB servers in my CHT Core .env file to the correct names, COUCHDB_SERVERS=couchdb-1.local,couchdb-2.local,couchdb-3.local, and make each CouchDB compose file look like node 1 in this example (below).

version: '3.9'
services:
  couchdb-1.local:
    container_name: couchdb-1.local
    image: public.ecr.aws/s5s3h4s7/cht-couchdb:4.0.1-4.0.1
    volumes:
      - ./srv:/opt/couchdb/data
      - cht-credentials:/opt/couchdb/etc/local.d/
    environment:
      - "COUCHDB_USER=${COUCHDB_USER:-admin}"
      - "COUCHDB_PASSWORD=${COUCHDB_PASSWORD:?COUCHDB_PASSWORD must be set}"
      - "COUCHDB_SECRET=${COUCHDB_SECRET}"
      - "COUCHDB_UUID=${COUCHDB_UUID}"
      - "SVC_NAME=${SVC1_NAME:-couchdb-1.local}"
      - "CLUSTER_PEER_IPS=couchdb-2.local,couchdb-3.local"
      - "COUCHDB_LOG_LEVEL=${COUCHDB_LOG_LEVEL:-error}"
    logging:
      driver: "local"
      options:
        max-size: "${LOG_MAX_SIZE:-50m}"
        max-file: "${LOG_MAX_FILES:-20}"
    restart: always
    networks:
      - cht-net
      - cht-overlay
volumes:
  cht-credentials:
networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}
  cht-overlay:
    driver: overlay
    external: true
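
Since cht-overlay is declared external, it has to exist before this file can come up; presumably it was created beforehand with something like the following, assuming the hosts are already joined in a swarm (--attachable lets standalone compose containers join a swarm overlay network):

docker network create --driver overlay --attachable cht-overlay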
henokgetachew commented 1 year ago

Removing cht-net from the networks property should fix this issue.

Only the overlay network is required when running the CHT over a distributed cluster. It should be:

    networks:
      - cht-overlay

This should be the case for all other services.
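
Applied to the couchdb-1.local compose file above, that change would look roughly like this sketch (unchanged keys elided):

services:
  couchdb-1.local:
    # image, volumes, environment, logging, restart: unchanged
    networks:
      - cht-overlay
volumes:
  cht-credentials:
networks:
  # the cht-net definition is no longer needed here
  cht-overlay:
    driver: overlay
    external: true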

mrjones-plip commented 1 year ago

Yes! Thanks for the confirmation. I discovered this in my own testing as well.

mrjones-plip commented 1 year ago

Thinking on this some more, @henokgetachew: what about declaring cht-net as the overlay network before we launch any services? We'll have to change the docker network create instructions to use the new name, but that's easy.
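
The adjusted create step would then presumably mirror the cht-overlay command above, just with the new name:

docker network create --driver overlay --attachable cht-net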

Then the instructions for editing the compose files can be much simpler, something like:

Find the networks: section at the very bottom of the compose file and add two lines so it looks like this:

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}
    driver: overlay
    external: true

While we should still release standalone multi-node compose files that are ready for production, fewer manual edits are better, I think!

mrjones-plip commented 1 year ago

The latest docs PR fixes this and is now live. As there was no inherent code bug, I removed the "affects" labels for all 4.x versions.