psql: error: could not connect to server: Operation timed out

ethindp commented 3 years ago

I recently rebooted my server and am trying to start all the services again. However, psql is failing to connect to the matrix-postgres server during the task "Execute Postgres additional database initialization SQL file for synapse". The command is:

/usr/bin/env docker run --rm --user=967:1000 --cap-drop=ALL --env-file=/matrix/postgres/env-postgres-psql --network matrix --mount type=bind,src=/tmp/matrix-postgres-init-additional-db-user-and-role.sql,dst=/matrix-postgres-init-additional-db-user-and-role.sql,ro --entrypoint=/bin/sh docker.io/postgres:13.4-alpine -c psql -h matrix-postgres --file=/matrix-postgres-init-additional-db-user-and-role.sql

The output is as follows:

  delta: '0:02:09.840770'
  end: '2021-08-28 20:04:21.943018'
  msg: non-zero return code
  rc: 2
  start: '2021-08-28 20:02:12.102248'
  stderr: |-
    psql: error: could not connect to server: Operation timed out
            Is the server running on host "matrix-postgres" (172.18.0.3) and accepting
            TCP/IP connections on port 5432?
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

Environment:

Docker version: 20.10.8, API 1.41, containerd v1.5.5, runc 1.0.2, docker-init 0.19.0
Kernel version: 5.13.13-hardened1-1-hardened
Ansible version: 4.4.0
Repository commit: 12a172f0

Output of docker ps:

CONTAINER ID   IMAGE                                     COMMAND                  CREATED              STATUS                          PORTS                                                                                                                                                                                                                                               NAMES
9e7753049eb6   ma1uta/ma1sd:2.5.0                        "/start.sh"              1 second ago         Up 1 second                     127.0.0.1:8090->8090/tcp                                                                                                                                                                                                                            matrix-ma1sd
ad1d21a56aec   turt2live/matrix-dimension:latest         "/docker-entrypoint.…"   About a minute ago   Up About a minute               127.0.0.1:8184->8184/tcp                                                                                                                                                                                                                            matrix-dimension
a9a4a260b2a2   matrixdotorg/synapse:v1.40.0              "/start.py run -m sy…"   About a minute ago   Up About a minute (unhealthy)   127.0.0.1:8008->8008/tcp, 8009/tcp, 127.0.0.1:8048->8048/tcp, 8448/tcp                                                                                                                                                                              matrix-synapse
3eed1267ee6d   zeratax/matrix-registration:v0.7.2        "matrix-registration…"   2 minutes ago        Up 2 minutes                    127.0.0.1:8767->5000/tcp                                                                                                                                                                                                                            matrix-registration
dd4831494d07   coturn/coturn:4.5.2-r2-alpine             "turnserver -c /turn…"   18 minutes ago       Up 18 minutes                   0.0.0.0:3478->3478/tcp, 0.0.0.0:3478->3478/udp, :::3478->3478/tcp, :::3478->3478/udp, 0.0.0.0:5349->5349/udp, :::5349->5349/udp, 0.0.0.0:5349->5349/tcp, 0.0.0.0:49152-49172->49152-49172/udp, :::5349->5349/tcp, :::49152-49172->49152-49172/udp   matrix-coturn
a396f794637a   awesometechnologies/synapse-admin:0.8.1   "/docker-entrypoint.…"   18 minutes ago       Up 18 minutes                   127.0.0.1:8766->80/tcp                                                                                                                                                                                                                              matrix-synapse-admin
089346dbdfd8   localhost/vectorim/hydrogen-web:v0.2.5    "/docker-entrypoint.…"   18 minutes ago       Up 18 minutes                   80/tcp, 127.0.0.1:8768->8080/tcp                                                                                                                                                                                                                    matrix-client-hydrogen
7362d069e51b   devture/exim-relay:4.94.2-r0-3            "/sbin/tini -- exim …"   18 minutes ago       Up 18 minutes                   8025/tcp                                                                                                                                                                                                                                            matrix-mailer
340b3163d2f8   postgres:13.4-alpine                      "docker-entrypoint.s…"   18 minutes ago       Up 18 minutes                   5432/tcp                                                                                                                                                                                                                                            matrix-postgres
2d3908781969   vectorim/element-web:v1.8.1               "/docker-entrypoint.…"   18 minutes ago       Up 18 minutes                   80/tcp, 127.0.0.1:8765->8080/tcp                                                                                                                                                                                                                    matrix-client-element

Output of docker network ls:

NETWORK ID     NAME            DRIVER    SCOPE
394244faa111   bridge          bridge    local
e6abbcac20b1   host            host      local
cfb99dbf4a08   matrix          bridge    local
fae1de789b24   matrix-coturn   bridge    local
4420cf5296ca   none            null      local

I hope I provided enough information. Is there a reason why this is happening?

spantaleev commented 3 years ago

It's probably the fact that Postgres is slow to start on your server and the playbook did not wait long enough for it to become available:

https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/40a72b2567cf9cbe5ce97d2fe17f752e84ef387e/roles/matrix-postgres/tasks/util/create_additional_databases.yml#L10-L15

https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/40a72b2567cf9cbe5ce97d2fe17f752e84ef387e/roles/matrix-postgres/defaults/main.yml#L75-L82

If you re-run the exact same command a second time, it succeeds, doesn't it? We should probably increase the wait time.
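The "wait until Postgres accepts connections" idea can be illustrated with a generic retry loop. This is only a minimal shell sketch, not the playbook's actual implementation (the playbook handles the wait in Ansible); the function name `retry_until_ready` and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds
# or the attempt budget is used up.
# Usage: retry_until_ready MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ready() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Stop as soon as the probed command exits with status 0
        if "$@"; then
            return 0
        fi
        # Sleep only if another attempt will follow
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    # All attempts failed
    return 1
}
```

Assuming the `matrix` network and `matrix-postgres` host name from the report, a probe could look like `retry_until_ready 30 2 docker run --rm --network matrix docker.io/postgres:13.4-alpine pg_isready -h matrix-postgres`. Waiting longer only helps if Postgres is merely slow to start; it does nothing if the container network itself is broken.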

ethindp commented 3 years ago

@spantaleev That's what I thought, so I re-ran it. When that failed, I posted this issue. I re-ran it once more a couple of days ago and, as I expected, it failed again. I may need to rebuild all the containers from scratch. I assume that if I keep /matrix around, all my data will be restored? Also, I did use the "stop" tag before I rebooted my server, so maybe that interfered or caused problems. I was attempting to shut everything down cleanly.

spantaleev commented 3 years ago

Strange... It usually works the 2nd time around.

Perhaps your container networking is borked and rebooting the server may help. You seem to have done that though.

So I'm not sure what would cause networking issues like that. Do you have SELinux enabled or some other security technology like that, which could be interfering? I see that you're running some kind of hardened kernel.

ethindp commented 3 years ago

@spantaleev No, I don't have SELinux or anything similar enabled yet, though enabling it is a good idea. I just have the standard hardened Linux kernel. It worked before, so I have no idea what changed.

Xeboc commented 3 years ago

I've just run into this issue as well, on Ubuntu 21.04 with similar Docker versions and a regular kernel. In my case I had an overly aggressive match in systemd-networkd's config:

/etc/systemd/network/99-all.network
[Match]
Name=*

...


I had to make the match more specific to the local interface (`Name=ens*`) to keep systemd-networkd from interfering with the docker veth and bridge interfaces.  
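To spot this kind of catch-all match on another machine, something like the following could help. It is a rough sketch under stated assumptions: the function names are hypothetical, it only detects a bare `Name=*` line, and it does not parse sections or handle lists of several glob patterns.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the file contains a bare "Name=*" line.
has_wildcard_match() {
    grep -Eq '^[[:space:]]*Name[[:space:]]*=[[:space:]]*\*[[:space:]]*$' "$1"
}

# Hypothetical helper: print every .network file in a directory
# that contains such a catch-all line.
list_wildcard_network_files() {
    for f in "$1"/*.network; do
        # Skip the unexpanded glob when the directory has no .network files
        [ -e "$f" ] || continue
        if has_wildcard_match "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `list_wildcard_network_files /etc/systemd/network` would list a file like the `99-all.network` above, while a file that matches `Name=ens*` would not be reported.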

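You can check whether a docker bridge is functioning by looking for an IP address and an active link on the `br-<uuid>` interface. In the output below the bridge is down (`NO-CARRIER`, `state DOWN`) and has only a link-local IPv6 address: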

5: br-d900f507f32a: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:55:39:58:a0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::42:55ff:fe39:58a0/64 scope link
       valid_lft forever preferred_lft forever


And by this quick test:

docker network create test
docker run --rm --net test --name nginx -d nginx
docker run --rm --net test -it busybox wget -q -O - nginx



If the bridge network works, the busybox container gets a response from nginx. If not, wget fails with a "No route to host" error:

> wget: can't connect to remote host (172.18.0.2): No route to host

It doesn't show up in logs very well because docker creates the networks correctly and systemd-networkd changes them afterwards. Kernel logs / `dmesg` had clues about the link going down and networkd activity.
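The bridge check described above can also be scripted. The sketch below is illustrative only: `bridge_status` is a hypothetical helper that reads `ip addr show` output for a single interface on stdin and prints a one-word verdict. Keep in mind that a bridge with no containers attached also reports `NO-CARRIER`, so the result only means something while at least one container is running on that network.

```shell
#!/bin/sh
# Hypothetical helper: classify one interface from `ip addr show` output on stdin.
# Prints "link-down" if the link is down, "no-ipv4" if the link is up but no
# IPv4 address is assigned, and "ok" otherwise.
bridge_status() {
    awk '
        /state DOWN/ || /NO-CARRIER/ { down = 1 }
        $1 == "inet"                 { ipv4 = 1 }
        END {
            if (down)       print "link-down"
            else if (!ipv4) print "no-ipv4"
            else            print "ok"
        }
    '
}
```

With the interface name from the example above, this would be used as `ip addr show dev br-d900f507f32a | bridge_status`.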

ethindp commented 3 years ago

This issue just mysteriously vanished after a reboot, so closing this.

spantaleev / matrix-docker-ansible-deploy

psql: error: could not connect to server: Operation timed out #1255