gotify / server

A simple server for sending and receiving messages in real-time per WebSocket. (Includes a sleek web-ui)
https://gotify.net

Networking problem: intermittent "No route to host" / "Failed to connect" errors #708

Closed: gboudreau closed this issue 3 weeks ago

gboudreau commented 3 weeks ago

Describe your problem

I'm running Gotify using Docker on a Debian host. It listens on port 8003 (port 80 in container is mapped to port 8003 on host).

gb@server $ sudo netstat -anp | grep 8003
tcp        0      0 0.0.0.0:8003            0.0.0.0:*               LISTEN      3039255/docker-prox 

gb@server $ docker inspect gotify | grep HostPort
                        "HostPort": "8003"

Most of the time, roughly 9 times out of 10, when I try to connect to Gotify from the server itself, I get a "No route to host" error. This happens whether I use my LAN IP or the Docker network's IP:

gb@server $ curl http://192.168.155.88:8003/
curl: (7) Failed to connect to 192.168.155.88 port 8003: No route to host

gb@server $ curl http://172.18.0.1:8003/
curl: (7) Failed to connect to 172.18.0.1 port 8003: No route to host

But sometimes, for no apparent reason, it just works...

gb@server $ curl http://172.18.0.1:8003/
<!doctype html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" ...

If I use localhost or 127.0.0.1 to connect, it also works only 1/10 times, but the error is Connection reset by peer:

gb@server $ curl http://localhost:8003/
curl: (56) Recv failure: Connection reset by peer

gb@server $ curl http://127.0.0.1:8003/
curl: (56) Recv failure: Connection reset by peer

This problem doesn't happen with other Docker containers I'm running. I have about 60 in total, most of them listening on some port, and they all seem to work fine.

When I try to connect to Gotify from a remote host, using the LAN IP (or the VPN IP), it works 9 times out of 10 (i.e. much more often). And when it fails, the error is "Couldn't connect to server" after what seems like a random time between 2 and 20 seconds:

gb@workstation $ curl "http://192.168.155.88:8003/"
<!doctype html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" ...

gb@workstation $ curl "http://192.168.155.88:8003/"
curl: (7) Failed to connect to 192.168.155.88 port 8003 after 3115 ms: Couldn't connect to server

gb@workstation $ curl "http://192.168.155.88:8003/"
curl: (7) Failed to connect to 192.168.155.88 port 8003 after 16215 ms: Couldn't connect to server

And sometimes, even when it works, the response takes much longer than the usual fraction of a second:

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m0.141s

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m0.634s

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m0.348s

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m0.240s

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m0.442s

gb@workstation $ time bash -c 'curl -so /dev/null "http://192.168.155.88:8003/" ; echo $?'
0
real    0m19.558s  # Worked, but took 20s
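
For anyone trying to reproduce this, the manual curl runs above could be automated along these lines. This is only a minimal sketch, not something from the original report: the function names (classify, probe_once, summarize, run) are made up for illustration, and it assumes a plain HTTP/1.0 request to "/" is enough to trigger the same failures curl saw.

```python
import errno
import socket
import time
from collections import Counter

# Map the errno values behind the curl messages above to short labels.
ERRNO_LABELS = {
    errno.EHOSTUNREACH: "no_route_to_host",
    errno.ECONNREFUSED: "connection_refused",
    errno.ECONNRESET: "connection_reset",
    errno.ETIMEDOUT: "timeout",
}


def classify(exc):
    """Return a short label for a socket-level exception."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    return ERRNO_LABELS.get(getattr(exc, "errno", None), "other_error")


def probe_once(host, port, timeout=30.0):
    """Connect, send a minimal HTTP request, wait for the first reply byte.

    Returns (label, elapsed_seconds).
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            request = b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n"
            sock.sendall(request)
            label = "ok" if sock.recv(1) else "closed_without_reply"
    except OSError as exc:
        label = classify(exc)
    return label, time.monotonic() - start


def summarize(results):
    """Tally outcome labels and report the slowest attempt in seconds."""
    counts = Counter(label for label, _ in results)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return dict(counts), slowest


def run(host, port, attempts=50, pause=0.2):
    """Probe repeatedly and return (counts_by_label, slowest_seconds)."""
    results = []
    for _ in range(attempts):
        results.append(probe_once(host, port))
        time.sleep(pause)
    return summarize(results)
```

Calling run("192.168.155.88", 8003) from the workstation and run("127.0.0.1", 8003) from the server would give a failure rate per error type instead of an impression from a handful of attempts.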

When an error occurs, whether I connect from the server or from the workstation, the Gotify logs show nothing. And when a response takes 20s to come back, the logs don't reflect that; they always show a 50-70µs response time:

2024-10-26T08:23:03-04:00 | 200 |      75.321µs |  192.168.155.44 | GET      "/"
2024-10-26T08:23:04-04:00 | 200 |      63.134µs |  192.168.155.44 | GET      "/"
2024-10-26T08:23:05-04:00 | 200 |      62.587µs |  192.168.155.44 | GET      "/"
2024-10-26T08:23:06-04:00 | 200 |       57.22µs |  192.168.155.44 | GET      "/"
2024-10-26T08:23:07-04:00 | 200 |        51.8µs |  192.168.155.44 | GET      "/"

The only error I can see in the Gotify logs appears when a remote client (e.g. the mobile app on Android) does manage to connect: after a while, the websocket gets disconnected with an i/o timeout error:

2024-10-26T08:37:33-04:00 | 200 |   13.719278ms |      172.18.0.1 | GET      "/stream?token=[masked]"
2024-10-26T08:37:36-04:00 | 200 |     974.083µs |      172.18.0.1 | GET      "/message?limit=10"
WebSocket: ReadError read tcp 172.18.0.29:80->172.18.0.1:56152: i/o timeout

I have no firewall set up, ping always works, and the routing table looks fine:

$ sudo iptables -S INPUT
-P INPUT ACCEPT

$ sudo route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         modem           0.0.0.0         UG    0      0        0 enp6s0f1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-c99fa089877c
172.19.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-f2c2237e5ed4
192.168.155.0   0.0.0.0         255.255.255.0   U     0      0        0 enp6s0f1
192.168.156.0   0.0.0.0         255.255.255.0   U     0      0        0 nebula1

$ ping 192.168.155.88
PING 192.168.155.88 (192.168.155.88) 56(84) bytes of data.
64 bytes from 192.168.155.88: icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from 192.168.155.88: icmp_seq=2 ttl=64 time=0.036 ms
64 bytes from 192.168.155.88: icmp_seq=3 ttl=64 time=0.041 ms
64 bytes from 192.168.155.88: icmp_seq=4 ttl=64 time=0.035 ms
^C
--- 192.168.155.88 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3073ms
rtt min/avg/max/mdev = 0.035/0.045/0.071/0.014 ms

Any ideas on how I could debug this further?

najtin commented 3 weeks ago

There is most likely a problem in your network setup. Maybe you will find some leads there.

eternal-flame-AD commented 3 weeks ago

As discussed above, this unfortunately feels like a networking issue, and a fairly specific one. Posting more information (the output of ip addr, ip rule, ip neigh, and all iptables chains) may or may not help. Have you run any other server application on Docker before?
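
Gathering that output into a single report could look roughly like the sketch below. The helper names (run_command, collect_report) are hypothetical, and covering only the filter and nat tables is an assumption; other tables such as mangle or raw can be added to the list if needed. The iptables commands generally require root.

```python
import subprocess

# Commands named in the comment above. "iptables -S" dumps the filter table;
# "-t nat" is added because Docker's port mappings live in the nat table.
DIAGNOSTIC_COMMANDS = [
    ["ip", "addr"],
    ["ip", "rule"],
    ["ip", "neigh"],
    ["iptables", "-S"],
    ["iptables", "-t", "nat", "-S"],
]


def run_command(argv, timeout=10):
    """Run one command and return its output, or a note if it could not run."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        return f"(could not run {argv[0]}: {exc})"
    except subprocess.TimeoutExpired:
        return f"(timed out after {timeout}s)"
    return (proc.stdout + proc.stderr).strip()


def collect_report(commands=DIAGNOSTIC_COMMANDS):
    """Build one plain-text report with a '$ command' header per section."""
    sections = []
    for argv in commands:
        sections.append("$ " + " ".join(argv))
        sections.append(run_command(argv))
        sections.append("")
    return "\n".join(sections)
```

Printing collect_report() as root gives one block of text that can be pasted into the issue.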

For further debugging this may be helpful:

https://hub.docker.com/r/alpine/socat/

gboudreau commented 3 weeks ago

I installed tcpdump to look further into what was happening. To cut down on noise in the tcpdump output, I stopped a cloudflared (tunnel) container, and I also changed net.core.wmem_max and net.core.rmem_max to 7500000 (a recommendation I found in the cloudflared logs). After all that, the problem was gone... ¯\_(ツ)_/¯
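
It is not clear from the thread which of those changes actually made the difference. For anyone who wants to compare their own host against the buffer limits mentioned, a read-only check might look like the sketch below. It assumes the usual Linux mapping of sysctl keys to files under /proc/sys, and the function names are made up for illustration.

```python
from pathlib import Path

# Value reported in the comment above (recommended in the cloudflared logs).
TARGET_BYTES = 7500000
KEYS = ("net.core.rmem_max", "net.core.wmem_max")


def sysctl_path(key, root="/proc/sys"):
    """Translate a sysctl key such as net.core.rmem_max to its /proc path."""
    return Path(root, *key.split("."))


def check_buffers(root="/proc/sys", target=TARGET_BYTES):
    """Return {key: (current_value, meets_target)} for each buffer limit."""
    report = {}
    for key in KEYS:
        current = int(sysctl_path(key, root).read_text().strip())
        report[key] = (current, current >= target)
    return report
```

check_buffers() only reads the current values. Changing them requires root, for example with sysctl -w net.core.rmem_max=7500000, and the setting does not survive a reboot unless it is also added to the sysctl configuration.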