Open trbutler opened 1 week ago
I still haven't been able to solve this, but I did set up a second Netbird server and moved the peers over to it. So far, I haven't seen the same issue there, which makes me think it may have something to do with the upgrade path to the latest containers. It still worries me, though, since it took a clean slate, with all the peers manually reconnected to a new installation of the server, to get things up and running again. I'm going to wipe the old server, but I've left it up for the moment in case there is any debug data you'd like from it before I do.
Describe the problem
I frequently receive
Signal: Disconnected, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded.
on all of my Netbird clients. The issue appears to be degrading from something that caused intermittent communications problems to a situation where Netbird is almost completely non-functional to most of my clients. Inexplicably, a few continue to work.
I've tried adapting my Netbird "quick start" self-hosted configuration to alleviate the issue. I moved from using Caddy to NGINX for reverse proxy. This sped things up a fair amount and reduced resource usage, but didn't fix the issue. I also tried directly exposing Signal (which I had Docker translate from 443 to port 30006) while giving it access to NGINX's SSL certificate, so that a reverse proxy was not involved at all. None of these three different arrangements resolved the issue.
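One way to compare the proxied and directly exposed arrangements from a client is to time a plain TCP connection to each endpoint. The following is only a minimal sketch: the function names `tcp_connect_seconds` and `probe_all` are hypothetical helpers, and any host name passed in is a placeholder for the real server. It checks only that the port accepts connections within the timeout; it does not exercise TLS or the gRPC stream that Signal uses, so a fast result here does not rule out the `DeadlineExceeded` error.

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout_s=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    Failure covers refused connections, DNS errors, and timeouts alike.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def probe_all(endpoints, timeout_s=5.0):
    """Probe each (host, port) pair and return {(host, port): seconds or None}."""
    return {
        (host, port): tcp_connect_seconds(host, port, timeout_s)
        for host, port in endpoints
    }
```

For example, `probe_all([("netbird.example.com", 443), ("netbird.example.com", 30006)])` would time both the reverse-proxied port and the directly exposed Signal port, where `netbird.example.com` stands in for the actual server name.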
When proxied through NGINX, the NGINX error log is filled with entries like this:
The Signal docker container doesn't show anything unusual, even when set to debug mode on the logs; it simply shows many messages being conveyed between peers.
To Reproduce
Steps to reproduce the behavior:
netbird up
netbird status
will report the issue.
Expected behavior
I'd expect Netbird to be able to connect to the Signal server without issue.
Are you using NetBird Cloud?
I'm using self-hosted netbird.
NetBird version
netbird version
NetBird status -dA output:
Do you face any (non-mobile) client issues?
Yes, the issue prevents clients from functioning. Presently most clients cannot connect, although a few consistently do. There is no rhyme or reason I've been able to discern: with two clients in the same location, one consistently connects and one does not, and the variation does not appear to relate to platform (some of the working clients run macOS, others run Debian Linux). Reauthorizing the clients with a new setup key doesn't seem to change things for better or worse; it is as if they are "stuck" either working or not.
(Although all clients show the Signal error given above at least part of the time.)
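To put a number on "at least part of the time", the status output can be sampled on a client at intervals. This is a rough sketch, not a NetBird tool: `signal_state`, `is_signal_connected`, and `poll` are hypothetical helper names, and the parsing assumes the status output contains a line beginning with `Signal:` followed by either `Connected` or `Disconnected, reason: ...`, as in the error quoted above.

```python
import subprocess
import time


def signal_state(status_output):
    """Return the text following 'Signal:' in the status output, or None."""
    for line in status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Signal:"):
            return stripped[len("Signal:"):].strip()
    return None


def is_signal_connected(status_output):
    """True only when a Signal line is present and reports 'Connected'."""
    state = signal_state(status_output)
    return state is not None and state.startswith("Connected")


def poll(samples=10, interval_s=30):
    """Run `netbird status` repeatedly; return (connected_count, samples)."""
    connected = 0
    for i in range(samples):
        result = subprocess.run(
            ["netbird", "status"], capture_output=True, text=True
        )
        if is_signal_connected(result.stdout):
            connected += 1
        if i < samples - 1:
            time.sleep(interval_s)  # wait between samples, not after the last
    return connected, samples
```

Running `poll()` on one client that works and one that does not would show whether the "working" clients also lose Signal intermittently or stay connected throughout.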
Additional context
I'm using a modified version of the docker-compose.yml that was available back in December 2023. It's been upgraded to add the new relay container, remove Caddy (as noted above as part of troubleshooting), expose the NGINX SSL cert to Signal, etc. Because it is from last year, it uses CockroachDB instead of PostgreSQL. I've wondered about finding a way to migrate cleanly to PostgreSQL, though I don't know if that'd materially affect this problem or not.
My docker-compose.yml: