ShaneHarvey closed this 4 weeks ago
Believe I've found the real culprit. The ss+awk command to find/kill the running MO server doesn't work on RHEL 8 anymore for unknown reasons:
$ ss -tlnp 'sport = :8889'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 5 127.0.0.1:8889 0.0.0.0:* users:(("mongo-orchestra",pid=7676,fd=5))
$ ss -tlnp 'sport = :8889' | awk 'NR>1 {split($7,a,","); print a[1]}'
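In the output above, the `users:((...))` entry is the sixth whitespace-separated field on the LISTEN line, so `$7` is empty and the awk prints nothing. Whether other ss versions emit an extra column is not confirmed here. Below is a minimal sketch of a parser that does not depend on column position: it matches the `pid=NNN` token instead. The function name `ss_listener_pids` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PID of each process in `ss -tlnp` output read on stdin.
# It matches pid=NNN tokens rather than relying on a fixed field number,
# so an added or removed column does not break it.
ss_listener_pids() {
  # Skip the header row, extract every pid=NNN token, keep only the number.
  tail -n +2 | grep -oE 'pid=[0-9]+' | cut -d= -f2
}

# Usage (root is needed to see other users' processes):
#   ss -tlnp 'sport = :8889' | ss_listener_pids
```

If several processes share the listening socket, each PID is printed on its own line.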
Perhaps ss was updated and reports output differently than it used to? Or perhaps awk has changed? Either way using fuser should do the trick:
$ fuser 8889/tcp
8889/tcp: 8924
$ fuser --kill 8889/tcp
8889/tcp: 8924
$ fuser 8889/tcp
Testing via node here (with more tasks now): https://spruce.mongodb.com/version/66f591a5d6b5680007128d57/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
This is the worst possible time for us to see sporadic node download failures, but as long as that patch doesn't show any deep purple, this should have fixed it. TY!
Okay, it looks much better, but now mongod sometimes fails to start with "Address already in use" (here):
[2024/09/26 10:22:42.788] {"t":{"$date":"2024-09-26T17:19:42.436+00:00"},"s":"E", "c":"STORAGE", "id":20568, "ctx":"initandlisten","msg":"Error setting up listener","attr":{"error":{"code":9001,"codeName":"SocketException","errmsg":"Address already in use"}}}
My guess is that we killed the old MO server, but the mongod it started was left running on that port.
We could merge this as is to unblock part of the issue, but is there any risk in killing the typical db ports here?
I'll open a new PR for killing the old mongo servers.
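One way to also clear leftover mongod processes is to run the same `fuser --kill` over each port in turn. Below is a minimal sketch; the function name `kill_port_listeners`, the `DRY_RUN` switch, and the port list (8889 for the MO server plus MongoDB's default mongod, shardsvr, and configsvr ports 27017 to 27019) are assumptions for illustration, not the project's actual configuration.

```shell
#!/usr/bin/env bash
# Hypothetical cleanup helper: kill whatever is listening on each TCP port given.
# Set DRY_RUN=1 to print the commands instead of running them.
kill_port_listeners() {
  local port
  for port in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "fuser --kill $port/tcp"
    else
      # fuser exits non-zero when nothing is using the port; treat that as success.
      fuser --kill "$port/tcp" || true
    fi
  done
}

# Example (assumed port list):
#   kill_port_listeners 8889 27017 27018 27019
```

On the risk question: `fuser --kill` sends SIGKILL to any process using those ports, so on a dedicated CI host this only hits stale test servers, while on a shared machine it would also kill unrelated processes bound to them.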
DRIVERS-2991 Properly kill old MO server to avoid "Address already in use" errors.
The ss+awk solution doesn't work on RHEL 8 anymore for unknown reasons:
Perhaps ss was updated and reports output differently than it used to? Or perhaps awk has changed? Either way using fuser should do the trick.