mongodb-labs / drivers-evergreen-tools

Scripts for MongoDB drivers to bootstrap their Evergreen configuration file - This Repository is NOT a supported MongoDB product

DRIVERS-2991 Properly kill old MO server to avoid "Address already in use" errors #504

Closed ShaneHarvey closed 4 weeks ago

ShaneHarvey commented 1 month ago

DRIVERS-2991 Properly kill old MO server to avoid "Address already in use" errors.

The ss+awk solution doesn't work on RHEL 8 anymore for unknown reasons:

$ ss -tlnp 'sport = :8889'
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        5        127.0.0.1:8889       0.0.0.0:*           users:(("mongo-orchestra",pid=7676,fd=5))
$ ss -tlnp 'sport = :8889' | awk 'NR>1 {split($7,a,","); print a[1]}'

Perhaps ss was updated and reports its output differently than it used to? Or perhaps awk has changed? Either way, using fuser should do the trick.
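One possible reading of the empty result, offered as a sketch rather than a confirmed diagnosis: in the output pasted above, the LISTEN row has only six whitespace-separated fields, so `$7` is empty and awk prints a blank line. Whether older ss versions printed an extra column is not known from this thread. Matching the `pid=` token directly avoids depending on the column count. The helper name `extract_listener_pids` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -tlnp` output on stdin and print the
# unique listener PIDs, one per line.
# It matches the literal "pid=<digits>" token instead of a field number,
# so it does not depend on how many columns ss prints.
extract_listener_pids() {
  # grep exits non-zero when nothing matches; treat that as "no PIDs".
  { grep -o 'pid=[0-9][0-9]*' || true; } | cut -d= -f2 | sort -u
}
```

Usage would look like `ss -tlnp 'sport = :8889' | extract_listener_pids`, which prints `7676` for the row shown above.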

durran commented 4 weeks ago

Unfortunately: https://spruce.mongodb.com/version/66f518d4a63b1a00079b693c/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

ShaneHarvey commented 4 weeks ago

I believe I've found the real culprit. The ss+awk command to find/kill the running MO server doesn't work on RHEL 8 anymore for unknown reasons:

$ ss -tlnp 'sport = :8889'
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        5        127.0.0.1:8889       0.0.0.0:*           users:(("mongo-orchestra",pid=7676,fd=5))
$ ss -tlnp 'sport = :8889' | awk 'NR>1 {split($7,a,","); print a[1]}'

Perhaps ss was updated and reports its output differently than it used to? Or perhaps awk has changed? Either way, using fuser should do the trick:

$ fuser 8889/tcp
8889/tcp:             8924
$ fuser --kill 8889/tcp
8889/tcp:             8924
$ fuser 8889/tcp

Testing via node here (with more tasks now): https://spruce.mongodb.com/version/66f591a5d6b5680007128d57/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
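The fuser transcript above can be wrapped so that a script only proceeds once the port is actually free. This is a minimal sketch, not the change made in this PR: the function name `kill_port_listener` and the timeout parameter are assumptions, and it relies on `fuser PORT/tcp` exiting non-zero when nothing uses the port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: kill whatever uses a TCP port, then wait until
# the port is released.
# Usage: kill_port_listener PORT [TIMEOUT_SECONDS]   (default timeout: 10)
kill_port_listener() {
  local port="$1" timeout="${2:-10}" waited=0

  # Nothing is using the port: nothing to do.
  if ! fuser "${port}/tcp" >/dev/null 2>&1; then
    return 0
  fi

  # Same invocation as in the transcript above.
  fuser --kill "${port}/tcp" >/dev/null 2>&1 || true

  # Poll once per second until fuser no longer finds a user of the port.
  while fuser "${port}/tcp" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "port ${port} still in use after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller might run `kill_port_listener 8889 || exit 1` before starting a new MO server, so that a port that cannot be freed fails early instead of surfacing later as "Address already in use".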

nbbeeken commented 4 weeks ago

This is the worst time for us to start seeing sporadic node download failures, but as long as that patch doesn't show any deep purple, this should have fixed it. TY!

ShaneHarvey commented 4 weeks ago

Okay, it looks much better, but now mongod is sometimes failing to start with "Address already in use" (here):

 [2024/09/26 10:22:42.788] {"t":{"$date":"2024-09-26T17:19:42.436+00:00"},"s":"E",  "c":"STORAGE",  "id":20568,   "ctx":"initandlisten","msg":"Error setting up listener","attr":{"error":{"code":9001,"codeName":"SocketException","errmsg":"Address already in use"}}}

My guess is that we killed the old MO server but the mongod was still left running on that port.

nbbeeken commented 4 weeks ago

We could merge this as is to get some of the issue unblocked, but is there any risk in killing the typical db ports here?
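One way to limit that risk, sketched here under assumptions and not taken from this PR or its follow-up, is to check the owning process name before killing anything on a database port. The function name `kill_mongo_on_ports` is hypothetical, the caller supplies the port list, and the sketch assumes that `fuser` writes PIDs to stdout and that `ps -o comm=` reports `mongod` or `mongos` for the servers in question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: kill processes using the given TCP ports, but only
# when the owning process is named mongod or mongos.
# Usage: kill_mongo_on_ports PORT [PORT...]
kill_mongo_on_ports() {
  local port pid name
  for port in "$@"; do
    # fuser prints PIDs on stdout; the "PORT/tcp:" label goes to stderr.
    for pid in $(fuser "${port}/tcp" 2>/dev/null); do
      # Look up the command name of the process that owns the port.
      name="$(ps -o comm= -p "$pid" 2>/dev/null || true)"
      case "$name" in
        mongod|mongos)
          echo "killing ${name} (pid ${pid}) on port ${port}"
          kill -9 "$pid" 2>/dev/null || true
          ;;
        *)
          # Leave unrelated processes alone and report them instead.
          echo "skipping pid ${pid} (${name:-unknown}) on port ${port}" >&2
          ;;
      esac
    done
  done
}
```

For example, `kill_mongo_on_ports 27017 27018` would leave an unrelated process on either port untouched; those port numbers are placeholders, not the list used by this repository.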

ShaneHarvey commented 4 weeks ago

I'll open a new PR for killing the old mongo servers.