openimsdk / open-im-server

IM Chat
https://openim.io
Apache License 2.0

[BUG] server crash (msg-gateway process) #2483

Open ting-xu opened 3 months ago

ting-xu commented 3 months ago

OpenIM Server Version

3.7.0

Operating System and CPU Architecture

Linux (AMD)

Deployment Method

Docker Deployment

Bug Description and Steps to Reproduce

We have two machines deployed, using etcd for service discovery.

In our online production environment, the server crashed twice at different times; both crashes showed the same Prometheus metrics changes beforehand.

I took some time to dig into the code. I guess the root cause is in the implementation of `func (ws *WsServer) Run(done chan error) error`: this function starts the websocket server and a SINGLE goroutine that processes all messages from 3 channels ONE BY ONE, sequentially. (Our version is 3.7.0, but it's the same in the 3.8.0 code.)

Though I have no absolute evidence, I think something happened that blocked this goroutine; the register channel then quickly filled up with new messages (the channel buffer size is 1000). After that, each new incoming websocket request forced the server to create yet another goroutine that blocked on the channel write. Meanwhile memory grew and grew, and finally the process crashed.
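To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern as I understand it (this is NOT the actual open-im-server code; the names are illustrative, and only the 1000-slot buffer size comes from the real code):

```go
package main

import "time"

type Client struct{ userID string }

// One goroutine drains all three channels sequentially; if any branch blocks,
// nothing else is drained.
func runProcessor(register, unregister chan *Client, kick chan string) {
	for {
		select {
		case c := <-register:
			handleRegister(c) // in the real server this path can call the peer instance over gRPC
		case c := <-unregister:
			_ = c // unregister handling
		case id := <-kick:
			_ = id // multi-terminal login / kick handling
		}
	}
}

// Stand-in for a handler that hangs, e.g. a deadline-less gRPC call to a peer
// whose TCP connection silently died.
func handleRegister(c *Client) { select {} }

func main() {
	register := make(chan *Client, 1000) // 1000 is the buffer size in the real code
	unregister := make(chan *Client, 1000)
	kick := make(chan string, 1000)
	go runProcessor(register, unregister, kick)

	// Each websocket upgrade spawns a handler goroutine. Once the processor
	// stalls and the 1000-slot buffer fills, every new handler blocks on the
	// send below, so goroutines and memory grow without bound.
	for {
		go func() { register <- &Client{userID: "u"} }()
		time.Sleep(time.Millisecond)
	}
}
```

Running this, the goroutine count and memory climb steadily once the buffer fills, which matches what we observed before the crash.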

In our cloud environment, network jitter is not uncommon due to cloud provider factors. So one possible case is: network jitter causes packet loss, and the gRPC request to the other machine's instance never gets a response. According to the code, the gRPC call has no timeout (the context is a private implementation with no deadline), so the call waits forever and blocks all subsequent channel messages. At the application layer, the gRPC dial does not set the TCP keepalive option, so the receiving side never learns that the TCP connection is gone; it just waits.
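To illustrate the missing-deadline half of this (the helper below is hypothetical, not an open-im-server function): wrapping the call in `context.WithTimeout` turns an indefinite hang into a DeadlineExceeded error that releases the processing goroutine.

```go
package example

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// invokeWithDeadline is a hypothetical helper, not open-im-server code.
// Without a deadline, conn.Invoke on a silently dead peer can wait forever;
// with one, the call fails with codes.DeadlineExceeded after five seconds
// and the caller (here, the gateway's single processing goroutine) moves on.
func invokeWithDeadline(conn *grpc.ClientConn, method string, req, reply any) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return conn.Invoke(ctx, method, req, reply)
}
```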

My suggested quick fix is to set TCP keepalive in the gRPC dial options; a better improvement would be to rework this single-goroutine processing.
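A sketch of the dial-side keepalive option (illustrative helper, not open-im-server code; the interval values are placeholders and must stay within whatever the server's keepalive enforcement policy permits):

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive enables HTTP/2-level keepalive pings so the client
// notices a dead peer even when the kernel never reports the broken TCP
// connection (e.g. after packet loss from network jitter).
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after this much idle time
			Timeout:             3 * time.Second,  // declare the connection dead if the ping is not acked
			PermitWithoutStream: true,             // keep probing even with no active RPCs
		}),
	)
}
```

With something like this in place, a hung connection is torn down by the transport and in-flight calls fail fast instead of blocking the processing goroutine indefinitely.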

Screenshots Link

No response

kubbot commented 3 months ago

Hello! Thank you for filing an issue.

If this is a bug report, please include relevant logs to help us debug the problem.

Join slack 🤖 to connect and communicate with our developers.

skiffer-git commented 3 months ago
(screenshot attachment)

ting-xu commented 3 months ago

Another important clue: when the problem happened, both machines' msg-gateway processes started working abnormally at the same time.
Each had the same Prometheus metric changes (QPS dropped to 0, goroutines and memory grew).

So the whole service went down, since the two machines were not independent of each other.

ting-xu commented 2 months ago

Is there any schedule or plan for this problem? Currently, to mitigate the service instability it causes, we have to run only one msg-gw process instance on one of the two machines, while keeping the other processes running on both machines.