ossrs / srs

SRS is a simple, high-efficiency, real-time video server supporting RTMP, WebRTC, HLS, HTTP-FLV, SRT, MPEG-DASH, and GB28181.
https://ossrs.io
MIT License

Gracefully quit does not work for SRS 5.0 #3805

Open winlinvip opened 9 months ago

winlinvip commented 9 months ago

Description

  1. SRS Version: 5.0, 6.0

Replay

Step 1: Run SRS

./objs/srs -c conf/console.conf

Step 2: Publish a stream by FFmpeg

ffmpeg -re -i doc/source.flv -c copy -f flv rtmp://localhost/live/livestream

Step 3: Gracefully quit SRS

./etc/init.d/srs grace

Expect

SRS should wait for the RTMP connection to close, or for a timeout, before quitting. Instead, SRS quits right away without waiting for the connection to close, about one second after receiving the signal:

[2023-09-19 09:32:47.833][INFO][85369][79696179] sig=3, user start gracefully quit
[2023-09-19 09:32:48.780][INFO][85369][4h04hzr0] Hybrid cpu=0.00%,0MB, cid=2,1, timer=62,0,0, clock=0,42,5,0,0,0,0,0,0, objs=(pkt:47,raw:30,fua:16,msg:162,oth:1,buf:27)
[2023-09-19 09:32:48.786][INFO][85369][79696179] cleanup for quit signal fast=0, grace=1
[2023-09-19 09:32:48.794][INFO][85369][i67224b2] Process: cpu=0.00%,0MB, threads=2
[2023-09-19 09:32:48.794][INFO][85369][i67224b2] quit for thread #2(hybrid) finished
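The expected behavior can be written as a small decision rule: after the quit signal, keep running until every connection has closed or a grace timeout has expired. The sketch below is a hypothetical model of that rule, not SRS code; `QuitDecision`, `grace_quit_step`, `simulate_grace_quit`, and the per-tick polling are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical grace-quit policy (names are not from SRS): once listeners are
// closed, wait until all connections are gone or the grace timeout expires.
enum class QuitDecision { KeepWaiting, QuitDrained, QuitTimedOut };

QuitDecision grace_quit_step(std::size_t active_connections,
                             std::chrono::milliseconds elapsed,
                             std::chrono::milliseconds grace_timeout) {
    if (active_connections == 0) return QuitDecision::QuitDrained;
    if (elapsed >= grace_timeout) return QuitDecision::QuitTimedOut;
    return QuitDecision::KeepWaiting;
}

// Applies the policy to per-tick connection counts (a stand-in for polling
// the server) and returns how long the process waited before quitting.
std::chrono::milliseconds simulate_grace_quit(const std::vector<std::size_t>& counts_per_tick,
                                              std::chrono::milliseconds tick,
                                              std::chrono::milliseconds grace_timeout) {
    assert(tick.count() > 0);
    std::chrono::milliseconds elapsed{0};
    for (std::size_t count : counts_per_tick) {
        if (grace_quit_step(count, elapsed, grace_timeout) != QuitDecision::KeepWaiting) {
            return elapsed;
        }
        elapsed += tick;
    }
    return elapsed;  // no more samples: the caller would force the quit here
}
```

With one RTMP publisher still connected, the rule keeps the process alive; the reported behavior corresponds to quitting without ever consulting the connection count.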

Solution

See https://gitee.com/ossrs/srs/pulls/2/files

void SrsServerAdapter::stop()
{
+  if (srs) {
+    srs->stop();
+  }
}
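The patch matters because, assuming the hybrid server stops each registered server on quit, an empty `stop()` in the adapter means the request never reaches the server that owns the listeners. The sketch below models that chain with hypothetical, simplified classes (`Server`, `HybridItem`, `ServerAdapter`, `HybridServer`); the names only mirror the issue and are not the real SRS API.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the server that owns the TCP listeners.
class Server {
public:
    void stop() { listening_ = false; }  // close listeners, stop accepting
    bool listening() const { return listening_; }
private:
    bool listening_ = true;
};

// Assumed interface for anything the hybrid server manages.
class HybridItem {
public:
    virtual ~HybridItem() = default;
    virtual void stop() = 0;
};

class ServerAdapter : public HybridItem {
public:
    explicit ServerAdapter(std::shared_ptr<Server> srs) : srs_(std::move(srs)) {}
    // Before the patch this body was empty, so stop() never reached the server.
    void stop() override {
        if (srs_) {
            srs_->stop();
        }
    }
private:
    std::shared_ptr<Server> srs_;
};

// Forwards stop() to every registered item.
class HybridServer {
public:
    void add(std::unique_ptr<HybridItem> item) { items_.push_back(std::move(item)); }
    void stop() {
        for (auto& item : items_) item->stop();
    }
private:
    std::vector<std::unique_ptr<HybridItem>> items_;
};
```

The null check mirrors the patch: an adapter whose server was never created can still be stopped safely.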
winlinvip commented 9 months ago

Currently, only SrsServer receives the SIGQUIT message from the signal manager. RtcServer does not, so when SIGQUIT is sent to SRS, only the services in SrsServer are stopped; the UDP servers in RtcServer keep running. As a result, new WebRTC STUN requests still go to the old process.
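One way to picture the fix is a dispatcher that fans the quit message out to every server owning sockets, TCP or UDP, instead of only one. This is a hypothetical sketch (`QuitDispatcher` is not an SRS class), and it assumes the signal has already been turned into an ordinary message by the signal manager; calling `std::function` handlers directly from an async signal handler would not be safe.

```cpp
#include <cassert>
#include <csignal>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical dispatcher: each server that owns listeners registers a
// handler, so SIGQUIT reaches the RTMP side and the RTC UDP side alike.
class QuitDispatcher {
public:
    void subscribe(std::function<void()> on_quit) {
        handlers_.push_back(std::move(on_quit));
    }
    // Expects a signal number already delivered as a normal message.
    void dispatch(int signo) {
        if (signo != SIGQUIT) return;  // only graceful quit is fanned out here
        for (auto& handler : handlers_) handler();
    }
private:
    std::vector<std::function<void()>> handlers_;
};
```

If the RTC server never subscribes, its UDP port stays bound after SIGQUIT, which matches the symptom described above.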

TRANS_BY_GPT4

winlinvip commented 9 months ago

I suspect SRT and GB28181 haven't implemented graceful quit yet either.

TRANS_BY_GPT4

winlinvip commented 9 months ago

Implementing graceful quit for UDP-based protocols isn't easy, because UDP is connectionless and keeps no per-connection state.

Also, all WebRTC sessions share the same UDP ports. Currently it appears to reuse the same FD as well, so it needs special handling.

As for SRT, its UDP sockets are managed by the underlying library, so implementing this would require a closer look at how that library works.

TRANS_BY_GPT4

winlinvip commented 9 months ago

Just to add: you don't really need to worry about WebRTC's UDP. Turning off the HTTP API is enough, so WebRTC gets graceful quit by default, because new WebRTC sessions are set up through the HTTP API. The main point of graceful quit is to stop accepting new connections and then force an exit after a certain period of time.

SRT might be a bit more challenging, because its new connections also arrive over UDP. GB28181 currently supports only TCP, so closing its listener is enough.
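The per-protocol conclusions of this discussion can be summarized as a lookup from protocol to the action that stops new connections. This is an illustrative sketch only; the `Protocol` enum and the action strings are hypothetical and simply restate the comments in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of how each protocol stops accepting new connections
// during graceful quit, as discussed in this thread (not SRS identifiers).
enum class Protocol { Rtmp, HttpApi, WebRtc, Srt, Gb28181 };

std::string grace_quit_action(Protocol p) {
    switch (p) {
    case Protocol::Rtmp:
    case Protocol::HttpApi:
    case Protocol::Gb28181:  // GB28181 currently uses TCP only
        return "close TCP listener";
    case Protocol::WebRtc:   // new sessions are set up through the HTTP API
        return "nothing extra: closing the HTTP API stops new sessions";
    case Protocol::Srt:      // new connections arrive on the shared UDP port
        return "close SRT listener";
    }
    return "unknown";
}
```

The design point is that only the entry point for new sessions has to close; existing sessions keep running until they end or the forced exit arrives.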

TRANS_BY_GPT4

xiaozhihong commented 9 months ago

To gracefully shut down SRT, you only need to close the listener. If there are no active SRT connections, the UDP socket is unbound. If there are, it waits until all connections are gone before unbinding. If a new connection arrives before the socket is unbound, its handshake packet is received, but the handshake fails because the SRT listener is already closed. If the client retries a few times, a retry may reach the new server.
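The behavior described above can be modeled as a shared UDP socket that is unbound only when the listener is closed and no connection remains. This is a simplified model drawn from the comment, not libsrt code; `SrtUdpMux` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: the listener and its accepted connections share one
// UDP socket, which is unbound only when both the listener and all
// connections are gone.
class SrtUdpMux {
public:
    void close_listener() {
        listener_open_ = false;
        maybe_unbind();
    }
    // A new client's handshake succeeds only while the listener is open.
    bool on_handshake() {
        if (!bound_ || !listener_open_) return false;
        ++connections_;
        return true;
    }
    void on_connection_closed() {
        assert(connections_ > 0);
        --connections_;
        maybe_unbind();
    }
    bool bound() const { return bound_; }
    std::size_t connections() const { return connections_; }
private:
    void maybe_unbind() {
        if (!listener_open_ && connections_ == 0) bound_ = false;
    }
    bool listener_open_ = true;
    bool bound_ = true;
    std::size_t connections_ = 0;
};
```

While an old connection keeps the socket bound, new handshakes still land on the old process and fail, which is why client retries are what eventually move them to the new server.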

TRANS_BY_GPT4

jinleileiking commented 4 months ago
void SrsServerAdapter::stop()
{
+  if (srs) {
+    srs->stop();
+  }
}

I have tested this patch. TCP seems to work well.

jinleileiking commented 4 months ago

A relatively straightforward implementation: upon receiving a kill signal, stop listening on both UDP and TCP, and let a later signal trigger the actual exit. The practical steps are as follows:

  1. Receive SIGTERM signal.
  2. Cluster B starts up.
  3. Cluster A waits for N minutes (to allow time for Cluster B to start successfully) and does not listen for new connections.
  4. Cluster B begins to accept new traffic.
  5. The traffic on Cluster A gradually diminishes (possibly over the course of one month).
  6. Manually inspect the traffic on Cluster A to ensure there is no remaining traffic before killing the pod.
  7. After receiving the kill command five times, the program exits directly.
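The signal handling in these steps can be sketched as a small state machine: the first SIGTERM stops all listeners, and the fifth one exits. This is a hypothetical sketch of the proposal, not SRS code; `DrainController` is an illustrative name, and it assumes `kill` sends the default SIGTERM each time.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical controller for the proposal above (not SRS code): the first
// SIGTERM stops all TCP and UDP listeners while existing sessions continue;
// the process exits once SIGTERM has been received kKillsToExit times.
class DrainController {
public:
    static constexpr int kKillsToExit = 5;  // "five times" from step 7

    // Returns true when the process should exit now.
    bool on_signal(int signo) {
        if (signo != SIGTERM) return false;
        ++term_count_;
        if (term_count_ == 1) listening_ = false;  // stop accepting new connections
        return term_count_ >= kKillsToExit;
    }
    bool listening() const { return listening_; }
private:
    int term_count_ = 0;
    bool listening_ = true;
};
```

Counting signals keeps the final exit in the operator's hands, which fits step 6, where traffic is inspected manually before the pod is killed.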

TRANS_BY_GPT4

winlinvip commented 4 months ago

This mechanism has issues that need to be examined when time permits. It would be even better if you could submit a pull request to resolve them.

TRANS_BY_GPT4