pion / webrtc

Pure Go implementation of the WebRTC API
https://pion.ly
MIT License

DataChannel breaks when machine is under load (race condition?) #2152

Closed iamcalledrob closed 2 years ago

iamcalledrob commented 2 years ago

Your environment.

What did you do?

I've been running into this when using DataChannels in a high-throughput scenario. As a simple repro, the problem can be demonstrated with the built-in "data-channels-flow-control" example.

Easy repro steps:

  1. Run the data-channels-flow-control example. (cd examples/data-channels-flow-control; go run main.go)
  2. Max out the CPU. (Run yes > /dev/null & once per core to keep them all busy.)

What did you expect?

  1. The transfer to remain reliable, potentially with lower throughput.

What happened?

  1. The example begins working and showing throughput, e.g. Throughput: 632.810 Mbps.
  2. top shows significant CPU usage (~250%), with the process in the running STATE most of the time.
  3. Soon after the machine is under load, a mux error is printed: mux ERROR: 2022/03/18 14:18:20 mux: ending readLoop dispatch error packetio.Buffer is full, discarding write
  4. After the error is printed, no further data is transferred. The write buffer is no longer automatically flushed. The process goes to sleep.
  5. In the data-channels-flow-control example, the visible effect is that the reported average throughput keeps dropping indefinitely.

It seems like there may be a race condition regarding this buffer that's more easily triggered under load?

Notably, in internal/mux/mux.go, a comment next to the code that sets up the packetio buffer suggests that the buffer should never become full. https://github.com/pion/webrtc/blob/655daa9689576421c194059dd4f3a05cae544a07/internal/mux/mux.go#L59-L62

Video of repro attached.

https://user-images.githubusercontent.com/87964/159092362-3dae173a-5c4e-4117-99ea-e993b2c47fd8.mp4

iamcalledrob commented 2 years ago

Relatedly, the same issue can be seen in the TestStressDuplex test by increasing the MsgCount (mux_test.go, Line 39)

https://github.com/pion/webrtc/blob/655daa9689576421c194059dd4f3a05cae544a07/internal/mux/mux_test.go#L30-L43

On my machine:

Screen cap:

https://user-images.githubusercontent.com/87964/159099838-c3a1d146-68aa-46bc-bdea-6e6ee02f8656.mp4