pion / webrtc

Pure Go implementation of the WebRTC API
https://pion.ly
MIT License
13.68k stars 1.65k forks source link

`peerConnection.Close()` hangs forever (deadlock?) while handling a callback from Pion #2404

Closed daniel-abramov closed 5 months ago

daniel-abramov commented 1 year ago

What did you do?

A little bit of a context: I'm working on the SFU, currently, we have the following rough architecture:

This approach worked well when we used buffered channels with quite a large buffer, but occasionally the conference appeared to be frozen. I replaced the buffered channel with an unbounded channel (i.e. the sender blocks until the receiver reads the message). Right after that, I observed that quite often the conference gets frozen by indefinitely waiting inside a peerConnection.Close() method.

After some investigations, I've noticed that it primarily happens when shortly before calling the Close() function, we receive a callback for OnICECandidate. Close() hangs forever and never returns.

Here is the callback:

runtime.gopark (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/proc.go:364)
runtime.chanrecv (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/chan.go:583)
runtime.chanrecv1 (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/chan.go:442)
github.com/pion/ice/v2.(*Agent).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/ice/v2@v2.2.3/agent.go:920)
github.com/pion/webrtc/v3.(*ICEGatherer).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/icegatherer.go:185)
github.com/pion/webrtc/v3.(*ICETransport).Stop (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/icetransport.go:198)
github.com/pion/webrtc/v3.(*PeerConnection).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/peerconnection.go:1982)

It hangs here:

https://github.com/pion/ice/blob/4dea7246b63bd6230dcb40b14c77679a10844e62/agent.go#L927

It seems like there is some race or a deadlock in Pion that acquires certain resources when calling the OnICECandidate callback which is then expected to be released by the moment the callback finishes. I.e. essentially the workaround that we found is to never block or wait for any I/O inside the OnICECandidate callback, but we're a bit concerned that there might be more places where such a thing can occur. It seems like sometimes it hangs for no apparent reason.

What did you expect?

I expected that Close() does not hang forever. Ideally, the behavior for the callbacks should be unified and predictable unless there is a reason to have different requirements for the callbacks that we pass to Pion (I think it probably would be a good idea to document them), i.e. if the callbacks are not allowed to block of it it's not allowed to access the peer connection from the callback closure, it probably needs to be documented. This happens not every time which gives a hint that there must be a race somewhere inside Pion that leads to a deadlock.

What happened?

peerConnection.Close() hung forever.

edaniels commented 1 year ago

We have a similar issue and workaround in our rpc library https://github.com/viamrobotics/goutils/blob/2d9c3aeea721c2d61835e5b94db63dfff4c3ece9/rpc/wrtc_client.go#L231

jech commented 1 year ago

I'm not sure whether this is a bug. The callbacks are called synchronously, if they need to block, then you should make them asynchronous yourself. This is the usual idiom in Go, where making a synchronous function asynchronous is easy (just add the go keyword), but making an asynchronous invocation synchronous is difficult (requires either a channel or a workgroup).

Sean-Der commented 5 months ago

Thanks for filing this @daniel-abramov I have had users ask about this/be confused. I think this is best for users

I am going to merge this into #2405