pion / ice

A Go implementation of ICE
https://pion.ly/
MIT License

UDPMux causes massive packet loss #518

Open jech opened 3 years ago

jech commented 3 years ago

Testing UDPMux in Galene, I'm seeing absolutely massive packet loss on a local network, on the order of 50-70%.

The code is here: https://github.com/jech/galene/commit/b80e515eb04a8326336524ea80ecf711a3013293

OrlandoCo commented 3 years ago

@jech Have you tried increasing the OS UDP buffer?

jech commented 3 years ago

No. My current hypothesis is that it's the same issue as https://github.com/pion/webrtc/issues/1356, which is apparently due to having multiple local addresses on a single local socket; that's going to happen on double-stack hosts, as well as on multihomed hosts.
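If that hypothesis is right, one way to sidestep the problem (a sketch of the idea, not pion's actual fix) is to bind one socket per concrete local address instead of a single wildcard socket. With a wildcard socket on a multihomed or dual-stack host, the kernel may pick a source address for replies that differs from the address the peer sent to, and the peer then discards the reply. `listenPerAddress` below is a hypothetical helper illustrating this:

```go
package main

import (
	"fmt"
	"net"
)

// listenPerAddress opens one UDP socket per local unicast address rather
// than a single wildcard socket, pinning each socket's source address so
// replies always leave from the address the peer contacted.
// Hypothetical sketch; not pion's implementation.
func listenPerAddress(port int) ([]*net.UDPConn, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	var conns []*net.UDPConn
	for _, a := range addrs {
		ipNet, ok := a.(*net.IPNet)
		if !ok || ipNet.IP.IsLinkLocalUnicast() {
			continue
		}
		c, err := net.ListenUDP("udp", &net.UDPAddr{IP: ipNet.IP, Port: port})
		if err != nil {
			continue // address may not be bindable right now; skip it
		}
		conns = append(conns, c)
	}
	return conns, nil
}

func main() {
	// Port 0 (any free port) just for demonstration; a real mux would
	// bind every socket to the same fixed port.
	conns, err := listenPerAddress(0)
	if err != nil {
		panic(err)
	}
	for _, c := range conns {
		fmt.Println("listening on", c.LocalAddr())
		c.Close()
	}
}
```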

davidzhao commented 3 years ago

@jech we are using this (and TCPMux) with LiveKit and only see packet loss when the UDP buffer gets overwhelmed (increasing it gets rid of the loss for us).

https://github.com/livekit/livekit-server/blob/master/pkg/rtc/config.go#L70

Do you want to see if you can repro with LiveKit? I'm wondering if there's something unique to your machine's networking stack.

You can start it with Docker, using UDPMux:

docker run --rm \
  -p 7880:7880 \
  -p 7881:7881 \
  -p 7882:7882/udp \
  -e LIVEKIT_KEYS="<key>: <secret>" \
  livekit/livekit-server \
  --dev \
  --node-ip=<machine-ip>

jech commented 3 years ago

Are your machines double-stack?

davidzhao commented 3 years ago

what is considered double-stack? having both ipv4/6?

jech commented 3 years ago

what is considered double-stack? having both ipv4/6?

Yes.

davidzhao commented 3 years ago

With livekit we are using UDP4 with the mux, and that could explain the difference. The challenge with dual-stack is ensuring that what's advertised matches the dest addr that we send to. I remember seeing some oddities along the lines of:

jech commented 3 years ago

I don't know if the issue is the same as https://github.com/pion/webrtc/issues/1356 (which has higher priority for me), but that issue goes away when I disable IPv6 (see https://github.com/pion/webrtc/issues/1356#issuecomment-894376345). Disabling IPv6 is of course not an option: IPv6 is great for WebRTC, since it gives you a peer-reflexive candidate straight away, without the need to contact a STUN server, which noticeably reduces the connection establishment delay.

davidzhao commented 3 years ago

Disabling IPv6 is of course not an option

I agree having IPv6 is nice, but I would question whether it's a must-have. Is the slight connection-speed improvement worth not having ICE/TCP? That's the decision today.

ofc it'd be ideal to fix the underlying issue.

jech commented 3 years ago

The workaround is not a simple matter of disabling IPv6 for ICE/TCP: it requires disabling IPv6 globally on the host. This means that you'll run into trouble as soon as somebody runs your code on a modern server.

What's more, the issue indicates that the code is buggy. Until the bug is understood and properly fixed, there's no telling when the code will bite you. Most probably during an important demo ;-)

davidzhao commented 3 years ago

It'll only cause an issue on servers that don't support IPv4, and we have not gotten any feedback about this. AFAIK, all major cloud vendors run their machines with dual stack.

But I digress; let's just fix the underlying issue.

stv0g commented 1 year ago

Let's move this issue to pion/ice.