Open · jech opened this issue 3 years ago
@jech Have you tried increasing the OS UDP buffer?
No. My current hypothesis is that it's the same issue as https://github.com/pion/webrtc/issues/1356, which is apparently due to having multiple local addresses on a single local socket; that's going to happen on dual-stack hosts, as well as on multihomed hosts.
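As a rough illustration (not taken from the linked issue itself): a single socket ends up juggling multiple address families whenever it is bound to the dual-stack wildcard, and IPv4 peers then surface in the IPv4-mapped IPv6 form. A minimal Go sketch of that behavior, using the placeholder address 192.0.2.1:

package main

import (
	"bytes"
	"fmt"
	"net"
)

func main() {
	// Binding to the IPv6 wildcard on a dual-stack host yields a single
	// socket that receives both IPv4 and IPv6 traffic.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv6unspecified})
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("bound to:", conn.LocalAddr()) // [::]:<port>

	// On such a socket, IPv4 peers are reported in the 16-byte
	// IPv4-mapped IPv6 form, so byte-wise comparison against a plain
	// 4-byte IPv4 address fails until both sides are normalized.
	fromSocket := net.ParseIP("::ffff:192.0.2.1") // 16-byte mapped form
	advertised := net.ParseIP("192.0.2.1").To4()  // 4-byte form
	fmt.Println(bytes.Equal(fromSocket, advertised))       // false
	fmt.Println(bytes.Equal(fromSocket.To4(), advertised)) // true
}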
@jech We are using this (and TCPMux) with LiveKit and only see packet loss when the UDP buffer gets overwhelmed (increasing it gets rid of the loss for us).
https://github.com/livekit/livekit-server/blob/master/pkg/rtc/config.go#L70
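In Go that bump is a single call on the muxed socket. A minimal sketch, with the 4 MiB size and port 7882 picked arbitrarily (on Linux the effective size is additionally capped by the net.core.rmem_max sysctl):

package main

import (
	"log"
	"net"
)

func main() {
	// One UDP socket shared by every peer connection, as under UDPMux.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 7882})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Enlarge the kernel receive buffer so bursts of media traffic
	// aren't dropped before the application gets a chance to read them.
	if err := conn.SetReadBuffer(4 * 1024 * 1024); err != nil {
		log.Fatal(err)
	}
}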
Do you want to see if you can repro with LiveKit? I'm wondering if there's something unique to your machine's networking stack.
You can start it with Docker, using UDPMux:
docker run --rm \
-p 7880:7880 \
-p 7881:7881 \
-p 7882:7882/udp \
-e LIVEKIT_KEYS="<key>: <secret>" \
livekit/livekit-server \
--dev \
--node-ip=<machine-ip>
Are your machines dual-stack?
What is considered dual-stack? Having both IPv4/IPv6?
What is considered dual-stack? Having both IPv4/IPv6?
Yes.
With LiveKit we are using UDP4 with the mux, and that could explain the difference. The challenge with dual-stack is ensuring that what's advertised matches the destination address that peers actually send to. I remember seeing some oddities along those lines.
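For reference, a minimal sketch of a UDP4-only mux setup with pion (assuming pion/webrtc v3 and pion/ice v2; the port is arbitrary, and LiveKit's actual wiring lives in the config.go linked above):

package main

import (
	"log"
	"net"

	"github.com/pion/ice/v2"
	"github.com/pion/webrtc/v3"
)

func newUDP4API() (*webrtc.API, error) {
	// Bind explicitly with "udp4" rather than the dual-stack wildcard,
	// so the muxed socket only ever carries IPv4 and the advertised
	// candidate trivially matches what peers send to.
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 7882})
	if err != nil {
		return nil, err
	}

	se := webrtc.SettingEngine{}
	// Share the single socket across all peer connections.
	se.SetICEUDPMux(ice.NewUDPMuxDefault(ice.UDPMuxParams{UDPConn: conn}))
	// Gather IPv4 UDP candidates only.
	se.SetNetworkTypes([]webrtc.NetworkType{webrtc.NetworkTypeUDP4})

	return webrtc.NewAPI(webrtc.WithSettingEngine(se)), nil
}

func main() {
	if _, err := newUDP4API(); err != nil {
		log.Fatal(err)
	}
}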
I don't know if the issue is the same as https://github.com/pion/webrtc/issues/1356 (which has higher priority for me), but that issue goes away when I disable IPv6 (see https://github.com/pion/webrtc/issues/1356#issuecomment-894376345). Disabling IPv6 is of course not an option (IPv6 is great for WebRTC: IPv6 gives you a peer-reflexive candidate straight away, without the need to contact a STUN server, which noticeably reduces the connection establishment delay).
Disabling IPv6 is of course not an option
I agree having IPv6 is nice, but I would question whether it's a must-have. Is the slight connection-speed improvement worth not having ICE/TCP? That is the decision today.
Of course, it'd be ideal to fix the underlying issue.
The workaround is not a simple matter of disabling IPv6 for ICE/TCP; it requires disabling IPv6 globally on the host. This means that you'll run into trouble as soon as somebody runs your code on a modern server.
What's more, the issue indicates that the code is buggy. Until the bug is understood and properly fixed, there's no telling when it will bite you. Most probably during an important demo ;-)
It'll only cause an issue on servers that don't support IPv4, and we have not gotten any feedback about this. AFAIK, all major cloud vendors run their machines dual-stack.
But I digress; let's just fix the underlying issue.
Let's move this issue to pion/ice.
Testing UDPMux in Galene, I'm seeing absolutely massive packet loss on a local network, on the order of 50-70%.
The code is here: https://github.com/jech/galene/commit/b80e515eb04a8326336524ea80ecf711a3013293
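As a side note, raw UDP loss can be sanity-checked outside of WebRTC with a loopback blast test. A rough, self-contained sketch (not from the Galene commit above; packet count and size are arbitrary). Note that a burst like this can itself overflow a default-sized receive buffer, which is exactly the failure mode discussed earlier in the thread:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const count = 10000

	// Receiver on loopback; the kernel picks a free port.
	recv, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		panic(err)
	}
	defer recv.Close()

	done := make(chan int)
	go func() {
		// Stop reading two seconds after the blast starts; anything
		// still missing by then was dropped.
		_ = recv.SetReadDeadline(time.Now().Add(2 * time.Second))
		buf := make([]byte, 1500)
		received := 0
		for {
			if _, _, err := recv.ReadFromUDP(buf); err != nil {
				break // deadline reached
			}
			received++
		}
		done <- received
	}()

	// Blast roughly RTP-sized datagrams as fast as the loop allows.
	send, err := net.DialUDP("udp4", nil, recv.LocalAddr().(*net.UDPAddr))
	if err != nil {
		panic(err)
	}
	defer send.Close()
	payload := make([]byte, 1200)
	for i := 0; i < count; i++ {
		_, _ = send.Write(payload)
	}

	received := <-done
	fmt.Printf("sent %d, received %d, loss %.1f%%\n",
		count, received, 100*float64(count-received)/float64(count))
}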