google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Validate route on a TCP retransmit #4687

Closed hbhasker closed 3 years ago

hbhasker commented 3 years ago

Description: TCP sockets today don't validate the route when retransmission timeouts occur. This is a problem when, say, the route disappears or a NIC is removed from the system. In such cases Netstack TCP sockets keep trying to retransmit even though the route is probably unusable, because either the NIC or the address on the underlying NIC has been removed.

WritePacket() will correctly fail all writes (https://cs.opensource.google/gvisor/gvisor/+/master:pkg/tcpip/stack/route.go;drc=1f0f687cbe49c4af272abc47d5d974e86fef6c01;l=206), but today TCP ignores these errors and doesn't react to them. Also, EINVAL is probably the incorrect error to return.

We should either return EHOSTUNREACH here or, on a retransmit, validate the route before using it, similar to how Linux does (see: https://github.com/torvalds/linux/blob/9ff9b0d392ea08090cd1780fb196f36dbb586529/net/ipv4/af_inet.c#L1276).

In fact, Linux even attempts to find a new route to the destination and only fails if there are no valid routes left.

This method is assigned to tcp->rebuild_header here: https://github.com/torvalds/linux/blob/9ff9b0d392ea08090cd1780fb196f36dbb586529/net/ipv4/tcp_ipv4.c#L2139

It is then called on the retransmit_skb path here: https://github.com/torvalds/linux/blob/9ff9b0d392ea08090cd1780fb196f36dbb586529/net/ipv4/tcp_output.c#L3163

Is this feature related to a specific bug? TCP sockets can live for a long time even after the NIC or route is no longer valid.

Do you have a specific solution in mind? See above.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment or this will be closed in 30 days.