Description
If code running in a container has recently opened and closed a TCP socket, there is no clear way to checkpoint the container: any checkpoint attempt will likely fail because the socket is still in TIME_WAIT state in the kernel.
I'm running more-or-less the following Python code in the container:
import socket

class SnapshotClass:
    def __init__(self):
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def lingering_open_sockets(self):
        remote_ip = "94.140.14.14"  # AdGuard DNS
        self._socket.connect((remote_ip, 80))
        self._socket.close()
        # checkpoint the container here

s = SnapshotClass()
s.lingering_open_sockets()
Doing this results in the following error:
Failed creating function container snapshot: internal runtime error: failed to snapshot container => container_id=ta-01J5EAD3M1T0N41P2RA165X5A4 stderr=checkpoint failed: checkpointing container "ta-01J5EAD3M1T0N41P2RA165X5A4": encoding error: save rejected due to unsupported networking state: endpoint cannot be saved in connected state: local 172.20.0.86:30031, remote 94.140.14.14:80:
We can avoid the error if we wait 60 seconds before checkpointing, which is the default time sockets spend in TIME_WAIT after closing; but that's a long time to wait. I've tried setting sysctls in the config.json given to runsc to move sockets to the CLOSED state quickly (e.g. net.ipv4.tcp_fin_timeout: "1"), but those settings don't seem to be respected inside the container. (I can file that bug separately.) From outside the container, we can close sockets in the Linux kernel, but not in Netstack.
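As a stopgap, instead of a fixed 60-second sleep we can poll /proc/net/tcp from inside the container and checkpoint as soon as no sockets remain in TIME_WAIT (state code 06). This is only a sketch, and it assumes Netstack's procfs reports socket states in the same column format as Linux:

```python
import time

TIME_WAIT = "06"  # state code for TIME_WAIT in /proc/net/tcp

def wait_for_no_time_wait(timeout: float = 90.0, poll: float = 0.5) -> bool:
    """Poll /proc/net/tcp until no socket is left in TIME_WAIT.

    Returns True once the table is clear, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open("/proc/net/tcp") as f:
            rows = f.readlines()[1:]  # skip the header row
        # The fourth whitespace-separated field ("st") is the TCP state.
        states = [row.split()[3] for row in rows if row.strip()]
        if TIME_WAIT not in states:
            return True
        time.sleep(poll)
    return False
```

This still spins for up to the full TIME_WAIT period in the worst case, so it only shortens the common case rather than fixing the underlying problem.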
Can I ask for some guidance on how to deal with this situation properly?
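For completeness, the only per-socket workaround I've found so far is SO_LINGER with a zero timeout, which makes close() abort the connection with an RST instead of a FIN, so the socket never enters TIME_WAIT. This changes the close semantics (the peer sees a reset and unsent data may be dropped), so I don't think it's a general answer, but sketching it here in case it's useful:

```python
import socket
import struct

def close_with_rst(sock: socket.socket) -> None:
    # l_onoff=1, l_linger=0: close() aborts the connection with an RST,
    # so the socket skips TIME_WAIT entirely.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()
```

Whether Netstack honors SO_LINGER the same way the Linux kernel does is something I haven't verified.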
Steps to reproduce
No response
runsc version
runsc version 40a09da5a1ab
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux 5.15.0-101.103.2.1.el9uek.x86_64 #2 SMP Tue May 2 01:10:45 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response