google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Checkpointing a running container with active tcp connections #113

Closed by erulabs 5 years ago

erulabs commented 5 years ago

Hello!

Love gvisor, thank you for your work so far! I'm using checkpoints in a situation where I don't want or need TCP connections to be checkpointed. Using runc and criu (https://github.com/checkpoint-restore/criu), I can modify criu so that it simply skips accounting for TCP handles entirely. Unfortunately, with gvisor I get an error from netstack:

checkpoint failed Error: (HTTP code 500) server error - Cannot checkpoint container 4b91dd8f63fecd15fafa25405205ef6442f75a20440fb33753f903a1ce76e7b8: /usr/local/bin/runsc did not terminate sucessfully: checkpoint failed: err checkpointing container "4b91dd8f63fecd15fafa25405205ef6442f75a20440fb33753f903a1ce76e7b8": save rejected due to unsupported networking state: endpoint cannot be saved in connected state: local 172.21.0.15:7676, remote 172.21.0.1:44304: path= /var/run/evaldocker/containerd/daemon/io.containerd.runtime.v1.linux/moby/4b91dd8f63fecd15fafa25405205ef6442f75a20440fb33753f903a1ce76e7b8/criu-dump.log: unknown

I'll work on a smaller reproduction, and this certainly isn't a bug. I've been reading through the code at https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/endpoint_state.go and wondering about CapabilitySaveRestore or CapabilityDisconnectOk, or maybe modifying the checkpointing to ignore network connection state entirely. In CRIU, my modification is as simple as skipping the TCP snapshot process entirely: https://github.com/erulabs/criu/commit/69db3bec100b637169dfa3490afda7ee5accb366

If this is already possible and I'm missing it, or is something that could be added, I would appreciate any help! Even if it's not something you'd like to expose as an option, perhaps someone with a better handle on the codebase knows where to start. I'd be happy to code it myself. Thank you!

Edit: Is it possible I just need to find out how to make stack.CapabilityDisconnectOk true?

zhaozhongn commented 5 years ago

Hi Seandon,

I assume that by "don't want or need TCP connections to be check-pointed" you mean that any TCP connections active at checkpoint time can be safely dropped in the background. In that case, yes, it should be as simple as making stack.CapabilityDisconnectOk true (while keeping stack.CapabilitySaveRestore false) for all NICs with active TCP connections.
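To make the interplay of the two capabilities concrete, here is a minimal, self-contained sketch of the decision described above. It is not gVisor's actual code: the types, bit values, and the `decideOnSave` helper are hypothetical stand-ins that only mirror the names stack.CapabilitySaveRestore and stack.CapabilityDisconnectOk from this thread.

```go
package main

import "errors"

// LinkEndpointCapabilities is a stand-in bit mask for the capabilities a NIC's
// link endpoint advertises. The bit values here are illustrative only.
type LinkEndpointCapabilities uint

const (
	// CapabilitySaveRestore: connected endpoints on this NIC can be saved as-is.
	CapabilitySaveRestore LinkEndpointCapabilities = 1 << iota
	// CapabilityDisconnectOk: connected endpoints may be dropped at save time.
	CapabilityDisconnectOk
)

// SaveAction is a hypothetical result type describing what happens to a
// connected TCP endpoint when a checkpoint is taken.
type SaveAction int

const (
	SaveConnection SaveAction = iota // state is written to the checkpoint
	DropConnection                   // connection is reset, checkpoint proceeds
	RejectSave                       // checkpoint fails
)

// ErrSaveRejected mirrors the wording of the error quoted in this issue.
var ErrSaveRejected = errors.New("endpoint cannot be saved in connected state")

// decideOnSave sketches the check for a connected endpoint: save it if the NIC
// supports save/restore, otherwise drop it if disconnecting is allowed,
// otherwise reject the whole save.
func decideOnSave(caps LinkEndpointCapabilities) (SaveAction, error) {
	switch {
	case caps&CapabilitySaveRestore != 0:
		return SaveConnection, nil
	case caps&CapabilityDisconnectOk != 0:
		return DropConnection, nil
	default:
		return RejectSave, ErrSaveRejected
	}
}
```

Under these assumptions, calling `decideOnSave(CapabilityDisconnectOk)` yields `DropConnection` with a nil error, which is the behavior asked for in this issue, while `decideOnSave(0)` corresponds to the rejection shown in the error message.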

We do plan to expose these options in the runsc configuration; we just have not gotten around to it yet. (One decision we have not made is whether to expose the options per NIC, as a global setting, or both.) You are more than welcome to code it now :-).

Thanks,

erulabs commented 5 years ago

Hello! Sorry for leaving this open. That did the trick. There is a chance I might get around to exposing that option, but I will open a new PR for that rather than keep this issue open.

Thank you!