kontainapp / km

Kontain Monitor
http://kontain.app
Apache License 2.0

snapshot recovery fails when one of eventfd's listeners is in accept state #1661

Open sv641 opened 2 years ago

sv641 commented 2 years ago

I took a snapshot of the Go server provided by the knative examples. The resulting snapshot fails to start with this error:

[h1@kubernetes km]$ ./build/km/km ../snap/kmsnap 
18:47:10.204619 km_ss_recover_km_mon 401  km      label: /server 1 /tmp/kmmgmt.pipe-1
18:47:10.204755 km_ss_recover_km_mon 404  km      description: snapshot date Thu Aug 18 18:38:07 2022

18:47:10.208519 km_fs_recover_eventf 2966 km      monitored fd=5 does not exist
18:47:10.208529 km_fs_recover        2995 km      recover open files failed
[h1@kubernetes km]$ 

sv641 commented 2 years ago

Debugging the snapshot recovery further, with lots of help from @paul, we found that when a socket is in the accepted state (KM_FILE_HOW_ACCEPT), recovery takes this path:

   if (nt_sock->how == KM_FILE_HOW_ACCEPT) {
      return km_fs_recover_socket_accepted(nt_sock);
   }

   static int km_fs_recover_socket_accepted(km_nt_socket_t* nt_sock)
   {
      return 0;
   }

Because km_fs_recover_socket_accepted() does nothing, no in-memory filesys entry is made for the accepted socket, and this causes eventfd recovery to bail out.
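The failure mode described above can be illustrated with a small stand-alone model. This is only a sketch, not km code: `guest_file_t`, `recover_socket`, `recover_event`, and the `skip_accepted` switch are hypothetical names invented for illustration, and it assumes fd 5 is the accepted socket. It shows how a recovery step that reports success without populating the fd table makes a later lookup of the monitored fd fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_FDS 16

/* Hypothetical, heavily simplified stand-in for a per-fd guest file entry. */
typedef struct {
   bool inuse;
} guest_file_t;

/* Hypothetical subset of the ways a socket fd can have been created. */
typedef enum { HOW_LISTEN, HOW_ACCEPT } sock_how_t;

static guest_file_t guest_files[MAX_FDS];

static void reset_table(void)
{
   memset(guest_files, 0, sizeof(guest_files));
}

/*
 * Recover one socket. With skip_accepted set, accepted sockets take the
 * early-return path: success is reported but no table entry is made.
 */
static int recover_socket(int fd, sock_how_t how, bool skip_accepted)
{
   assert(fd >= 0 && fd < MAX_FDS);
   if (how == HOW_ACCEPT && skip_accepted) {
      return 0;   /* guest_files[fd] is left untouched */
   }
   guest_files[fd].inuse = true;
   return 0;
}

/* Recover one event registration: the monitored fd must already exist. */
static int recover_event(int monitored_fd)
{
   assert(monitored_fd >= 0 && monitored_fd < MAX_FDS);
   if (!guest_files[monitored_fd].inuse) {
      fprintf(stderr, "monitored fd=%d does not exist\n", monitored_fd);
      return -1;
   }
   return 0;
}
```

In this model, calling `recover_socket(5, HOW_ACCEPT, true)` followed by `recover_event(5)` fails with the same kind of message as the log above, while passing `false` for `skip_accepted` lets the event recovery succeed.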

sv641 commented 2 years ago

fd state of the process being snapshotted:

lrwx------ 1 root root 64 Aug 18 18:35 0 -> /dev/null
l-wx------ 1 root root 64 Aug 18 18:35 1 -> pipe:[110166]
l-wx------ 1 root root 64 Aug 18 18:35 2 -> pipe:[110167]
lrwx------ 1 root root 64 Aug 18 18:35 3 -> socket:[109236]
lrwx------ 1 root root 64 Aug 18 18:35 5 -> socket:[109260]
lrwx------ 1 root root 64 Aug 18 18:35 6 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 Aug 18 18:38 729 -> socket:[114797]
l-wx------ 1 root root 64 Aug 18 23:05 731 -> /tmp/km_1.log
lrwx------ 1 root root 64 Aug 18 23:05 732 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Aug 18 23:05 733 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Aug 18 23:05 734 -> /dev/kvm
lrwx------ 1 root root 64 Aug 18 23:05 735 -> anon_inode:kvm-vm
lrwx------ 1 root root 64 Aug 18 23:05 736 -> anon_inode:kvm-vcpu:0
lrwx------ 1 root root 64 Aug 18 23:05 737 -> anon_inode:kvm-vcpu:1
lrwx------ 1 root root 64 Aug 18 23:05 738 -> anon_inode:kvm-vcpu:2
lrwx------ 1 root root 64 Aug 18 23:05 739 -> anon_inode:kvm-vcpu:3
lrwx------ 1 root root 64 Aug 18 23:05 740 -> anon_inode:kvm-vcpu:4
lrwx------ 1 root root 64 Aug 19 01:59 741 -> anon_inode:kvm-vcpu:5
lrwx------ 1 root root 64 Aug 19 01:59 742 -> anon_inode:kvm-vcpu:6
lrwx------ 1 root root 64 Aug 19 16:22 743 -> anon_inode:kvm-vcpu:7
lrwx------ 1 root root 64 Aug 19 16:22 744 -> anon_inode:kvm-vcpu:8

(gdb) p/x *$14
$17 = {nfdmap = 0x2d7, guest_files = 0x7ffff8000920}
(gdb) p/x $17->guest_files[3]
$18 = {inuse = 0x1, how = 0x4, flags = 0x0, error = 0x0, ops = 0x0, ofd = 0xffffffff, name = 0x7ffff801bd60, sockinfo = 0x7ffff8014c30, events = {tqh_first = 0x0, tqh_last = 0x7ffff8000a10}}
(gdb) p/x $17->guest_files[5]
$19 = {inuse = 0x0, how = 0x0, flags = 0x0, error = 0x0, ops = 0x0, ofd = 0x0, name = 0x0, sockinfo = 0x0, events = {tqh_first = 0x0, tqh_last = 0x0}}
(gdb) b 2963
Breakpoint 2 at 0x7ffff7e8d9f2: file km/km_filesys.c, line 2963.
(gdb) c
Continuing.

Breakpoint 2, km_fs_recover_eventfd (ptr=0x7ffff80222a0 "\024", length=) at km/km_filesys.c:2964
2964 int host_efd = km_fs_g2h_fd(nt_event->fd, NULL);
(gdb) p/x *nt_event
$20 = {fd = 0x5, event = 0x80002005, data = 0x7ffe235e38}
(gdb) p/x $17->guest_files[5]
$21 = {inuse = 0x0, how = 0x0, flags = 0x0, error = 0x0, ops = 0x0, ofd = 0x0, name = 0x0, sockinfo = 0x0, events = {tqh_first = 0x0, tqh_last = 0x0}}

paulpopelka commented 2 years ago

In function km_fs_recover_open_socket(), this block of code needs to be removed:

   if (nt_sock->how == KM_FILE_HOW_ACCEPT) {
      return km_fs_recover_socket_accepted(nt_sock);
   }

I think the above change will allow an accepted socket to be recovered in such a way that the first recv or send operation on it returns ECONNRESET, which will cause the payload to abandon the connection and resume listening for a new one.

paulpopelka commented 2 years ago

The bats snapshot_test.c needs to test listening and connected sockets to be sure they are recovered properly too.

sv641 commented 2 years ago

guest is a client and we are expecting the server to keep the session alive?

paulpopelka commented 2 years ago

guest is a client and we are expecting the server to keep the session alive?

The connection will be lost when the snapshot is recovered. When I/O is attempted on the fd for the lost connection, km will cause ECONNRESET to be returned to the payload. We assume the payload can handle this by cleaning up whatever it was doing on the connection, then reconnecting and retrying.
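What "handling ECONNRESET" might look like on the payload side can be sketched in plain C. This is an illustrative sketch under stated assumptions, not code from km or from the knative example: `io_action_t`, `classify_recv`, and `drain_connection` are hypothetical names, the socket is assumed to be blocking, and only standard POSIX socket calls are used. The idea is that any hard error from recv, including ECONNRESET after a snapshot recovery, makes the payload close the fd and report failure so the caller can go back to accept() (server) or reconnect (client).

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical classification of a recv() result. */
typedef enum { IO_OK, IO_RETRY, IO_PEER_CLOSED, IO_ABANDON } io_action_t;

/* Decide what to do with a recv() return value n and the saved errno. */
static io_action_t classify_recv(ssize_t n, int err)
{
   if (n > 0) {
      return IO_OK;
   }
   if (n == 0) {
      return IO_PEER_CLOSED;   /* orderly shutdown by the peer */
   }
   if (err == EINTR) {
      return IO_RETRY;         /* interrupted by a signal, try again */
   }
   return IO_ABANDON;          /* ECONNRESET and every other hard error */
}

/*
 * Read from a blocking connected socket until the connection ends.
 * Closes fd before returning. Returns 0 on orderly close, -1 if the
 * connection had to be abandoned (for example on ECONNRESET).
 */
static int drain_connection(int fd)
{
   char buf[512];

   assert(fd >= 0);
   for (;;) {
      ssize_t n = recv(fd, buf, sizeof(buf), 0);
      switch (classify_recv(n, n < 0 ? errno : 0)) {
         case IO_OK:
            break;              /* a real payload would process buf here */
         case IO_RETRY:
            break;
         case IO_PEER_CLOSED:
            close(fd);
            return 0;
         case IO_ABANDON:
            close(fd);
            return -1;
      }
   }
}
```

A server loop would call `drain_connection()` on each accepted fd and simply return to accept() whatever the result; a client would treat -1 as a signal to reconnect and retry its request.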