robur-coop / albatross

Albatross: orchestrate and manage MirageOS unikernels with Solo5
ISC License

albatrossd did not clear up tap devices on shutdown #37

Closed hannesm closed 3 years ago

hannesm commented 4 years ago

running service albatross_daemon stop killed all the unikernels, but left the tap devices in place. this was with commit 2d26a56c0d21bf9bf195a9dd448cf90f83ea2202. need to investigate (stress test) whether this is still an issue with a more current albatross.

hannesm commented 4 years ago

system log:

Jul 29 12:03:59 <kern.info> beast kernel: tap6: link state changed to DOWN
Jul 29 12:03:59 <daemon.notice> beast daemon[84677]: albatrossd: [WARNING] unikernel [vm: YYY] solo5 exit failure (1)
Jul 29 12:03:59 <daemon.notice> beast daemon[84638]: albatross_stats: [INFO] removing vmid [vm: ZZZ]
Jul 29 12:03:59 <daemon.notice> beast daemon[84638]: albatross_stats: [INFO] removing pid AAAA

the "solo5 exit failure (1)" is from Vmm_core.should_restart, but for some reason the waitpid handler was not executed. the new albatross_log also failed to read entries with version 0x2 /o\, so it is unclear whether the Unikernel_stop log entry was dumped.

hannesm commented 4 years ago

in the albatross log, there are only a few "stopped unikernel YYY" entries. the sleep 1 in https://github.com/hannesm/albatross/blob/1b1164166b409ec4daf6ea87d067031a7bc9973a/daemon/albatrossd.ml#L176-L186 should be revised (with appropriate waiters / inversion of control / eventually Lwt.join)
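
a rough sketch of what "waiters + Lwt.join" could look like, in place of a fixed sleep. this is not the albatross code: the unikernel record, stop_unikernel, destroy_tap and shutdown are hypothetical names made up for illustration, and the timeout is an assumption. it needs the lwt and lwt.unix libraries. the idea is that each unikernel carries a promise that the waitpid handler resolves, and tap cleanup is chained onto that promise instead of hoping one second is enough.

```ocaml
(* Sketch only, not albatross code. Requires the lwt and lwt.unix libraries.
   All names (unikernel, stop_unikernel, destroy_tap, shutdown) are
   hypothetical. *)
open Lwt.Infix

type unikernel = {
  name : string;
  taps : string list;
  exited : unit Lwt.t; (* assumed to be resolved by the waitpid handler *)
}

(* Stand-in for signalling the solo5 process to stop. *)
let stop_unikernel u = Printf.printf "stopping %s\n%!" u.name

(* Stand-in for actually destroying a tap device. *)
let destroy_tap tap = Printf.printf "destroying %s\n%!" tap

(* Ask every unikernel to stop, then wait until each one has exited and
   its tap devices are destroyed, or until [timeout] seconds have passed. *)
let shutdown ?(timeout = 5.0) unikernels =
  List.iter stop_unikernel unikernels;
  (* Cleanup runs only once this unikernel's exit promise resolves. *)
  let wait_and_cleanup u =
    u.exited >|= fun () -> List.iter destroy_tap u.taps
  in
  let all_done =
    Lwt.join (List.map wait_and_cleanup unikernels) >|= fun () -> `All_exited
  in
  let timed_out = Lwt_unix.sleep timeout >|= fun () -> `Timeout in
  (* Whichever promise resolves first decides the result. *)
  Lwt.pick [ all_done; timed_out ]

let () =
  let exited, resolver = Lwt.wait () in
  let u = { name = "demo"; taps = [ "tap6" ]; exited } in
  (* Simulate the waitpid handler firing shortly after the stop request. *)
  Lwt.async (fun () ->
      Lwt_unix.sleep 0.1 >|= fun () -> Lwt.wakeup_later resolver ());
  match Lwt_main.run (shutdown [ u ]) with
  | `All_exited -> print_endline "all tap devices cleaned up"
  | `Timeout -> print_endline "timed out waiting for unikernels"
```

on `Timeout the caller still has to decide what to do with the unikernels that never reported an exit (e.g. destroy their taps anyway and log it); the sketch leaves that open.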