liquidmetal-dev / flintlock

Lock, Stock, and Two Smoking MicroVMs. Create and manage the lifecycle of MicroVMs backed by containerd.
https://flintlock.liquidmetal.dev/
Mozilla Public License 2.0

VM Supervisor #198

Open jmickey opened 3 years ago

jmickey commented 3 years ago

The Supervisor will be responsible for monitoring running MicroVMs and reacting to changes that drift from the desired state.

Why do we need this?

We don't currently monitor the state of running VMs continuously. If a VM drifts from its desired state - e.g. a VM crashes and is in a failed state - we have to wait until the next time the reconciler runs for the VM to be recreated/restarted.

Additionally, we don't currently track whether a VM is failing repeatedly. Flintlock will continue to recreate the VM every time a resync occurs. The reconciler doesn't know if a VM has already been started; as far as it is concerned, it only cares about reconciling the existing state to the desired state.

What do we need?

How this looks at an implementation level is not yet known, and it's likely that one or more ADRs will need to be produced as a result.
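
For illustration only, below is a rough sketch of the kind of monitoring loop a supervisor could run: poll the observed state of each MicroVM and raise an event whenever it drifts from the desired state. All of the type and function names here are hypothetical placeholders, not existing flintlock APIs.

```go
// Hypothetical sketch of a supervisor loop: poll running MicroVMs and
// raise an event when the observed state drifts from the desired state.
// None of these types exist in flintlock today; they are placeholders.
package supervisor

import (
	"context"
	"time"
)

type VMState string

const (
	StateRunning VMState = "running"
	StateFailed  VMState = "failed"
)

// Observer reports the actual state of a MicroVM (e.g. by checking the
// firecracker process); Eventer is where drift events are raised.
type Observer interface {
	ObservedState(ctx context.Context, vmID string) (VMState, error)
}

type Eventer interface {
	RaiseDrift(ctx context.Context, vmID string, want, got VMState)
}

// Watch periodically compares desired vs observed state for a set of VMs.
func Watch(ctx context.Context, interval time.Duration, desired map[string]VMState, obs Observer, ev Eventer) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for vmID, want := range desired {
				got, err := obs.ObservedState(ctx, vmID)
				if err != nil || got != want {
					// The supervisor only observes and raises events;
					// it never mutates the spec itself.
					ev.RaiseDrift(ctx, vmID, want, got)
				}
			}
		}
	}
}
```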

Subtasks

richardcase commented 3 years ago

I don't think the supervisor itself will store events... just raise them. And it probably shouldn't make any modifications to the microvm spec... so it's a read-only consumer of the specs. wdyt?

jmickey commented 3 years ago

@richardcase The usage of Events here might be a little overloaded. By "Events" I'm referring more to a kind of running log, I guess? Similar to the events that are shown when you kubectl describe a resource. An "event" in this case might be that the supervisor has detected that a previously running VM is no longer running, if that makes sense?

Maybe that is actually part of the reconciler, wdyt?

I don't think the supervisor should make changes to the spec, but maybe the status? Again, maybe I didn't think it through enough and it actually belongs in the reconciler.
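
To make the "running log" idea concrete, an event record in that sense might look something like the sketch below. These names are purely illustrative, not existing flintlock types; the point is that the supervisor only appends events and never touches the spec.

```go
// Hypothetical event record, similar in spirit to the events shown by
// `kubectl describe`: an append-only log the supervisor can write to
// without modifying the MicroVM spec.
package events

import "time"

type Event struct {
	VMID      string    // which MicroVM the event relates to
	Reason    string    // e.g. "VMNotRunning"
	Message   string    // human-readable detail
	Timestamp time.Time // when the supervisor observed it
}

// Recorder is the only write surface the supervisor would need; the
// spec (and possibly the status) stays owned by the reconciler.
type Recorder interface {
	Record(e Event)
}
```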

jmickey commented 3 years ago

Actually, maybe you're right. It should probably be the reconciler that updates whether a VM has been started, how many times it's been restarted, etc. Then it can also control the back-off if we choose to implement one in the future.

e.g. It can mark the VM as CrashLoopBackOff (I couldn't think of a better name so I just went with the Kubernetes vernacular) and create a gradually increasing ticker to retry?
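
As a purely illustrative sketch, a gradually increasing back-off for restarts could look like this; the base delay, cap, and doubling strategy are arbitrary assumptions rather than an agreed design.

```go
// Hypothetical exponential back-off for restart attempts: double the
// delay on each consecutive failure, capped at a maximum.
package reconcile

import "time"

const (
	baseDelay = 2 * time.Second
	maxDelay  = 5 * time.Minute
)

// RestartDelay returns how long to wait before the next restart attempt,
// given how many consecutive times the VM has failed so far.
func RestartDelay(consecutiveFailures int) time.Duration {
	delay := baseDelay
	for i := 1; i < consecutiveFailures; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}
```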

richardcase commented 3 years ago

As part of the implementation, we need to revisit the sleep introduced in #255... and hopefully remove the need for it.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity.

richardcase commented 2 years ago

This is still required

Callisto13 commented 2 years ago

Would be good to have this soon. I just started flintlock on a machine which I apparently did not clean up and have rebooted a couple of times since I last did LM, and flintlock is like "woah look how many mvms I have" and I am like "bruh, there are no firecracker processes running".

richardcase commented 2 years ago

We should also look at this again: https://github.com/asynkron/protoactor-go
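
For reference, a minimal protoactor-go example is shown below: one actor that reacts to a (hypothetical) state-change message. This only demonstrates the library's basic actor pattern, not a proposed flintlock design.

```go
// Minimal protoactor-go sketch: one actor per MicroVM reacting to
// state-change messages. The vmStateChanged message type is hypothetical.
package main

import (
	"fmt"
	"time"

	"github.com/asynkron/protoactor-go/actor"
)

// vmStateChanged is a hypothetical message a supervisor could send.
type vmStateChanged struct {
	VMID  string
	State string
}

type vmActor struct{}

func (a *vmActor) Receive(ctx actor.Context) {
	switch msg := ctx.Message().(type) {
	case *actor.Started:
		fmt.Println("vm actor started")
	case *vmStateChanged:
		fmt.Printf("vm %s is now %s\n", msg.VMID, msg.State)
	}
}

func main() {
	system := actor.NewActorSystem()
	props := actor.PropsFromProducer(func() actor.Actor { return &vmActor{} })
	pid := system.Root.Spawn(props)
	system.Root.Send(pid, &vmStateChanged{VMID: "vm-1", State: "failed"})

	// Give the actor time to process the message before exiting (demo only).
	time.Sleep(time.Second)
}
```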

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 365 days with no activity.