kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0
5.08k stars 1k forks source link

runtime: avoid panic on metrics gathering #9827

Closed littlejawa closed 1 week ago

littlejawa commented 2 weeks ago

While running with a remote hypervisor, whenever kata-monitor tries to access metrics from the shim, the shim does a "panic" and no metric can be gathered.

The function GetVirtioFsPid() is called on metrics gathering, and had a call to "panic()". Since there is no virtiofs process for remote hypervisor, the right implementation is to return nil. The caller expects that, and will skip metrics gathering for virtiofs.

Fixes: #9826

wainersm commented 1 week ago

Hi @littlejawa !

We don't cover remote hypervisor on Kata tests (possible metrics are not even tests on peer pods, @stevenhorsman), so we didn't catch that before . Do you know what other component(s) might this GetVirtioFsPid() ? I'm just wondering what else might be broken.

littlejawa commented 1 week ago

Hi @littlejawa !

We don't cover remote hypervisor on Kata tests (possible metrics are not even tests on peer pods, @stevenhorsman), so we didn't catch that before . Do you know what other component(s) might this GetVirtioFsPid() ? I'm just wondering what else might be broken.

Hey @wainersm,

I've looked through the code, and the only place where this function is called is here

All other references to it are either implementation of the function by hypervisor management code, or test code. So we should be safe now.

Some additional notes:

wainersm commented 1 week ago

Hi @littlejawa ! We don't cover remote hypervisor on Kata tests (possible metrics are not even tests on peer pods, @stevenhorsman), so we didn't catch that before . Do you know what other component(s) might this GetVirtioFsPid() ? I'm just wondering what else might be broken.

Hey @wainersm,

I've looked through the code, and the only place where this function is called is here

All other references to it are either implementation of the function by hypervisor management code, or test code. So we should be safe now.

Some additional notes:

* There are other "panic()" calls in the remote.go file, in the "fromGrpc()" and "toGrpc()" functions. It seems to be called when the "factory" code is exercised.
  In any case, if it was exercised, we would get a panic on VM creation, so I think it would be obvious.

* There is also a bunch of "notimplemented()" error returns in the remote.go file, but they are on features that we actually don't support (yet?) with remote runtimes, so returning an error is fine.

Glad to hear that we are good now. Thanks @littlejawa !