kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0
5.09k stars 1.01k forks source link

Agent CPU onlining code does not retry in case of error #3809

Open c3d opened 2 years ago

c3d commented 2 years ago

The agent's code right now onlines CPUs as follows:

fn online_cpus(logger: &Logger, num: i32) -> Result<i32> {
    let mut onlined_count: i32 = 0;

    for i in 0..ONLINE_CPUMEM_MAX_RETRIES {
        let r = online_resources(
            logger,
            SYSFS_CPU_ONLINE_PATH,
            r"cpu[0-9]+",
            num - onlined_count,
        );

        onlined_count += r?;
        if onlined_count == num {
            info!(logger, "online {} CPU(s) after {} retries", num, i);
            return Ok(num);
        }
        thread::sleep(time::Duration::from_millis(ONLINE_CPUMEM_WATI_MILLIS));
    }

    Err(anyhow!(
        "failed to online {} CPU(s) after {} retries",
        num,
        ONLINE_CPUMEM_MAX_RETRIES
    ))
}

I believe that the r? to add to onlined_count is incorrect, since that means an error from online_resources interrupts the retry loop.

@liubin You introduced that in bcce1a1911fea04b4c19ad8f3d10cb3a3e423391. Would you agree that it would be better to use is_err() to avoid exiting the function in case of error, but instead retry?

c3d commented 2 years ago

I believe this could be the root cause for #3793 , where we see some CPUs staying off-line.

liubin commented 2 years ago

@c3d Don't break the retry loop and give it more retries is not bad at least.

But I checked the online_resources, should this maybe a cause?

https://github.com/kata-containers/kata-containers/blob/934788eb530fdb5194c9264621cc879219f9d27e/src/agent/src/sandbox.rs#L406-L416

line 407 fs::write(file.as_str(), "1") may return Ok, but the content of that file could be still 0? In that case line 411 count += 1 will count a not onlined CPU as an online CPU.

So check the file content after writing and add the counter when the content really becomes 1 will be acceptable.

c3d commented 2 years ago

line 407 fs::write(file.as_str(), "1") may return Ok, but the content of that file could be still 0? In that case line 411 count += 1 will count a not onlined CPU as an online CPU.

We have logs that seem to show that the boot sequence indeed supports that situation. However, it may happen "long after" we wrote a one.

The uevents we see are this (cleaned up):

Mar 02 15:23:04 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1226", interface: "" }"}"
Mar 02 15:23:04 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1227", interface: "" }"}
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1228", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1229", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1230", interface: "" }"}
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1231", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1232", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1233", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1234", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1235", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1236", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1237", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1239", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1242", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1244", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1246", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1248", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1251", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1253", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1255", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1257", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1260", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1262", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1265", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1267", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1269", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1271", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1273", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu6", devname: "cpu/6/cpuid", subsystem: "cpuid", seqnum: "1275", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1278", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu6", devname: "cpu/6/cpuid", subsystem: "cpuid", seqnum: "1280", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1282", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu4", devname: "cpu/4/cpuid", subsystem: "cpuid", seqnum: "1284", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1287", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu4", devname: "cpu/4/cpuid", subsystem: "cpuid", seqnum: "1289", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1291", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1293", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1295", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1297", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1299", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1301", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1303", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1305", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1307", interface: "" }"}"

So for example, for cpu4, we have this (even more cleaned up):

Mar 02 15:23:05 add /devices/system/cpu/cpu4 
Mar 02 15:23:05 bind /devices/system/cpu/cpu4 
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu4 
Mar 02 15:23:05 online /devices/system/cpu/cpu4 
Mar 02 15:23:05 remove /devices/virtual/cpuid/cpu4 
Mar 02 15:23:05 offline /devices/system/cpu/cpu4 

This seems to happen before the agent runs, but it could happen at the same time, so we might end up with a CPU going back from 1 to 0 while the agent is running.

My suggestion is to rework the loop to count number of 1 entries instead.

c3d commented 2 years ago

The CPU4 is one we did not see as properly onlined.

For CPU5, where it worked, the same shows:

Mar 02 15:23:05 add /devices/system/cpu/cpu5 
Mar 02 15:23:05 bind /devices/system/cpu/cpu5 
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu5 
Mar 02 15:23:05 online /devices/system/cpu/cpu5 
Mar 02 15:23:05 remove /devices/virtual/cpuid/cpu5 
Mar 02 15:23:05 offline /devices/system/cpu/cpu5 
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu5 
Mar 02 15:23:05 online /devices/system/cpu/cpu5 
bergwolf commented 2 years ago

IIRC, all the kernel online actions are asynchronous so even after we write 1 to the sysfs file, we still need to wait for the CPUs to actually become online. Not sure if that's what you got here though.