Open c3d opened 2 years ago
I believe this could be the root cause for #3793 , where we see some CPUs staying off-line.
@c3d Don't break the retry loop and give it more retries is not bad at least.
But I checked the online_resources
, should this maybe a cause?
line 407 fs::write(file.as_str(), "1")
may return Ok, but the content of that file could be still 0
? In that case line 411 count += 1
will count a not onlined CPU as an online CPU.
So check the file content after writing and add the counter when the content really becomes 1
will be acceptable.
line 407 fs::write(file.as_str(), "1") may return Ok, but the content of that file could be still 0? In that case line 411 count += 1 will count a not onlined CPU as an online CPU.
We have logs that seem to show that the boot sequence indeed supports that situation. However, it may happen "long after" we wrote a one.
The uevents we see are this (cleaned up):
Mar 02 15:23:04 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1226", interface: "" }"}"
Mar 02 15:23:04 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1227", interface: "" }"}
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1228", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1229", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1230", interface: "" }"}
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1231", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1232", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1233", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1234", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1235", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1236", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "bind", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1237", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1239", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1242", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1244", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1246", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1248", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1251", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1253", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1255", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1257", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1260", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1262", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1265", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu2", devname: "cpu/2/cpuid", subsystem: "cpuid", seqnum: "1267", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu2", devname: "", subsystem: "cpu", seqnum: "1269", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1271", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1273", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu6", devname: "cpu/6/cpuid", subsystem: "cpuid", seqnum: "1275", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1278", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu6", devname: "cpu/6/cpuid", subsystem: "cpuid", seqnum: "1280", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu6", devname: "", subsystem: "cpu", seqnum: "1282", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu4", devname: "cpu/4/cpuid", subsystem: "cpuid", seqnum: "1284", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1287", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu4", devname: "cpu/4/cpuid", subsystem: "cpuid", seqnum: "1289", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu4", devname: "", subsystem: "cpu", seqnum: "1291", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "remove", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1293", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "offline", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1295", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu5", devname: "cpu/5/cpuid", subsystem: "cpuid", seqnum: "1297", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu5", devname: "", subsystem: "cpu", seqnum: "1299", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu3", devname: "cpu/3/cpuid", subsystem: "cpuid", seqnum: "1301", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu3", devname: "", subsystem: "cpu", seqnum: "1303", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "add", devpath: "/devices/virtual/cpuid/cpu1", devname: "cpu/1/cpuid", subsystem: "cpuid", seqnum: "1305", interface: "" }"}"
Mar 02 15:23:05 :"Uevent { action: "online", devpath: "/devices/system/cpu/cpu1", devname: "", subsystem: "cpu", seqnum: "1307", interface: "" }"}"
So for example, for cpu4, we have this (even more cleaned up):
Mar 02 15:23:05 add /devices/system/cpu/cpu4
Mar 02 15:23:05 bind /devices/system/cpu/cpu4
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu4
Mar 02 15:23:05 online /devices/system/cpu/cpu4
Mar 02 15:23:05 remove /devices/virtual/cpuid/cpu4
Mar 02 15:23:05 offline /devices/system/cpu/cpu4
This seems to happen before the agent runs, but it could happen at the same time, so we might end up with a CPU going back from 1 to 0 while the agent is running.
My suggestion is to rework the loop to count number of 1
entries instead.
The CPU4 is one we did not see as properly onlined.
For CPU5, where it worked, the same shows:
Mar 02 15:23:05 add /devices/system/cpu/cpu5
Mar 02 15:23:05 bind /devices/system/cpu/cpu5
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu5
Mar 02 15:23:05 online /devices/system/cpu/cpu5
Mar 02 15:23:05 remove /devices/virtual/cpuid/cpu5
Mar 02 15:23:05 offline /devices/system/cpu/cpu5
Mar 02 15:23:05 add /devices/virtual/cpuid/cpu5
Mar 02 15:23:05 online /devices/system/cpu/cpu5
IIRC, all the kernel online
actions are asynchronous so even after we write 1
to the sysfs file, we still need to wait for the CPUs to actually become online. Not sure if that's what you got here though.
The agent's code right now onlines CPUs as follows:
I believe that the
r?
to add toonlined_count
is incorrect, since that means an error fromonline_resources
interrupts the retry loop.@liubin You introduced that in bcce1a1911fea04b4c19ad8f3d10cb3a3e423391. Would you agree that it would be better to use
is_err()
to avoid exiting the function in case of error, but instead retry?