PQOS race condition during system startup

gsauthof commented 3 years ago

I'm running pqos from a telegraf plugin which sometimes leads to pqos consuming 100 % after boot and a high CAL interrupt rate on all monitored cores.

So perhaps every 4th system boot or so this happens (under RHEL 7.9). The situation can then be cured by simply killing the pqos process and restarting telegraf.

Thus, it looks like there is some race-condition during system boot, e.g. some service that is required by pqos which is started concurrently to the telegraf systemd unit.

I'm happy to add a systemd dependency to telegraf on some other service.

So what services should have been started/completed before the pqos utility can be safely started during boot?

The telegraf plugin starts pqos like this:

pqos -r --iface-os --mon-file-type=csv --mon-interval=INTERVAL --mon-core=all:[CORES]\;mbt:[CORES]

kmabbasi commented 3 years ago

Hi,

Can you please let us know what is the interval value used for monitoring?

Thanks, Khawar

gsauthof commented 3 years ago

I'm using an 1 s interval (i.e. 10 * 100 ms which is also the default of that telegraf plugin).

kmabbasi commented 3 years ago

Would it be possible to reproduce it with pqos only to verify that this is an issue with pqos utility and not telegraph?

Thanks, Khawar

gsauthof commented 3 years ago

Would it be possible to reproduce it with pqos only to verify that this is an issue with pqos utility and not telegraph?

Probably, I could try to directly start it from a simple systemd service, during boot. And then reboot a few times. This should trigger it, as well.

But I can try that earliest next week.

When I can reproduce it like that - is there some information I should gather from the system/the pqos process?

kmabbasi commented 3 years ago

You can try to reproduce pqos with linux kernel (resctrl) and MSR interface, to see whether both have same behavior or not.

You can also see kernel logs for more details if resctrl have some issues and share logs with us.

gsauthof commented 3 years ago

Ok, directly starting pqos from an enabled systemd service actually makes it easier to reproduce this issue, i.e. that way I can reproduce it on every reboot, so far.

I can reproduce it with --iface-os and without (i.e. it uses the MSR interface then).

Some details:

I built pqos from this repository, i.e. my local clone is at:

commit f7ae36b4cde629ec422b0889a6a238431d75f5e1 (HEAD -> master, origin/master, origin/HEAD)
Date:   Mon Jul 5 12:22:30 2021 +0000

In the standalone service I'm calling pqos like this:

/opt/intel-cmt-cat/bin/pqos -r --iface-os --mon-file-type=csv --mon-interval=10 --mon-core=all:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];mbt:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];

That means it's the exact same command that is also constructed and executed by the telegraf plugin.

After a reboot the process uses 100 % CPU - top output:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2137 root      20   0    8824   1096    912 R  99.3  0.0   4:51.60 pqos

(host CPU: Xeon Gold 6144)

There is a high rate of function call interrupts - irqtop output:

                  IRQ                 TOTAL                DELTA NAME
                  CAL              22208104               183955 Function call interrupts

The kernel logs are clean, i.e. no messages/errors related to resctrl/pqos.

A short strace session shows a high rate of certain syscalls:

$ timeout 10 strace -o pqos.log -y  -f -p $(pidof pqos)
$ sort pqos.log | uniq -c | sort -k1,1 -n
[..]
    443 2137  open("/sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_local_bytes", O_RDONLY) = 71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_local_bytes>
    443 2137  open("/sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_total_bytes", O_RDONLY) = 71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_total_bytes>
    443 2137  open("/sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_local_bytes", O_RDONLY) = 71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_local_bytes>
    443 2137  open("/sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_total_bytes", O_RDONLY) = 71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_total_bytes>
    884 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/llc_occupancy>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    884 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/llc_occupancy>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    886 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_local_bytes>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    886 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_00/mbm_total_bytes>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    886 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_local_bytes>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    886 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/mon_data/mon_L3_01/mbm_total_bytes>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
   1328 2137  stat("/sys/fs/resctrl/COS1/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS2/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS3/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS4/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS5/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS6/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   1328 2137  stat("/sys/fs/resctrl/COS7/mon_groups/pqos-2137-0", 0x7ffe3e17aa80) = -1 ENOENT (No such file or directory)
   2656 2137  stat("/sys/fs/resctrl/mon_groups/pqos-2137-0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
  29196 2137  lstat("/sys/fs/resctrl/cpus", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  29197 2137  close(71</sys/fs/resctrl/cpus>)   = 0
  29197 2137  close(71</sys/fs/resctrl/mon_groups>) = 0
  29197 2137  getdents(71</sys/fs/resctrl/mon_groups>, /* 0 entries */, 32768) = 0
  29197 2137  getdents(71</sys/fs/resctrl/mon_groups>, /* 3 entries */, 32768) = 80
  29197 2137  openat(AT_FDCWD, "/sys/fs/resctrl/mon_groups/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 71</sys/fs/resctrl/mon_groups>
  29197 2137  open("/sys/fs/resctrl/cpus", O_RDONLY) = 71</sys/fs/resctrl/cpus>
  29197 2137  read(71</sys/fs/resctrl/cpus>, "000000,00000000,00000000,0000000"..., 4096) = 70
  29197 2137  read(71</sys/fs/resctrl/cpus>, "", 4096) = 0
  30525 2137  close(71</sys/fs/resctrl/mon_groups/pqos-2137-0/cpus>) = 0
  30525 2137  lstat("/sys/fs/resctrl/mon_groups/pqos-2137-0/cpus", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  30525 2137  open("/sys/fs/resctrl/mon_groups/pqos-2137-0/cpus", O_RDONLY) = 71</sys/fs/resctrl/mon_groups/pqos-2137-0/cpus>
  30525 2137  read(71</sys/fs/resctrl/mon_groups/pqos-2137-0/cpus>, "000000,00000000,00000000,0000000"..., 4096) = 70
  30525 2137  read(71</sys/fs/resctrl/mon_groups/pqos-2137-0/cpus>, "", 4096) = 0
  58394 2137  fstat(71</sys/fs/resctrl/cpus>, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  61050 2137  fstat(71</sys/fs/resctrl/mon_groups/pqos-2137-0/cpus>, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  62378 2137  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3eca11d000
  62378 2137  munmap(0x7f3eca11d000, 4096)  = 0

That means e.g. /sys/fs/resctrl/cpus is fstat-ed ~ 6000 times per second!

resctrl is mounted like this:

mount | grep resctrl
resctrl on /sys/fs/resctrl type resctrl (rw,relatime)

The minimal systemd service file (enabled instead of the existing telegraf.sevcice):

$ cat /etc/systemd/system/telegraftest.service
[Unit]
Description=Tests PQOS issue
Documentation=https://github.com/influxdata/telegraf
After=network.target

[Service]
ExecStart=/opt/intel-cmt-cat/bin/pqos -r --iface-os --mon-file-type=csv --mon-interval=10 --mon-core=all:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];mbt:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];
StandardOutput=null

[Install]
WantedBy=multi-user.target

Instructions to reproduce:

systemctl daemon-reload
systemctl enable telegraftest
reboot

When I modify the ExecStart= command such that the MSR interface is used, i.e.:

/opt/intel-cmt-cat/bin/pqos -r --mon-file-type=csv --mon-interval=10 --mon-core=all:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];mbt:[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23];

After a reboot I'm still seeing 100 % CPU usage:

   PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2137 root  20   0    8824    944    772 R 100.0  0.0   2:00.25 pqos

Still high interrupt rate, somewhat lower than before:

                  IRQ                 TOTAL                DELTA NAME
                  CAL              80941845              1524946 Function call interrupts

10 second strace syscall histogram looks different but still high rates:

[..]
    100 2137  pread64(5</dev/cpu/12/msr>, "\20\v\0\0\0\0\0\0", 8, 3214) = 8
    100 2137  pread64(5</dev/cpu/12/msr>, "\r\v\0\0\0\0\0\0", 8, 3214) = 8
    104 2137  pread64(5</dev/cpu/12/msr>, "7\v\0\0\0\0\0\0", 8, 3214) = 8
    117 2137  pread64(5</dev/cpu/12/msr>, "-\v\0\0\0\0\0\0", 8, 3214) = 8
    120 2137  pread64(7</dev/cpu/2/msr>, "\4\0\0\0\0\0\0\0", 8, 3214) = 8
    134 2137  pread64(5</dev/cpu/12/msr>, ".\v\0\0\0\0\0\0", 8, 3214) = 8
    144 2137  pread64(5</dev/cpu/12/msr>, "\v\v\0\0\0\0\0\0", 8, 3214) = 8
    149 2137  pread64(5</dev/cpu/12/msr>, "\16\v\0\0\0\0\0\0", 8, 3214) = 8
    149 2137  pread64(5</dev/cpu/12/msr>, ";\v\0\0\0\0\0\0", 8, 3214) = 8
    149 2137  pread64(5</dev/cpu/12/msr>, "+\v\0\0\0\0\0\0", 8, 3214) = 8
    151 2137  pread64(5</dev/cpu/12/msr>, "\35\v\0\0\0\0\0\0", 8, 3214) = 8
    154 2137  pread64(5</dev/cpu/12/msr>, "\26\v\0\0\0\0\0\0", 8, 3214) = 8
    156 2137  pread64(5</dev/cpu/12/msr>, "\f\v\0\0\0\0\0\0", 8, 3214) = 8
    160 2137  pread64(7</dev/cpu/2/msr>, "\347\t\0\0\0\0\0\0", 8, 3214) = 8
    196 2137  pread64(5</dev/cpu/12/msr>, "'\v\0\0\0\0\0\0", 8, 3214) = 8
    197 2137  pread64(5</dev/cpu/12/msr>, "%\v\0\0\0\0\0\0", 8, 3214) = 8
    198 2137  pread64(5</dev/cpu/12/msr>, "6\v\0\0\0\0\0\0", 8, 3214) = 8
    199 2137  pread64(5</dev/cpu/12/msr>, "\6\v\0\0\0\0\0\0", 8, 3214) = 8
    219 2137  pread64(5</dev/cpu/12/msr>, "$\v\0\0\0\0\0\0", 8, 3214) = 8
    248 2137  pread64(5</dev/cpu/12/msr>, "\10\v\0\0\0\0\0\0", 8, 3214) = 8
    370 2137  pread64(5</dev/cpu/12/msr>, "9\v\0\0\0\0\0\0", 8, 3214) = 8
    394 2137  pread64(5</dev/cpu/12/msr>, "\"\v\0\0\0\0\0\0", 8, 3214) = 8
    396 2137  pread64(5</dev/cpu/12/msr>, "(\v\0\0\0\0\0\0", 8, 3214) = 8
    473 2137  pread64(5</dev/cpu/12/msr>, "#\v\0\0\0\0\0\0", 8, 3214) = 8
    497 2137  pread64(7</dev/cpu/2/msr>, "\351\t\0\0\0\0\0\0", 8, 3214) = 8
    543 2137  pread64(5</dev/cpu/12/msr>, "!\v\0\0\0\0\0\0", 8, 3214) = 8
    646 2137  pread64(7</dev/cpu/2/msr>, "\7\0\0\0\0\0\0\0", 8, 3214) = 8
    794 2137  pread64(5</dev/cpu/12/msr>, "4\v\0\0\0\0\0\0", 8, 3214) = 8
    826 2137  pread64(5</dev/cpu/12/msr>, "!\1\0\0\0\0\0\0", 8, 3214) = 8
   2231 2137  pread64(7</dev/cpu/2/msr>, "\353\t\0\0\0\0\0\0", 8, 3214) = 8
   2355 2137  pread64(7</dev/cpu/2/msr>, "\352\t\0\0\0\0\0\0", 8, 3214) = 8
   2684 2137  pread64(7</dev/cpu/2/msr>, "\350\t\0\0\0\0\0\0", 8, 3214) = 8
   2844 2137  pread64(7</dev/cpu/2/msr>, "=\t\0\0\0\0\0\0", 8, 3214) = 8
   3519 2137  pread64(7</dev/cpu/2/msr>, "\5\0\0\0\0\0\0\0", 8, 3214) = 8
   3637 2137  pread64(7</dev/cpu/2/msr>, "\6\0\0\0\0\0\0\0", 8, 3214) = 8
   5083 2137  pread64(7</dev/cpu/2/msr>, ">\t\0\0\0\0\0\0", 8, 3214) = 8
   7101 2137  pread64(5</dev/cpu/12/msr>, "\"\1\0\0\0\0\0\0", 8, 3214) = 8
   7926 2137  fcntl(3</run/lock/libpqos>, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_CUR, l_start=0, l_len=0}) = 0
   7926 2137  fcntl(3</run/lock/libpqos>, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=0}) = 0
   7926 2137  pwrite64(5</dev/cpu/12/msr>, "\1\0\0\0\1\0\0\0", 8, 3213) = 8
   7926 2137  pwrite64(7</dev/cpu/2/msr>, "\1\0\0\0\1\0\0\0", 8, 3213) = 8
   7926 2137  stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0
   7927 2137  pwrite64(5</dev/cpu/12/msr>, "\2\0\0\0\1\0\0\0", 8, 3213) = 8
   7927 2137  pwrite64(5</dev/cpu/12/msr>, "\3\0\0\0\1\0\0\0", 8, 3213) = 8
   7927 2137  pwrite64(7</dev/cpu/2/msr>, "\2\0\0\0\1\0\0\0", 8, 3213) = 8
   7927 2137  pwrite64(7</dev/cpu/2/msr>, "\3\0\0\0\1\0\0\0", 8, 3213) = 8

As mentioned before, it's possible to kill that misbehaving pqos process and when started again, it doesn't show this pathological behavior.

Just adding a sleep before the pqos start during boot helps, i.e. when adding

[Service]
ExecStartPre=/bin/sleep 10

I can't reproduce this issue anymore.

mdcornu commented 3 years ago

Hi,

Thanks for getting back, this is very useful info. Can you tell us what kernel version you are using? The library uses linux perf to provide IPC and LLC misses metrics. One more thing you could try, just eliminate perf as the cause, would be to add "--disable-mon-ipc --disable-mon-llc_miss" to the pqos cmd and check if the issue is still present.

Thanks, Marcel

gsauthof commented 3 years ago

I'm using a stock RHEL Kernel: 3.10.0-1160.25.1.el7.x86_64

When removing any sleep and adding --disable-mon-ipc --disable-mon-llc_miss I can't reproduce the issue anymore.

kmabbasi commented 3 years ago

looks like an issue with perf. Do you need these flags for Telegraph? These flags disable perf call. If you don't need IPC or LLC misses metrics then you can leave them disabled.

Thanks, Khawar

gsauthof commented 3 years ago

Yes, I need those metrics. I think that when monitoring LLC occupancy also monitoring the LLC miss rate is important as both paint a more complete picture.

Since the unusually high number of syscalls (like statting a file 6000 times a second) are issued by userspace, i.e. by the pqos process (either directly or via a library) - can't you just fix the program logic that yields those syscalls?

I mean it looks like that code don't checks for some error condition. At least it could detect that it issues syscalls way to often - and in that case reinitialize its perf part again - in a similar way as it happens when I restart the process, externally.

kmabbasi commented 3 years ago

Thanks for the feedback.

We are looking into it and will come back to you when we have update.

Regards, Khawar

gsauthof commented 2 years ago

I found the root cause for the excessive amount of syscalls (and 100 % CPU usage) when pqos is started during boot.

Looking at the main monitoring loop in version 4.2.0 (i.e. monitor_loop() in pqos/monitor.c) I noticed that your nanosleep() call at the end of an iteration is guarded in order to account for the time lost for doing the actual measurement:

https://github.com/intel/intel-cmt-cat/blob/185e61d02ee288975d405ca8baff3dfee601f373/pqos/monitor.c#L2808-L2834

Ok, this is one way to it.

Anyway, looking at my strace logs, I see that nanosleep() isn't included in the listing of the most called syscalls. It isn't even included in the strace log, at all, i.e. it's never called. This either means that the monitoring loop is never entered or that the guard never evaluates true.

The way you implement the time calculations and guard the nanosleep() doesn't expect clock changes. Your code uses the obsolete gettimeofday() function which basically uses the same time source as clock_gettime(CLOCK_REALTIME)- and thus it's also affected by non-monotonic time changes. See also:

  The time returned by gettimeofday() is affected by discontinuous jumps
  in the system time (e.g., if the system  administrator  manually  changes
  the  system time).  If you need a monotonically increasing clock, see clock_gettime(2).

(https://manpath.be/c7/2/gettimeofday#L69)

So we are running our servers with PTP time synchronization services. And during system-boot the CLOCK_REALTIMEmay be stepped by the PTP-daemon in order to speed up clock synchronization. After the boot is finished the daemon isn't stepped anymore, i.e. further time adjustments are made by slowing down/speeding up the clock.

So now there is race during boot between the clock-stepping and the pqos start, i.e. the

https://github.com/intel/intel-cmt-cat/blob/185e61d02ee288975d405ca8baff3dfee601f373/pqos/monitor.c#L2725

gettimeofday() call which happens right before the monitoring loop.

In case the clock is stepped backed some microseconds, then usec_diff can easily be always negative for the rest of the program runtime. Or it can be always greater than the interval. Because you have copied tv_start to tv_s, and never re-initialize tv_s, you just add the interval to it, while looping:

https://github.com/intel/intel-cmt-cat/blob/185e61d02ee288975d405ca8baff3dfee601f373/pqos/monitor.c#L2837

You changed the structure of the monitoring loop in version 4.3.0:

https://github.com/intel/intel-cmt-cat/blob/5da8fd56eb16846e95b334f83761c4567054ac07/pqos/monitor.c#L2754-L2771

However, this still isn't robust against system clock jumps.

So my suggestion for fixing this is to use the Linux TimerFD-API instead of doing custom time accounting. That means creating a periodically expiring TimerFD, where the period is set to the measurement interval. Then you just need to read() from that fd, no need to implement some tedious and error-prone remaining time computations etc. So the TimerFD API really simplifies things.

intel / intel-cmt-cat

PQOS race condition during system startup #194