madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

High CPU use in clock_gettime (TSC clocksource unavailable on itscrd03) #116

Closed valassi closed 1 year ago

valassi commented 3 years ago

I wanted to reproduce a few performance numbers, in case we wanted to include some in the vchep paper. So I went back in time to older commits and tried to reproduce some numbers. In particular I went back to commit https://github.com/madgraph5/madgraph4gpu/commit/715d26ffe830c3c9794419c45e86d0c04f474365 because this is the first one where I added some numbers to the logs.

I got very surprised because I was unable to reproduce the c++ throughput on my itscrd03. Eventually, I understood that this node gives a VERY different performance (a factor 3 to 4 slower!) than an lxbatch GPU node (like the one I had used for the performance recorded in the commit log) or my physical node pmpe04.

The CPUs are similar Intel models. I wondered if there was a problem with memory access or swappiness, but it does not look like it. Eventually I played a bit with flamegraphs and found out that itscrd03 seems to spend 70% of the CPU time in clock_gettime, unlike the other nodes.

This is itscrd03

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 3.760869e+00 sec
MeanTimeInWaveFuncs   = 9.564384e-06 sec
StdDevTimeInWaveFuncs = 4.128436e-06 sec
MinTimeInWaveFuncs    = 7.822000e-06 sec
MaxTimeInWaveFuncs    = 6.076190e-04 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 1.045546e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m7.065s
user    0m2.196s
sys     0m4.864s

And this is pmpe04 using the exact same executable (shared on AFS)

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 1.143175e+00 sec
MeanTimeInWaveFuncs   = 2.907244e-06 sec
StdDevTimeInWaveFuncs = 6.344392e-07 sec
MinTimeInWaveFuncs    = 2.530000e-06 sec
MaxTimeInWaveFuncs    = 8.822100e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 3.439683e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.777s
user    0m1.711s
sys     0m0.063s

Note that the pmpe04 throughput of 3.5E5 is compatible with the 3.0E5 I had recorded on lxbatch (a bit higher, as pmpe04 is also a physical machine). The throughput on itscrd03, instead, is a factor 3 lower, which is why I started this investigation.

Here are two flamegraphs (svg converted to png for github):

itscrd03 spends 70% of its time in clock_gettime (perf-itscrd03)

pmpe04 does not have that overhead (perf-pmpe04)

This is about old code; however, there are two things that may be useful to understand:

Nothing urgent, but useful to keep in mind...

valassi commented 3 years ago

PS These were generated with

perf record -o perf.data -F 1000 -g -- time ./check.exe 393216
perf script -i perf.data | stackcollapse-perf.pl > perf.fold
flamegraph.pl perf.fold > perf.svg

The flamegraphs have ~2k and ~7k samples because they take 2s and 7s, with a 1000Hz sampling.

(Just to avoid doubts: removing "time" does not make a difference, the throughput remains 1E5 on itscrd03 and the flamegraph still includes clock_gettime, within check.exe).

This is in madgraph4gpu/examples/standalone_cpp/eemumu/SubProcesses/P1_Sigma_sm_epem_mupmum

valassi commented 3 years ago

I wonder if the issue comes from this file https://github.com/madgraph5/madgraph4gpu/blob/715d26ffe830c3c9794419c45e86d0c04f474365/examples/gpu/eemumu/SubProcesses/P1_Sigma_sm_epem_mupmum/timer.h

But it looks very similar to the current one https://github.com/madgraph5/madgraph4gpu/blob/9ebd056d2ff792bd027088526f5f1a870d505826/epoch1/cuda/ee_mumu/SubProcesses/timer.h

We always use it with the high resolution clock, I think:

#define TIMERTYPE std::chrono::high_resolution_clock

https://github.com/madgraph5/madgraph4gpu/blob/9ebd056d2ff792bd027088526f5f1a870d505826/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/timermap.h#L11

valassi commented 3 years ago

This suggested using strace: https://stackoverflow.com/a/57761741 Indeed, on itscrd03 I get zillions of clock_gettime calls.

That SO answer suggests "Your clock_gettime is not using VDSO and is a system call instead - will be visible if run under strace, can be configured on modern kernel versions"... to be understood

valassi commented 3 years ago

And note this may well be our std::chrono https://stackoverflow.com/a/50883964 "Normal std::chrono implementations will use an OS-provided function like clock_gettime on Unix. (On Linux, this can run purely in user-space, code + scale factor data in a VDSO page mapped by the kernel into every process's address space. Low-overhead timesources are nice. Avoiding a user->kernel->user round trip helps a lot with Meltdown + Spectre mitigation enabled.)"

valassi commented 3 years ago

Actually, there is one difference: on itscrd03 I am on CentOS8 with the native compiler, while on the other nodes I take gcc8 from cvmfs. However, I checked with strace which libraries are taken from cvmfs, and preloading them does not change anything:

LD_PRELOAD=/cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/lib64/libgcc_s.so.1 ./check.exe 393216

This still gives 1E5 on itscrd03, not 3.5E5...

valassi commented 3 years ago

WOW.

I wanted to try and fix the issue on itscrd03 by testing gcc9 instead of gcc8. I got the opposite effect: using gcc9 does NOT speed up itscrd03 (remains at 1E5), but it does speed up pmpe04 (jumps from 3.5E5 to 5.5E5!)...

The difference is in

. /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0/x86_64-centos7/setup.sh
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh

Note that I had chosen gcc8 over gcc9, maybe because this preserves cuda-gdb better... to be cross checked.

Summary: another mystery. Why does gcc9 improve so much over gcc8 (and only on one node)?

Anyway, again... this is old code...

valassi commented 3 years ago

This is pmpe04 with gcc9

time ./check.exe 393216
***********************************
NumberOfEntries       = 393216
TotalTimeInWaveFuncs  = 7.261218e-01 sec
MeanTimeInWaveFuncs   = 1.846623e-06 sec
StdDevTimeInWaveFuncs = 4.289706e-07 sec
MinTimeInWaveFuncs    = 1.564000e-06 sec
MaxTimeInWaveFuncs    = 8.218800e-05 sec
-----------------------------------
NumMatrixElements     = 393216
MatrixElementsPerSec  = 5.415289e+05 sec^-1
***********************************
NumMatrixElements     = 393216
MeanMatrixElemValue   = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue    = 6.071582e-03 GeV^0
MaxMatrixElemValue    = 3.374887e-02 GeV^0

real    0m1.378s
user    0m1.305s
sys     0m0.067s
valassi commented 3 years ago

Similar results with gcc10 as with gcc9:

. /cvmfs/sft.cern.ch/lcg/releases/gcc/10.1.0/x86_64-centos7/setup.sh

That is, itscrd03 at 1E5, pmpe04 at 5.5E5...

valassi commented 3 years ago

For the record, I wondered if CentOS8 was the issue, but it is not. I tried lxplus8 and it works well there (the throughput is higher). I checked that it has exactly the same libraries as the buggy itscrd03, including libc, so libc.so is not the issue.

valassi commented 3 years ago

Anyway, the issue now is much clearer thanks to David!

He reported that << tracing with gdb it seems execution does go via vdso, but between, e.g. itscrd02 and itscrd03 an internal variable ('vclock_mode') is set differently and it forces a fallback to the system call for clock_gettime() in the case of itscrd03. I believe this is to do with the clock source. e.g. on pmpe01 & itscrd02 /sys/devices/system/clocksource/clocksource0/current_clocksource shows 'tsc'. On itscrd03/itscrd04 it is acpi_pm, and on those two latter machines the kernel logged something similar to 'clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:' during boot. One can write to the 'current_clocksource' and probably change it to tsc, but I don't know why the kernel has marked it unstable. I expected itscrd02 and itscrd03 would be very similar, both openstack machines of the same flavour, running the same kernel version and with the same kernel boot parameters. >>

Indeed I checked that: on itscrd03,

dmesg | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2100.019 MHz processor
[    0.007003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45428f188, max_idle_ns: 440795266913 ns
[    0.013299] TSC deadline timer enabled
[    4.063044] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[    4.063049] clocksource:                       'tsc-early' cs_now: 76958b0d24144 cs_last: 769584b672318 mask: ffffffffffffffff
[    4.063050] tsc: Marking TSC unstable due to clocksource watchdog

cat /sys/devices/system/clocksource/*/available_clocksource
acpi_pm 

while on itscrd02

dmesg | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2100.166 MHz processor
[    0.007003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45cd446fa, max_idle_ns: 440795212180 ns
[    0.013216] TSC deadline timer enabled
[    0.184142] clocksource: Switched to clocksource tsc-early
[    0.971103] systemd[1]: Set hostname to <itscrd02.cern.ch>.
[    1.631038] tsc: Refined TSC clocksource calibration: 2100.002 MHz
[    1.631049] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e453273c08, max_idle_ns: 440795283667 ns
[    1.631068] clocksource: Switched to clocksource tsc

cat /sys/devices/system/clocksource/*/available_clocksource
tsc acpi_pm 

See also https://serverfault.com/questions/681748/tsc-missing-from-available-clocksources which describes essentially the same issue.

Not clear why TSC is giving issues on this machine. Maybe it's something to do with the virtual machine infrastructure, eg this Dec 2019 Suse/SAP notice https://www.suse.com/support/kb/doc/?id=000019562 mentions "However on some virtualization and/or cloud solutions with kernel version 4.4.178-94.91 or lower, the "tsc" is not available at the aforementioned locations".

valassi commented 3 years ago

Following https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source I modified /etc/default/grub by adding " clocksource=tsc tsc=reliable" at the end of GRUB_CMDLINE_LINUX. Then I ran "grub2-mkconfig -o /boot/grub2/grub.cfg" and rebooted. (David also suggested "grubby --update-kernel=ALL --args=\"tsc=reliable\"" as an alternative.)

But it is still not enough:

dmesg | grep -i tsc
[    0.000000] Command line: BOOT_IMAGE= ... clocksource=tsc tsc=reliable
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2100.055 MHz processor
[    0.000000] Kernel command line: BOOT_IMAGE= ... clocksource=tsc tsc=reliable
[    0.001000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45649026e, max_idle_ns: 440795259094 ns
[    0.001000] TSC deadline timer enabled
[    4.019019] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[    4.019025] clocksource:                       'tsc-early' cs_now: 4208573b47b76a cs_last: 42085716e25bb4 mask: ffffffffffffffff
[    4.019025] tsc: Marking TSC unstable due to clocksource watchdog

I will open a ticket on our VM infrastructure, and probably rebuild the node on another hypervisor...

valassi commented 3 years ago

Maybe 'tsc=reliable' is not working because of this bug in the Linux kernel (Oct 2019)?

http://lkml.iu.edu/hypermail/linux/kernel/1910.0/01948.html

"Introduction of clocksource_tsc_early broke functionality of "tsc=reliable" and "tsc=nowatchdog" boot params, since clocksource_tsc_early is always registered with CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list.

If CPU is very busy during boot, the watchdog clocksource may not be read frequently enough, resulting in a large difference that is treated as "negative" by clocksource_delta() and incorrectly truncated to 0.

This false alarm causes TSC to be declared unstable: ..."

valassi commented 3 years ago

Just for the record, I am now using itscrd70 on CentOS7 and I do not have this issue. But the underlying issue on some (not all) CentOS8 VMs is not solved yet.

valassi commented 1 year ago

I am closing this very old issue.

First, I did find one of our nodes that has no TSC (itscrd-a100 on CentOS8). Then I ran eemumu through perf, and it does not show any problem with system calls. So I guess this is no longer a problem even when no TSC is available?

[avalassi@itscrd-a100 gcc11.2/cvmfs] ~> cat /etc/redhat-release; cat /sys/devices/system/clocksource/*/available_clocksource
CentOS Stream release 8
acpi_pm 
[avalassi@itscrd-a100 gcc11.2/cvmfs] ~> dmesg | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 3000.028 MHz processor
[    0.000000] TSC deadline timer available
[    0.001000] tsc: Marking TSC unstable due to TSCs unsynchronized
[    1.689903] systemd[1]: Set hostname to <itscrd-a100.cern.ch>.
[    3.465207] systemd[1]: Set hostname to <itscrd-a100.cern.ch>.
...
[avalassi@itscrd-a100 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_ll_ll> DEBUGINFOD_PROGRESS=1 perf record -o perf.data -F 1000 ./check.exe -p 256 256 100^C
...
[avalassi@itscrd-a100 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_ll_ll> perf report --stdio^C
...
    16.26%  check.exe  libm-2.28.so            [.] __ieee754_log_fma
    14.05%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::FFV2_4_3<KernelAc>
    12.97%  check.exe  check.exe               [.] mg5amcCpu::ramboGetMomentaFi>
    11.67%  check.exe  libm-2.28.so            [.] __sincos
    11.56%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::calculate_wavefun>
    11.48%  check.exe  libcurand.so.10.3.0.86  [.] curandGenerateUniformDouble
     5.59%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::FFV1_0<KernelAcce>
     5.42%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::FFV2_4_0<KernelAc>
     4.13%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::CrossSectionKerne>
     2.60%  check.exe  libmg5amc_ll_ll_cpp.so  [.] mg5amcCpu::sigmaKin
     1.24%  check.exe  check.exe               [.] mg5amcCpu::RamboSamplingKern>
     0.92%  check.exe  check.exe               [.] mg5amcCpu::RamboSamplingKern>
     0.37%  check.exe  libc-2.28.so            [.] __memset_avx2_unaligned_erms

Second, I have just installed itscrd80 on Alma8, which will be my new reference node. This has TSC anyway

[avalassi@itscrd80 bash] ~> cat /etc/redhat-release; cat /sys/devices/system/clocksource/*/available_clocksource
AlmaLinux release 8.7 (Stone Smilodon)
tsc acpi_pm 
[avalassi@itscrd80 bash] ~> dmesg  | grep -i tsc
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2099.919 MHz processor
[    0.000000] TSC deadline timer available
[    0.001000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e44e3ef694, max_idle_ns: 440795256857 ns
[    0.248185] clocksource: Switched to clocksource tsc-early
[    1.759043] tsc: Refined TSC clocksource calibration: 2099.992 MHz
[    1.759057] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e45290e7fd, max_idle_ns: 440795303772 ns
[    1.759089] clocksource: Switched to clocksource tsc
[    2.827358] systemd[1]: Set hostname to <itscrd80.cern.ch>.

So, all in all, this can be closed.

PS(1) By the way, note that perf now fills up ~/.debug with a cache. I used the trick below to move it out of AFS. Note also that perf record seems to hang while populating the buildid cache.

more ~/.perfconfig
# (2023.02.13) avoid filling up ~/.debug on AFS!
# See https://unix.stackexchange.com/questions/669723/what-creates-debug-folder
# See https://perf.wiki.kernel.org/index.php/Tutorial#The_build-id_cache
[buildid]
        # Default is ~/.debug, disable using /dev/null
        dir = /data/avalassi/PERF_BUILDID

PS(2) A related point, perf also seems to hang because the buildid cache also relies on the debuginfod cache (so it populates ~/.cache/debuginfod_client). See https://wiki.debian.org/Debuginfod https://lore.kernel.org/lkml/YbUCjKYiLBgnDqq8@krava/T/ https://fedoraproject.org/wiki/Debuginfod To disable this (centos8 default is DEBUGINFOD_URLS=https://debuginfod.centos.org):

unset DEBUGINFOD_URLS

To change the cache location

export DEBUGINFOD_CACHE_PATH=/data/avalassi/DEBUGINFOD_CACHE

or even better

export XDG_CACHE_HOME=/data/avalassi/XDG_CACHE

And to display the progress in perf

DEBUGINFOD_PROGRESS=1 perf record 
valassi commented 2 months ago

Note: I finally developed some rdtsc-based counters in any case, bypassing std::chrono clocks: #972