PS These were generated with
perf record -o perf.data -F 1000 -g -- time ./check.exe 393216
perf script -i perf.data | stackcollapse-perf.pl > perf.fold
flamegraph.pl perf.fold > perf.svg
The flamegraphs have ~2k and ~7k samples because they take 2s and 7s, with a 1000Hz sampling.
(Just to avoid doubts: removing "time" does not make a difference; the throughput remains 1E5 on itscrd03 and the flamegraph still includes clock_gettime, within check.exe).
This is in madgraph4gpu/examples/standalone_cpp/eemumu/SubProcesses/P1_Sigma_sm_epem_mupmum
I wonder if the issue comes from this file https://github.com/madgraph5/madgraph4gpu/blob/715d26ffe830c3c9794419c45e86d0c04f474365/examples/gpu/eemumu/SubProcesses/P1_Sigma_sm_epem_mupmum/timer.h
But it looks very similar to the current one https://github.com/madgraph5/madgraph4gpu/blob/9ebd056d2ff792bd027088526f5f1a870d505826/epoch1/cuda/ee_mumu/SubProcesses/timer.h
We always use it with the high resolution clock, I think.
This suggested using strace: https://stackoverflow.com/a/57761741. Indeed, on itscrd03 I get zillions of clock_gettime calls.
That SO answer suggests "Your clock_gettime is not using VDSO and is a system call instead - will be visible if run under strace, can be configured on modern kernel versions"... to be understood
And note this may well be our std::chrono https://stackoverflow.com/a/50883964 "Normal std::chrono implementations will use an OS-provided function like clock_gettime on Unix. (On Linux, this can run purely in user-space, code + scale factor data in a VDSO page mapped by the kernel into every process's address space. Low-overhead timesources are nice. Avoiding a user->kernel->user round trip helps a lot with Meltdown + Spectre mitigation enabled.)"
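For reference, here is a minimal sketch of the kind of std::chrono stopwatch that timer.h presumably implements (this is just an illustration, not the actual timer.h code): every now() call ends up in clock_gettime on Linux, which is cheap only while it stays in the vDSO.

```cpp
// Minimal illustration (not the real timer.h): a std::chrono based stopwatch.
// On Linux each now() maps to clock_gettime(); running the binary under
// "strace -c" shows whether those calls stay in the vDSO (invisible to strace)
// or fall back to real system calls (visible, and much slower).
#include <chrono>
#include <cstdio>

class SimpleTimer {
public:
  void start() { m_start = std::chrono::high_resolution_clock::now(); }
  double stopAndGetSeconds() {
    const auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>( stop - m_start ).count();
  }
private:
  std::chrono::high_resolution_clock::time_point m_start;
};

int main() {
  SimpleTimer timer;
  double total = 0;
  for ( int i = 0; i < 1000000; ++i ) { // many start/stop pairs, as in the per-ME timing loop
    timer.start();
    total += timer.stopAndGetSeconds();
  }
  printf( "total time between start/stop pairs: %g s\n", total );
  return 0;
}
```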
Actually, there is one difference: on itscrd03 I am on CentOS8 with the native compiler, while on the other nodes I take gcc8 from cvmfs. However, I checked with strace which libraries are taken from cvmfs, and preloading them does not change anything:
LD_PRELOAD=/cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/lib64/libgcc_s.so.1 ./check.exe 393216
This still gives 1E5 on itscrd03, not 3.5E5...
WOW.
I wanted to try and fix the issue on itscrd03 by testing gcc9 instead of gcc8. I got the opposite effect: using gcc9 does NOT speed up itscrd03 (remains at 1E5), but it does speed up pmpe04 (jumps from 3.5E5 to 5.5E5!)...
The difference is in
. /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0/x86_64-centos7/setup.sh
versus
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
Note that I had chosen gcc8 over gcc9, maybe because this preserves cuda-gdb better... to be cross checked.
Summary: another mystery, why does gcc9 improve so much over gcc8 (and only on one node)?
Anyway, again... this is old code...
This is pmpe04 with gcc9
time ./check.exe 393216
***********************************
NumberOfEntries = 393216
TotalTimeInWaveFuncs = 7.261218e-01 sec
MeanTimeInWaveFuncs = 1.846623e-06 sec
StdDevTimeInWaveFuncs = 4.289706e-07 sec
MinTimeInWaveFuncs = 1.564000e-06 sec
MaxTimeInWaveFuncs = 8.218800e-05 sec
-----------------------------------
NumMatrixElements = 393216
MatrixElementsPerSec = 5.415289e+05 sec^-1
***********************************
NumMatrixElements = 393216
MeanMatrixElemValue = 1.372012e-02 GeV^0
StdErrMatrixElemValue = 1.307144e-05 GeV^0
StdDevMatrixElemValue = 8.196700e-03 GeV^0
MinMatrixElemValue = 6.071582e-03 GeV^0
MaxMatrixElemValue = 3.374887e-02 GeV^0
real 0m1.378s
user 0m1.305s
sys 0m0.067s
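(Sanity check: MatrixElementsPerSec is presumably just NumMatrixElements / TotalTimeInWaveFuncs, i.e. 393216 / 7.261218e-01 s ≈ 5.415e+05 s^-1, matching the value printed above.)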
Similar results as gcc9 with gcc10 (. /cvmfs/sft.cern.ch/lcg/releases/gcc/10.1.0/x86_64-centos7/setup.sh). That is, itscrd03 at 1E5, pmpe04 at 5.5E5...
For the record, I wondered if CentOS8 was the issue, but it is not. I tried lxplus8 and it works well there (throughput is higher). I checked that it has exactly the same libraries as the buggy itscrd03, including libc, so it is not libc.so that is the issue.
Anyway, the issue now is much clearer thanks to David!
He reported that << tracing with gdb it seems execution does go via vdso, but between, e.g. itscrd02 and itscrd03 an internal variable ('vclock_mode') is set differently and it forces a fallback to the system call for clock_gettime() in the case of itscrd03. I believe this is to do with the clock source. e.g. on pmpe01 & itscrd02 /sys/devices/system/clocksource/clocksource0/current_clocksource shows 'tsc'. On itscrd03/itscrd04 it is acpi_pm, and on those two latter machines the kernel logged something similar to 'clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:' during boot. One can write to the 'current_clocksource' and probably change it to tsc, but I don't know why the kernel has marked it unstable. I expected itscrd02 and itscrd03 would be very similar, both openstack machines of the same flavour, running the same kernel version and with the same kernel boot parameters. >>
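A quick way to check this programmatically is to read the same sysfs files David mentions; here is a small sketch (just an illustration, reading the paths quoted above):

```cpp
// Small sketch: print the current and available clock sources from sysfs
// (the same files quoted above; 'tsc' is the fast one, while 'acpi_pm' is the
// one that makes clock_gettime fall back to a real system call here).
#include <fstream>
#include <iostream>
#include <string>

static std::string readLine( const std::string& path ) {
  std::ifstream f( path );
  std::string line;
  std::getline( f, line );
  return f ? line : std::string( "<unreadable>" );
}

int main() {
  const std::string base = "/sys/devices/system/clocksource/clocksource0/";
  std::cout << "current_clocksource:   " << readLine( base + "current_clocksource" ) << std::endl;
  std::cout << "available_clocksource: " << readLine( base + "available_clocksource" ) << std::endl;
  return 0;
}
```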
Indeed I checked that: on itscrd03,
dmesg | grep -i tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2100.019 MHz processor
[ 0.007003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45428f188, max_idle_ns: 440795266913 ns
[ 0.013299] TSC deadline timer enabled
[ 4.063044] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[ 4.063049] clocksource: 'tsc-early' cs_now: 76958b0d24144 cs_last: 769584b672318 mask: ffffffffffffffff
[ 4.063050] tsc: Marking TSC unstable due to clocksource watchdog
cat /sys/devices/system/clocksource/*/available_clocksource
acpi_pm
while on itscrd02
dmesg | grep -i tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2100.166 MHz processor
[ 0.007003] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45cd446fa, max_idle_ns: 440795212180 ns
[ 0.013216] TSC deadline timer enabled
[ 0.184142] clocksource: Switched to clocksource tsc-early
[ 0.971103] systemd[1]: Set hostname to <itscrd02.cern.ch>.
[ 1.631038] tsc: Refined TSC clocksource calibration: 2100.002 MHz
[ 1.631049] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e453273c08, max_idle_ns: 440795283667 ns
[ 1.631068] clocksource: Switched to clocksource tsc
cat /sys/devices/system/clocksource/*/available_clocksource
tsc acpi_pm
See also https://serverfault.com/questions/681748/tsc-missing-from-available-clocksources which describes essentially the same issue.
Not clear why TSC is giving issues on this machine. Maybe it's something to do with the virtual machine infrastructure, eg this Dec 2019 Suse/SAP notice https://www.suse.com/support/kb/doc/?id=000019562 mentions "However on some virtualization and/or cloud solutions with kernel version 4.4.178-94.91 or lower, the "tsc" is not available at the aforementioned locations".
Following https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source I modified /etc/default/grub by adding " clocksource=tsc tsc=reliable" at the end of GRUB_CMDLINE_LINUX. Then I ran "grub2-mkconfig -o /boot/grub2/grub.cfg" and rebooted. (David also suggested 'grubby --update-kernel=ALL --args="tsc=reliable"' as an alternative.)
But this is still not enough:
dmesg | grep -i tsc
[ 0.000000] Command line: BOOT_IMAGE= ... clocksource=tsc tsc=reliable
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2100.055 MHz processor
[ 0.000000] Kernel command line: BOOT_IMAGE= ... clocksource=tsc tsc=reliable
[ 0.001000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e45649026e, max_idle_ns: 440795259094 ns
[ 0.001000] TSC deadline timer enabled
[ 4.019019] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[ 4.019025] clocksource: 'tsc-early' cs_now: 4208573b47b76a cs_last: 42085716e25bb4 mask: ffffffffffffffff
[ 4.019025] tsc: Marking TSC unstable due to clocksource watchdog
I will open a ticket on our VM infrastructure, and probably rebuild the node on another hypervisor...
Maybe 'tsc=reliable' is not working because of this bug in the linux kernel? (Oct 2019)
http://lkml.iu.edu/hypermail/linux/kernel/1910.0/01948.html
"Introduction of clocksource_tsc_early broke functionality of "tsc=reliable" and "tsc=nowatchdog" boot params, since clocksource_tsc_early is always registered with CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list.
If CPU is very busy during boot, the watchdog clocksource may not be read frequently enough, resulting in a large difference that is treated as "negative" by clocksource_delta() and incorrectly truncated to 0.
This false alarm causes TSC to be declared unstable: ..."
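A simplified illustration of the truncation described in that post (this is not the actual kernel code, and the values are hypothetical): a delta whose top bit is set is treated as "time went backwards" and clamped to 0, so a genuinely large forward delta (a watchdog read that came too late during a busy boot) also becomes 0, and the comparison against the other clocksource then looks like a huge skew.

```cpp
// Simplified illustration (not the actual kernel code) of the delta truncation
// described in the LKML post above.
#include <cstdint>
#include <cstdio>

// hypothetical stand-in for the kernel's clocksource_delta()
static uint64_t clocksourceDelta( uint64_t now, uint64_t last, uint64_t mask ) {
  const uint64_t ret = ( now - last ) & mask;
  return ( ret & ~( mask >> 1 ) ) ? 0 : ret; // delta looks "negative" -> truncated to 0
}

int main() {
  const uint64_t mask = 0xffffffffffffffffULL;            // 64-bit counter, as in the dmesg lines
  const uint64_t last = 0x1000;                           // hypothetical last read
  const uint64_t hugeForwardStep = 0x8000000000000000ULL; // top bit set: very late watchdog read
  printf( "normal delta:    %llu\n", (unsigned long long) clocksourceDelta( last + 500, last, mask ) );
  printf( "truncated delta: %llu\n", (unsigned long long) clocksourceDelta( last + hugeForwardStep, last, mask ) );
  return 0;
}
```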
Just for the record, I am now using itscrd70 on CentOS7 and I do not have this issue. But the underlying issue on other (not all) CentOS8 VMs is not solved yet.
I am closing this very old issue.
First, I did find one of our nodes that has no TSC (itscrd-a100 on centos8). Then I have run eemumu through perf, and it does not seem to have any problem with system calls. So I guess this is no longer a problem even when no TSC is found?
[avalassi@itscrd-a100 gcc11.2/cvmfs] ~> cat /etc/redhat-release; cat /sys/devices/system/clocksource/*/available_clocksource
CentOS Stream release 8
acpi_pm
[avalassi@itscrd-a100 gcc11.2/cvmfs] ~> dmesg | grep -i tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 3000.028 MHz processor
[ 0.000000] TSC deadline timer available
[ 0.001000] tsc: Marking TSC unstable due to TSCs unsynchronized
[ 1.689903] systemd[1]: Set hostname to <itscrd-a100.cern.ch>.
[ 3.465207] systemd[1]: Set hostname to <itscrd-a100.cern.ch>.
...
[avalassi@itscrd-a100 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_ll_ll> DEBUGINFOD_PROGRESS=1 perf record -o perf.data -F 1000 ./check.exe -p 256 256 100^C
...
[avalassi@itscrd-a100 gcc11.2/cvmfs] /data/avalassi/GPU2020/madgraph4gpuX/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_ll_ll> perf report --stdio^C
...
16.26% check.exe libm-2.28.so [.] __ieee754_log_fma
14.05% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::FFV2_4_3<KernelAc>
12.97% check.exe check.exe [.] mg5amcCpu::ramboGetMomentaFi>
11.67% check.exe libm-2.28.so [.] __sincos
11.56% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::calculate_wavefun>
11.48% check.exe libcurand.so.10.3.0.86 [.] curandGenerateUniformDouble
5.59% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::FFV1_0<KernelAcce>
5.42% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::FFV2_4_0<KernelAc>
4.13% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::CrossSectionKerne>
2.60% check.exe libmg5amc_ll_ll_cpp.so [.] mg5amcCpu::sigmaKin
1.24% check.exe check.exe [.] mg5amcCpu::RamboSamplingKern>
0.92% check.exe check.exe [.] mg5amcCpu::RamboSamplingKern>
0.37% check.exe libc-2.28.so [.] __memset_avx2_unaligned_erms
Second, I have just installed itscrd80 on Alma8, which will be my new reference node. This has TSC anyway
[avalassi@itscrd80 bash] ~> cat /etc/redhat-release; cat /sys/devices/system/clocksource/*/available_clocksource
AlmaLinux release 8.7 (Stone Smilodon)
tsc acpi_pm
[avalassi@itscrd80 bash] ~> dmesg | grep -i tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2099.919 MHz processor
[ 0.000000] TSC deadline timer available
[ 0.001000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e44e3ef694, max_idle_ns: 440795256857 ns
[ 0.248185] clocksource: Switched to clocksource tsc-early
[ 1.759043] tsc: Refined TSC clocksource calibration: 2099.992 MHz
[ 1.759057] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e45290e7fd, max_idle_ns: 440795303772 ns
[ 1.759089] clocksource: Switched to clocksource tsc
[ 2.827358] systemd[1]: Set hostname to <itscrd80.cern.ch>.
So all in all this can be closed
PS(1) By the way, note that perf now fills up ~/.debug with a cache. I used this trick to move it out of AFS. Note also that perf record seems to hang while populating the buildid cache.
more ~/.perfconfig
# (2023.02.13) avoid filling up ~/.debug on AFS!
# See https://unix.stackexchange.com/questions/669723/what-creates-debug-folder
# See https://perf.wiki.kernel.org/index.php/Tutorial#The_build-id_cache
[buildid]
# Default is ~/.debug, disable using /dev/null
dir = /data/avalassi/PERF_BUILDID
PS(2) A related point: perf also seems to hang because the buildid cache relies on the debuginfod cache (so it populates ~/.cache/debuginfod_client). See https://wiki.debian.org/Debuginfod https://lore.kernel.org/lkml/YbUCjKYiLBgnDqq8@krava/T/ https://fedoraproject.org/wiki/Debuginfod To disable this (the centos8 default is DEBUGINFOD_URLS=https://debuginfod.centos.org):
unset DEBUGINFOD_URLS
To change the cache location
export DEBUGINFOD_CACHE_PATH=/data/avalassi/DEBUGINFOD_CACHE
or even better
export XDG_CACHE_HOME=/data/avalassi/XDG_CACHE
And to display the progress in perf
DEBUGINFOD_PROGRESS=1 perf record
Note: I finally developed some rdtsc-based counters in any case, bypassing std::chrono clocks (#972).
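For completeness, here is a minimal sketch of the kind of rdtsc-based counter meant here (not the actual #972 implementation): read the TSC directly via the __rdtsc intrinsic and convert cycles to seconds with a frequency that, in this sketch, is calibrated once against std::chrono.

```cpp
// Minimal sketch of an rdtsc-based counter (not the actual implementation in #972).
// __rdtsc() reads the CPU time stamp counter directly, bypassing std::chrono and
// clock_gettime in the hot loop; the tick-to-second conversion factor is obtained
// here with a one-off calibration against std::chrono (a hypothetical choice).
#include <x86intrin.h> // __rdtsc (GCC/Clang on x86-64)
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

static double tscTicksPerSecond() {
  const auto t0 = std::chrono::steady_clock::now();
  const uint64_t c0 = __rdtsc();
  std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) ); // short calibration window
  const uint64_t c1 = __rdtsc();
  const auto t1 = std::chrono::steady_clock::now();
  return ( c1 - c0 ) / std::chrono::duration<double>( t1 - t0 ).count();
}

int main() {
  const double ticksPerSec = tscTicksPerSecond();
  const uint64_t start = __rdtsc();
  volatile double x = 0;
  for ( int i = 0; i < 10000000; ++i ) x += i * 1e-9; // some work to time
  const uint64_t stop = __rdtsc();
  printf( "elapsed: %g s (%llu ticks at %.3g ticks/s)\n",
          ( stop - start ) / ticksPerSec, (unsigned long long)( stop - start ), ticksPerSec );
  return 0;
}
```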
I wanted to reproduce a few performance numbers, in case we wanted to include some in the vchep paper. So I went back in time to older commits and tried to reproduce some numbers. In particular I went back to commit https://github.com/madgraph5/madgraph4gpu/commit/715d26ffe830c3c9794419c45e86d0c04f474365 because this is the first one where I added some numbers to the logs.
I was very surprised because I was unable to reproduce the C++ throughput on my itscrd03. Eventually, I understood that this node gives VERY different performance (a factor 3 to 4 slower!) than an lxbatch GPU node (like the one I had used for the performance recorded in the commit log) or my physical node pmpe04.
The CPUs are similar Intel ones. I wondered if there was a problem with memory access or swappiness, but it does not look like it. Eventually I played a bit with flamegraphs and found out that itscrd03 seems to spend 70% of the CPU time in clock_gettime, unlike the other nodes.
This is itscrd03
And this is pmpe04 using the exact same executable (shared on AFS)
Note that the pmpe04 throughput of 3.5E5 is compatible with the 3.0E5 I had recorded on lxbatch (a bit higher, as it is also a physical machine). The throughput on itscrd03, instead, is a factor 3 lower, which is why I started this investigation.
Here's two flamegraphs (svg converted to png for github):
itscrd03 has 70% of clock_gettime
pmpe04 does not have that overhead
This is about old code; however, there are two things that may be useful to understand:
Nothing urgent, but useful to keep in mind...