golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: Getting interrupts on isolated cores (caused by simple Go app) during low-latency work #60735

Open p-zak opened 1 year ago

p-zak commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20.5 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.5"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1121239308=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Details below.

What did you expect to see?

I expect to see no interrupts on the isolated cores; I want to be able to run Go-based applications in environments where low-latency workloads are executed.

What did you see instead?

Interrupts on the isolated cores on which the low-latency work is performed. The interrupts are caused by a simple Go application (printing "Hello world" every second) that runs on different, non-isolated cores.

Summary

LOC, IWI, RES, and CAL interrupts are observed on the isolated cores on which the low-latency benchmark is performed. The interrupts are caused by a simple Go application (printing "Hello world" every second) that runs on different, non-isolated cores. A similar Python application doesn't cause such problems.

Tested on Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)").

Hardware: 2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)

BIOS: Hyperthreading disabled

OS and configuration: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)" from https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/log/?h=lowlatency-next)

irqbalance stopped and disabled:

$ systemctl stop irqbalance.service
$ systemctl disable irqbalance.service

Based on the workload type, experiments, and knowledge found on the Internet, the following kernel parameters were used:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-68-lowlatency root=UUID=5c9c2ea3-e0c6-4dd8-ae70-57e0c0af20d3 ro ro rhgb quiet ipv6.disable=1 audit=0 selinux=0 hugepages=256 hugepagesz=1G intel_iommu=on iommu=pt nmi_watchdog=0 mce=off tsc=reliable nosoftlockup hpet=disable skew_tick=1 acpi_pad.disable=1 nowatchdog nomce numa_balancing=disable irqaffinity=0 rcu_nocb_poll processor.max_cstate=0 clocksource=tsc nosmt nohz=on nohz_full=20-23 rcu_nocbs=20-23 isolcpus=nohz,domain,managed_irq,20-23
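
For reference, one quick way to confirm that the isolation-related parameters took effect after boot is to read the CPU lists the kernel exposes in sysfs:

$ cat /sys/devices/system/cpu/isolated    # should report 20-23
$ cat /sys/devices/system/cpu/nohz_full   # should report 20-23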

For every core/socket, Cx states (for x > 0) were disabled, a particular power governor was used, and fixed uncore frequencies were set. To achieve that, the power.py script from https://github.com/intel/CommsPowerManagement was used.

prepare_cpus.sh script for setting this up and results
#!/bin/bash

#Disable all possible Cx states (x > 0)
./power.py -d C1_ACPI -r 0-63
./power.py -d C2_ACPI -r 0-63
./power.py -d C1 -r 0-63
./power.py -d C1E -r 0-63
./power.py -d C6 -r 0-63

#Powersave governor
./power.py -g powersave

#Set uncore min and max freq to 1100
./power.py -u 1100
./power.py -U 1100

sleep 3

#Show current status
./power.py -l
![cpu_prepared](https://github.com/golang/go/assets/66442258/073a9af7-27f5-45f4-944b-d14f159ffdf5)

CPUs 20-23 are "isolated" (thanks to proper kernel parameters) - benchmark/workload will be run on them.

Kernel threads were moved away from CPUs 20-23:

$ cat /sys/devices/virtual/workqueue/cpumask
ffffffff,ff0fffff
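
For reference, the mask above can be produced roughly like this (bits 20-23 cleared; the exact value is specific to this 64-CPU machine and writing it requires root):

$ echo ffffffff,ff0fffff > /sys/devices/virtual/workqueue/cpumask
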
get_irqs.sh script, which checks which target CPUs are permitted for each IRQ source
#!/bin/bash

for I in $(ls /proc/irq)
do
    if [[ -d "/proc/irq/$I" ]]
    then
        echo -n "$I:"
        cat /proc/irq/$I/smp_affinity_list
    fi
done
Output of the script above:
0:0
1:0
10:0
11:0
12:0
124:0
13:0
133:0
134:0
135:0
136:0
137:0
138:0
14:0
15:0
16:0
2:0
203:0
212:0
24:0
25:0
26:0
27:0
277:0
278:0
279:0
28:0
280:1
281:2
282:3
283:4
284:5
285:6
286:7
287:8
288:9
289:10
29:0
290:11
291:12
292:13
293:14
294:15
295:16
296:17
297:18
298:19
299:20
3:0
30:0
300:21
301:22
302:23
303:24
304:25
305:26
306:27
307:28
308:29
309:30
31:0
310:31
311:32
312:33
313:34
314:35
315:36
316:37
317:38
318:39
319:40
32:0
320:41
321:42
322:43
323:44
324:45
325:46
326:47
327:48
328:49
329:50
33:0
330:51
331:52
332:53
333:54
334:55
335:56
336:57
337:58
338:59
339:60
34:0
340:61
341:62
342:63
343:0
344:0
345:0
347:0
348:0
349:0
35:0
350:0
351:0
352:0
353:0
354:0
355:0
356:0
357:0
358:0
359:0
36:0
360:0
361:0
362:0
363:0
364:0
365:0
37:0
38:0
39:0
4:0
40:0
41:0
42:0
43:0
44:0
45:0
46:0
47:0
48:0
49:0
5:0
50:0
51:0
52:0
53:0
54:0
55:0
56:0
57:0
58:0
59:0
6:0
7:0
8:0
9:0
lscpu output:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6438N
    CPU family:          6
    Model:               143
    Thread(s) per core:  1
    Core(s) per socket:  32
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3600.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pn
                         i pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd 
                         mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsav
                         ec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
                          avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   3 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    128 MiB (64 instances)
  L3:                    120 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

JITTER tool - Baseline

jitter is a benchmarking tool meant for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.

git clone https://github.com/FDio/archived-pma_tools
cd archived-pma_tools/jitter
Put "run_jitter.sh" script inside above directory.
#!/bin/bash

printf "Start time:\n"
date
#Print current state of interrupts (for: cpu20, cpu21, cpu22, cpu23, cpu24), discard interrupts with only zeros (or empty)
printf "\nInterrupts before:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ":  "

printf "\nRun jitter... "
#Start on cpu21, benchmark thread will be run on isolated cpu20
taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500 > jit.log

printf "finished!\n"

printf "\nInterrupts after:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ":  "

printf "\nEnd time:\n"
date

printf "\nResults:\n"
cat jit.log

printf "\n"

Run:

make
./run_jitter.sh

Results:

jitter_base

Comment: the jitter tool reports intervals and jitter in CPU core cycles. The benchmark is done on a 2000 MHz core, so on the graph the values are divided by 2 and presented in nanoseconds. Very stable results, no significant jitter (max jitter: 51 ns) during 335 seconds. No interrupts were delivered to isolated CPU20 during the benchmark.

JITTER tool - Python

hello.py - simple Python app which prints "Hello world" every 1 second
import time

while True:
   print("Hello world")
   time.sleep(1)
run_python_hello.sh - script to run python app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 python3 hello.py
$ python3 --version
Python 3.10.6

In the first console ./run_python_hello.sh was started; in the second console ./run_jitter.sh was run.

Results:

jitter_python

Comment: An acceptable result; one noticeable jitter (1190 ns), the remaining jitters did not exceed 60 ns during 336 seconds. No interrupts were delivered to isolated CPU20 during the benchmark.

JITTER tool - Golang

hello.go - simple Golang app which prints "Hello world" every 1 second
package main

import (
    "fmt"
    "time"
)

func main() {
    for {
        fmt.Println("Hello world")
        time.Sleep(1 * time.Second)
    }
}

go.mod - go module definition
module hello

go 1.20

run_go_hello.sh - script to run Go app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello
$ go version
go version go1.20.5 linux/amd64

In the first console the Go app was built (go build) and started (./run_go_hello.sh); in the second console ./run_jitter.sh was run.

Results:

jitter_go

Comment: 34 significant jitters (the worst: 44961 ns) during 335 seconds. The following interrupts were delivered to isolated CPU20 during the benchmark: LOC: 67, IWI: 34, RES: 34, CAL: 34

It seems that a jitter occurs roughly every ~10 s.

What is also interesting is that no interrupts were delivered to the idle and isolated CPU22 and CPU23 during the benchmark. On CPU24 (not isolated), only LOC interrupts occurred (335283 of them).

Notes:

  1. Instead of static isolation (using kernel parameters) I also tried cpuset with its shield turned on (see the example command after this list). Unfortunately, the results were even worse (jitters were "bigger" and more interrupts were delivered to the shielded cores); moreover, cset was not able to move kernel threads outside of the shielded pool.
  2. I also checked it on a realtime kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.15.65.tar.gz patched with https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/older/patch-5.15.65-rt49.patch.gz), and the problem with interrupts and jitter caused by the Go app doesn't exist there. However, the RT kernel is not the best solution for everyone, and it would be great not to have jitter on the lowlatency tickless kernel either.
  3. I also did a lot of experiments with different kernel parameters; this combination seemed to be the best (however, maybe I missed something).
  4. The same situation occurs with the Go app built using Go 1.19.x and 1.20.2.
  5. I'm aware that this kind of benchmark should be executed for hours, but for now these results are pretty meaningful.
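
The cpuset shield mentioned in note 1 was set up roughly along these lines (a sketch; exact flags may differ between cset versions):

$ cset shield -c 20-23 -k on
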
seankhliao commented 1 year ago

cc @golang/runtime

prattmic commented 1 year ago

Thanks for the report. Since there is a lot of information, here is my paraphrased understanding. Please let me know if I misunderstood anything.

p-zak commented 1 year ago

@prattmic Thanks for quick response!

  • jitter is a C program that looks for jitter in execution time of a tight loop (due to OS interference, e.g.).

Yes! Details from https://github.com/FDio/archived-pma_tools/blob/master/jitter/jitter.c#L21:

 *   The program does the following:
 *      - Calls a dummy function which consists of a small piece of code block being executed in a loop. The loop count is user defined (default 80,000).
 *      - Measures the Execution time for this function in Core cycles
 *      - Compares the Execution time with Instantaneous Minimum and Maximum execution time values tracked during the display update interval. These values are adjusted based on the comparison.
 *      - Compares the execution time with Absolute Minimum and maximum execution values since the program started. These values are adjusted based on the comparison.
 *      - Repeats above operation.
  • The jitter program is running on dedicated cores (no other threads allowed).

Yes, and isolated.

  • A Go program is running on other, non-dedicated cores.

Exactly.

  • The jitter program measures elevated jitter, including additional CPU interrupts, compared to an otherwise idle system and compared to a Python program running on non-dedicated cores.

That's right. The Python scenario is for reference. During the tests with the Go app, the Python app was of course not running.

prattmic commented 1 year ago

This sounds like a limitation in the CPU isolation functionality in the Linux kernel, not a Go issue.

If the Go program is isolated away from a CPU, then its behaviors should presumably not affect that CPU, but it seems like it is. With regards to the interrupts, some possibilities I could imagine:

p-zak commented 1 year ago

This sounds like a limitation in the CPU isolation functionality in the Linux kernel, not a Go issue.

If the Go program is isolated away from a CPU, then its behaviors should presumably not affect that CPU, but it seems like it is.

@prattmic Generally, I agree. However, please look closely at Python scenario :)

As you can see in run_jitter.sh, I run the tool using: taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500. This means that initially the thread affinity is set to (isolated) CPU21, but -c 20 changes it to (isolated) CPU20 inside the jitter tool (I can run it using taskset -c 20 ./jitter -c 20 and the results will be the same for every scenario). The main point is that jitter uses only isolated cores.

Both the Python and the Go app are run with: taskset -a -c 8. This means that all threads of the given command should run on CPU8 (non-isolated).

jitter runs for about 335 s, and I collect interrupt counts from /proc/interrupts just before jitter starts and just after it ends, for isolated CPU20-23 and non-isolated CPU24 (for reference). Only these kinds of interrupts appear for these cores in all scenarios: LOC, IWI, RES, CAL, TLB (I believe the last one is caused by jitter itself when it ends). No other types appear for these cores.

So here is the difference in these interrupts for "BASELINE" (only jitter running): image

Here is for jitter with Python app: image

And here is for jitter with Go app: image

We can see that the Python app doesn't influence the isolated cores at all, while the Go equivalent unfortunately does. As a Go engineer, I've been trying to find out how to write applications in Go that have as little impact on the isolated cores as the Python one ;)

What and where should be tuned/configured/changed/fixed to achieve that?

prattmic commented 1 year ago

I acknowledge that there is a difference between your Python and Go cases, but the Python case is also simpler. e.g., with CPython 3.11 on my machine, that program is single-threaded. On the other hand, Go programs are always multi-threaded, even with only one goroutine. So there is more interesting surface in the Go case.
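
For example, a rough way to see that difference is to count the OS threads of each process while it is running (the process-matching patterns here are illustrative):

$ ls /proc/$(pgrep -f hello.py)/task | wc -l    # CPython: typically 1 thread
$ ls /proc/$(pgrep -x hello)/task | wc -l       # Go binary: several runtime threads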

Given the connection with isolated CPUs (a kernel feature) and interrupts (which Go has no direct control over), I think this issue would be better served on a Linux kernel mailing list / issue tracker, where there are more experts in these areas. If there are concrete things we are doing wrong that makes us a bad neighbor, then we can consider whether they should change.

mknyszek commented 1 year ago

+1 to what Michael said, and to add on, here are a few things you can try to attempt to isolate the issue:

However, given that you're seeing jitter on the order of 10 seconds, I don't think there's any obvious thing (to us, anyway) the Go runtime is doing that would cause the jitter. Honestly, I don't really expect any of the 3 things above to have an effect, but it's something to rule out.

Beyond that, I'm not sure we can provide more help here at this time.

p-zak commented 1 year ago

@mknyszek

  • Buffer the print to stdout (os.Stdout and os.Stderr are not buffered by default, IIRC they are with Python.)

Even without printing anything (just a for loop with a 1 s sleep) I get the same results.

  • Run the Go program with GOGC=off (given that you're only printing "hello world" every second, I doubt you'll actually run out of memory in any reasonable amount of time... however I also doubt you're triggering the GC at all).

Forgot to mention that I've tried that already. Also tried GOMAXPROCS=1. No success.

  • Run the Go program with GODEBUG=asyncpreemptoff=1.

This looked promising, but unfortunately it didn't change anything :/
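
For reference, these attempts were run roughly along these lines (combined here for brevity):

$ GOGC=off GOMAXPROCS=1 GODEBUG=asyncpreemptoff=1 taskset -a -c 8 ./hello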

andig commented 1 year ago

Is the Python program even a good comparison regarding a multithreaded program given the GIL?

p-zak commented 1 year ago

Is the Python program even a good comparison regarding a multithreaded program given the GIL?

@andig From a low-level perspective (e.g., the GIL) such a comparison may not be fair, I agree.

However, it makes sense for someone who wants to choose a language for an application based on how good or bad a neighbor it will be in the OS.

I did two more tests...

JITTER tool - RUST

hello.rs - simple RUST app which prints "Hello world" every 1 second
use std::{thread, time::Duration};

fn main() {
    loop {
        println!("Hello world");
        thread::sleep(Duration::from_secs(1));
    }
}
run_rust_hello.sh - script to run RUST app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello
$ rustc --version
rustc 1.70.0 (90c541806 2023-05-31)

In the first console the Rust app was built (rustc hello.rs) and started (./run_rust_hello.sh); in the second console ./run_jitter.sh was run.

Results:

Comment: 1 significant jitter (62154 ns); the remaining jitters did not exceed 106 ns during 335 seconds. The following interrupts were delivered during the benchmark: int_rust

JITTER tool - C++

hello.cpp - simple C++ app which prints "Hello world" every 1 second
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    while (true) {
        std::cout << "Hello world" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
run_cpp_hello.sh - script to run C++ app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello

$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

In the first console the C++ app was built (g++ hello.cpp -o hello) and started (./run_cpp_hello.sh); in the second console ./run_jitter.sh was run.

Results:

Comment: 1 significant jitter (70972 ns); the remaining jitters did not exceed 59 ns during 335 seconds. The following interrupts were delivered during the benchmark: int_cpp

p-zak commented 1 year ago

@mknyszek @andig @prattmic

Originally, I used go1.20.5 linux/amd64 to build the simple Go app for this issue.

Later, I also built it using:

and results were the same.

But look at the results I got when I built this app using go1.21.1 linux/amd64 (with go 1.21 in go.mod):

jitter_go 1 21 1

Comment: 0 significant jitters (the biggest jitter did not exceed 50 ns) during 335 seconds. The following interrupts were delivered during the benchmark:

int_go 1 21 1

Notes:

It would be great to find which commit from https://github.com/golang/go/issues?q=milestone%3AGo1.21.1+label%3ACherryPickApproved actually fixed this problem, and the root cause of why it did. Can you help me with that?

prattmic commented 1 year ago

IIRC, nothing in 1.21.1 intentionally affected scheduling. My best guess would be https://github.com/golang/go/issues/62329. To be sure, you should bisect on branch release-branch.go1.21 (the individual releases are tagged).
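
A minimal sketch of that bisect workflow (paths are illustrative; it assumes a bootstrap Go toolchain is already installed for make.bash):

$ git clone https://go.googlesource.com/go go-bisect && cd go-bisect
$ git bisect start go1.21.1 go1.21.0      # bad tag first, then good tag
$ (cd src && ./make.bash)                 # build the toolchain at the candidate commit
$ ./bin/go build -o /tmp/hello /path/to/hello.go
$ # run the jitter benchmark against /tmp/hello, then mark the commit:
$ git bisect good                         # or: git bisect bad; rebuild and repeat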

p-zak commented 1 year ago

@prattmic Right, https://github.com/golang/go/issues/62329 (runtime: MADV_HUGEPAGE causes stalls when allocating memory) was also my best guess :)

These are the commits between tag go1.21.0 and tag go1.21.1 on the release-branch.go1.21 branch:

2c1e5b05fe (tag: go1.21.1) [release-branch.go1.21] go1.21.1
bbd043ff0d [release-branch.go1.21] html/template: properly handle special tags within the script context
b0e1d3ea26 [release-branch.go1.21] html/template: support HTML-like comments in script contexts
d25a935574 [release-branch.go1.21] cmd/go: reject toolchain directives containing path separators
e3ba569c78 [release-branch.go1.21] net/http: revert "support streaming POST content in wasm"
8dc6ad1c61 [release-branch.go1.21] runtime: restore caller's frame pointer when recovering from panic
06df3292a8 [release-branch.go1.21] runtime: avoid MADV_HUGEPAGE for heap memory
b120517ffd [release-branch.go1.21] go/types, types2: remove order dependency in inference involving channels
0a9582163c [release-branch.go1.21] cmd/go: retry ETXTBSY errors when running test binaries
91a4e74b98 [release-branch.go1.21] crypto/tls: QUIC: fix panics when processing post-handshake messages
6385a6fb18 [release-branch.go1.21] cmd/go: find GOROOT using os.Executable when installed to GOROOT/bin/GOOS_GOARCH
2d07bb86f0 [release-branch.go1.21] encoding/xml: overriding by empty namespace when no new name declaration
745b81b6e6 [release-branch.go1.21] encoding/gob: prevent panic from index out of range in Decoder.typeString
13339c75b8 [release-branch.go1.21] runtime: fix maps.Clone bug when cloning a map mid-grow
2977709875 [release-branch.go1.21] context: fix synchronization in ExampleAfterFunc_cond
2d4746f37b [release-branch.go1.21] go/types, types2: disable interface inference for versions before Go 1.21
2b8026f025 [release-branch.go1.21] cmd/compile: in expandCalls, move all arg marshalling into call block
7c97cc7d97 [release-branch.go1.21] Revert "os: use handle based APIs to read directories on windows"
cb6ea94996 [release-branch.go1.21] Revert "cmd/compile: omit redundant sign/unsign extension on arm64"
45b98bfb79 [release-branch.go1.21] path/filepath: don't drop .. elements when cleaning invalid Windows paths
bac083a584 [release-branch.go1.21] cmd/link: don't mangle string symbol names
70aa116c4a [release-branch.go1.21] runtime/internal/wasitest: skip racy TCP echo test
31c5a236bc [release-branch.go1.21] runtime: mark traceEnabled and traceShuttingDown as no:split
25ec110e51 [release-branch.go1.21] cmd/compile: ensure empty blocks in write barriers are marked unpreemptible
6634ce2f41 [release-branch.go1.21] runtime: profiling on Darwin cannot use blocking reads
25c6dce188 [release-branch.go1.21] cmd/compile: make backingArrayPtrLen to return typecheck-ed nodes
4e34f2e81d [release-branch.go1.21] go/types, types2: don't panic during interface completion
d91843ff67 [release-branch.go1.21] go/types, types2: use correct parameter list when checking argument passing
7437db1085 [release-branch.go1.21] go/types, types2: use exact unification when comparing interface methods
ed527ecfb2 [release-branch.go1.21] cmd/api: rename api.go to main_test.go
b78e8cc145 [release-branch.go1.21] crypto/tls: add GODEBUG to control max RSA key size
3475e6af4c [release-branch.go1.21] cmd/go: fix missing case checking for empty slice
179821c9e1 [release-branch.go1.21] net/http: permit requests with invalid Host headers
9398951479 [release-branch.go1.21] cmd/distpack: include directory entries in tar files
75d8be5fb4 [release-branch.go1.21] cmd/go/internal/web: release the net token when an HTTP request fails due to CheckRedirect
1755d14559 [release-branch.go1.21] cmd/compile: fix missing init nodes for len(string([]byte)) optimization
c19c4c566c (tag: go1.21.0) [release-branch.go1.21] go1.21.0

So I compiled Go from source at this commit: 06df3292a8 [release-branch.go1.21] runtime: avoid MADV_HUGEPAGE for heap memory, and compiled my simple app using that toolchain. Results: no jitters and no interrupts were observed.

After that I compiled Go from source at the commit just before it: b120517ffd [release-branch.go1.21] go/types, types2: remove order dependency in inference involving channels, and compiled my simple app using that toolchain. Results: jitters and interrupts were observed.

So we can assume that a side effect of fixing https://github.com/golang/go/issues/62329 (original issue: https://github.com/golang/go/issues/61718 created by @dominikh) is that the problem described in this issue was also fixed.

Here is commit: https://github.com/golang/go/commit/06df3292a88929cde5fb054ae8a84e26a625e603

And I'm still very curious what the root cause of the problem described in this issue was.

mknyszek commented 1 year ago

And I'm still very curious what was the root cause of problem described in this PR.

It's a long story. https://go.dev/doc/gc-guide#Linux_transparent_huge_pages has some background, but from there I think most of the detail is all in https://github.com/golang/go/issues/61718. Going all the way back, this started with https://bugzilla.kernel.org/show_bug.cgi?id=93111, which is why the runtime (for releases prior to Go 1.21.0) was calling madvise(MADV_HUGEPAGE) and madvise(MADV_NOHUGEPAGE) at all.

Coming back to the current issue, @prattmic theorizes that forcing huge pages on and off via madvise(MADV_HUGEPAGE) and madvise(MADV_NOHUGEPAGE) results in TLB shootdowns down the line that end up causing other programs to stall when they access memory and take a very slow path through the kernel and hardware.

The TL;DR is that Go 1.21.1 doesn't make these calls anymore. There are a handful of cases where they're still used, but the maximum number of calls to these syscalls in any given program is constant and small.
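
If you want to double-check this on your own binaries, one way is to trace the madvise calls the runtime makes, for example:

$ strace -f -e trace=madvise -o /tmp/madvise.log ./hello
$ grep -c -E 'MADV_(NO)?HUGEPAGE' /tmp/madvise.log   # non-zero with affected versions, expected (near) zero with go1.21.1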

p-zak commented 1 year ago

@mknyszek Thank you very much for the detailed response! I have one more question.

For applications built with Go 1.21.1, MADV_COLLAPSE works on Linux 6.1 and higher, right? So in my case (Linux 5.15), madvise most likely doesn't do anything and returns an error, which is intentionally not handled in the Go runtime. Am I correct? image

I'm wondering (and I don't have a way to check it right now) what impact the correct execution of the madvise(v, n, _MADV_COLLAPSE) operation for Linux 6.1+ may have on my workloads, their latency, and interrupts.