golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.07k stars 17.68k forks source link

runtime/pprof: tests start failing since migration of AIX builder #45170

Open Helflym opened 3 years ago

Helflym commented 3 years ago

We recently moved the AIX VM into a Power9 server, in order to allow P9 builds in a near future. For now, it's still using P8 instructions only, thus nothing should have change for now.

However, runtime/pprof tests start failing: https://build.golang.org/log/aa79ac6509bea68e2e34cb50ac83b454544bdcd6 https://build.golang.org/log/9b24ed78ebd9a7e3bc3fcecb4e7504b6865b545b ...

I have to investigate them.

@laboger Did you already add this kind of bug or it's AIX related ?

Helflym commented 3 years ago

This is related to a change in the configuration we made. We increased the number of possible threads by CPU (smt) from 4 to 8. It seems that pprof is loosing some samples because of that.

@laboger do you have an equivalent on Linux Power of SMT ? I guess yes, and if this case do you have any problems with 8 threads by CPU ?

Note that I've revert temporary to SMT 4 for the builder, to avoid blocking builds too much.

cagedmantis commented 3 years ago

/cc @cherrymui

aixtools commented 3 years ago

Hi. Just a small idea - as I saw in the stats: total 0 CPU profile samples collected

In the past, when using a tool such as hpmstat that uses processor_type bound stats - changing to POWER8 from POWER7 caused me to review the differences (especially new names) - using pmlist -l and `pmlist -p POWERX -c -1)

# bash
# bash-4.4# diff -u <(pmlist -p POWER8 -c -1) <(pmlist -p POWER9 -c -1) | wc -l
3573
Helflym commented 3 years ago

I'm not sure the Golang runtime/pprof package is using the same technology than hpmstat. If I remember correctly, it's using the SIGPROF signal and DWARF information to get where we are.

aixtools commented 3 years ago

It was just a guess - but when I saw 0 samples my first thought was on names/numbers of stats. Sadly, this is an area where binary compatibility has failed (imho).

It was also an issue with PowerVP getting stats from the VIOS (when it was AIX 6.1 based) and moved from POWERX to POWERX+1.

In short: performance stats being broken is a recurring thing when PROC_TYPE goes up.

laboger commented 3 years ago

@Helflym I built and tested on a Linux P9 LE system with SMT=8 and it works without error. I understand SMT=8 is the default for P9 so most likely that is what the P9 LE builders have.

Helflym commented 3 years ago

Ok, we might remove Power8 compat mode and add SMT=8 again. But I just want to be sure that with POWER8 compat mode and SMT=4 it's working too. My local test are ok, but I'm waiting for a few builds made by the farm to be sure.

laboger commented 3 years ago

/cc @pmur

Helflym commented 3 years ago

Even with SMT=4 it seems to be broken but only for remote builds. My local builds seems to be passing without any troubles. We will reboot in Power9 mode to see if there is any improvements. Otherwise, any help would be welcomed as I have no idea about what's happening.

laboger commented 3 years ago

@Helfsym From the output it almost seems like the timer interval is not being set correctly or is being set in a way so it results in fewer samples. I see that AIX has a different way to do the setitimer than linux. I don't know if that helps.

Helflym commented 3 years ago

@laboger yes but that's normal. AIX syscalls are a bit special as we must pass through libc calls.

Anyway, we have move back to Power9 and the tests are still failing. Moreover, it looks like the problem only appears when several tests are run in parallel. I can't manage to make TestMathBigDivide failing when launching it alone unlike when it's run as part of the whole runtime/pprof tests. Is the sample area used by the runtime common to all tests?

Helflym commented 3 years ago

I've been able to investigate a bit more what's going on. All of the SIGPROF are being delivered correctly, but they appear to be deliver to a sleeping thread. The pc passed to sigprof is part of nsleep syscall. In sigprof, they ends up in https://github.com/golang/go/blob/master/src/runtime/proc.go#L4444. This happens a lot of more when the VM isn't busy doing something else. When I've another build of Go running in parallel, most of the tests are passing (I don't want say "every tests" but it's very unlikely to have a test failing when the VM is busy.)

Moreover, looking at the syscalls being launched, I've a lot (like 1/3, 1/2 of the whole syscall for one test) being osyield. These calls might be behind the nsleep caught in sigprof. I don't know what it's calling them, as it's a bit difficult to reproduce the same execution queue inside gdb. It looks like some are coming for runtime.suspendG in runtime.markroot. Linux/ppc64 have a few of them but not as much as AIX, for what I've checked.

I'll continue to investigate a bit. But if I'm finding nothing relevant, I'll push a temporary patch skipping runtime/pprof tests to release the builder.

laboger commented 3 years ago

@Helflym I don't know if this could be relevant, but have you looked at the implementation of procyield in runtime/asm_ppc64x.s? Carlos put that in before AIX was added to Go. Maybe those settings are affecting AIX differently on P9.

Helflym commented 3 years ago

@laboger yeap, I've seen that and it seems to be common to everyone. But I've asked for a confirmation if this is indeed working in AIX too.

gopherbot commented 3 years ago

Change https://golang.org/cl/306489 mentions this issue: runtime/pprof: skip tests for AIX

gopherbot commented 3 years ago

Change https://golang.org/cl/317369 mentions this issue: [release-branch.go1.15] runtime/pprof: skip tests for AIX

gopherbot commented 3 years ago

Change https://golang.org/cl/317297 mentions this issue: [release-branch.go1.16] runtime/pprof: skip tests for AIX