Open Helflym opened 3 years ago
This is related to a change in the configuration we made. We increased the number of possible threads by CPU (smt) from 4 to 8. It seems that pprof is loosing some samples because of that.
@laboger do you have an equivalent on Linux Power of SMT ? I guess yes, and if this case do you have any problems with 8 threads by CPU ?
Note that I've revert temporary to SMT 4 for the builder, to avoid blocking builds too much.
/cc @cherrymui
Hi. Just a small idea - as I saw in the stats: total 0 CPU profile samples collected
In the past, when using a tool such as hpmstat
that uses processor_type bound stats - changing to POWER8 from POWER7 caused me to review the differences (especially new names) - using pmlist -l
and `pmlist -p POWERX -c -1)
# bash
# bash-4.4# diff -u <(pmlist -p POWER8 -c -1) <(pmlist -p POWER9 -c -1) | wc -l
3573
I'm not sure the Golang runtime/pprof
package is using the same technology than hpmstat
. If I remember correctly, it's using the SIGPROF signal and DWARF information to get where we are.
It was just a guess - but when I saw 0 samples
my first thought was on names/numbers
of stats. Sadly, this is an area where binary compatibility has failed (imho).
It was also an issue with PowerVP getting stats from the VIOS (when it was AIX 6.1 based) and moved from POWERX to POWERX+1.
In short: performance stats being broken is a recurring thing when PROC_TYPE goes up.
@Helflym I built and tested on a Linux P9 LE system with SMT=8 and it works without error. I understand SMT=8 is the default for P9 so most likely that is what the P9 LE builders have.
Ok, we might remove Power8 compat mode and add SMT=8 again. But I just want to be sure that with POWER8 compat mode and SMT=4 it's working too. My local test are ok, but I'm waiting for a few builds made by the farm to be sure.
/cc @pmur
Even with SMT=4 it seems to be broken but only for remote builds. My local builds seems to be passing without any troubles. We will reboot in Power9 mode to see if there is any improvements. Otherwise, any help would be welcomed as I have no idea about what's happening.
@Helfsym From the output it almost seems like the timer interval is not being set correctly or is being set in a way so it results in fewer samples. I see that AIX has a different way to do the setitimer than linux. I don't know if that helps.
@laboger yes but that's normal. AIX syscalls are a bit special as we must pass through libc calls.
Anyway, we have move back to Power9 and the tests are still failing.
Moreover, it looks like the problem only appears when several tests are run in parallel. I can't manage to make TestMathBigDivide
failing when launching it alone unlike when it's run as part of the whole runtime/pprof
tests. Is the sample area used by the runtime common to all tests?
I've been able to investigate a bit more what's going on.
All of the SIGPROF are being delivered correctly, but they appear to be deliver to a sleeping thread. The pc
passed to sigprof
is part of nsleep
syscall. In sigprof
, they ends up in https://github.com/golang/go/blob/master/src/runtime/proc.go#L4444.
This happens a lot of more when the VM isn't busy doing something else. When I've another build of Go running in parallel, most of the tests are passing (I don't want say "every tests" but it's very unlikely to have a test failing when the VM is busy.)
Moreover, looking at the syscalls being launched, I've a lot (like 1/3, 1/2 of the whole syscall for one test) being osyield
. These calls might be behind the nsleep
caught in sigprof
.
I don't know what it's calling them, as it's a bit difficult to reproduce the same execution queue inside gdb. It looks like some are coming for runtime.suspendG
in runtime.markroot
. Linux/ppc64 have a few of them but not as much as AIX, for what I've checked.
I'll continue to investigate a bit. But if I'm finding nothing relevant, I'll push a temporary patch skipping runtime/pprof
tests to release the builder.
@Helflym I don't know if this could be relevant, but have you looked at the implementation of procyield in runtime/asm_ppc64x.s? Carlos put that in before AIX was added to Go. Maybe those settings are affecting AIX differently on P9.
@laboger yeap, I've seen that and it seems to be common to everyone. But I've asked for a confirmation if this is indeed working in AIX too.
Change https://golang.org/cl/306489 mentions this issue: runtime/pprof: skip tests for AIX
Change https://golang.org/cl/317369 mentions this issue: [release-branch.go1.15] runtime/pprof: skip tests for AIX
Change https://golang.org/cl/317297 mentions this issue: [release-branch.go1.16] runtime/pprof: skip tests for AIX
We recently moved the AIX VM into a Power9 server, in order to allow P9 builds in a near future. For now, it's still using P8 instructions only, thus nothing should have change for now.
However,
runtime/pprof
tests start failing: https://build.golang.org/log/aa79ac6509bea68e2e34cb50ac83b454544bdcd6 https://build.golang.org/log/9b24ed78ebd9a7e3bc3fcecb4e7504b6865b545b ...I have to investigate them.
@laboger Did you already add this kind of bug or it's AIX related ?