Open erifan opened 2 years ago
I have a few minor questions:
- What is the output format of these profiles? Still pprof, I assume (rather than perf.data)? In that case, are there special considerations we need to take for particular perf events that may not map nicely to the existing pprof format, or should everything fit nicely?
Yes, still pprof. I didn't change the output format, just replace the original signal source (os-timer) with the PMU counter. The signal handler is also basically unchanged. So pprof is fitting well. The only change is to change the original sample type from nanoseconds
to count
. I'd like to convert count
to nanoseconds
because sometimes count
doesn't seem so intuitive, but there doesn't seem to be a proper way to do this conversion.
- What is the plan for non-Linux OSes? proposal: runtime/pprof: add PMU-based profiles #36821 (comment) mentions that Darwin and Windows don't have clear PMU interfaces. Is PerfProfile only exported on Linux? Or Start returns an error on non-Linux?
Yes, my plan is to only support linux, if we find a appropriate method to support PMU on Windows and Darwin in the future, we will add support for them then. It may be appropriate to panic in Start before it is implemented.
Would it be possible to measure multiple events simultaneously by creating a PerfProfile
for every event of interest? Alternatively, by giving PerfProfile
a method like func (*PerfProfile) MeasureEvent(name string, period int)
that can be called multiple times, and having all events go in the same profile?
Would it be possible to measure multiple events simultaneously by creating a
PerfProfile
for every event of interest? Alternatively, by givingPerfProfile
a method likefunc (*PerfProfile) MeasureEvent(name string, period int)
that can be called multiple times, and having all events go in the same profile?
My prototype implementation doesn't account for multiple events, but it's not hard to support multiple events. I can do it however I'm not a proposal approver. Thanks.
What is the plan for non-Linux OSes?
I did some research today on other OSes today to get a sense of the capabilities.
The Windows Performance Recorder CLI supports collecting some PMU events. There is an API to control profile collection, though I believe this communicates with a centralized service that actually does profile collection. As far as I can tell, there is no direct kernel interface. Additionally, it is not possible to limit collection to a specific process as far as I can tell.
Unanswered questions:
Open source libraries like PCM use a custom Windows kernel driver to access PMUs.
Visual Studio has a command capable of collecting PMU events for a specific process. It is unclear to me how this works. Does it use the WPR API?
My takeaway is that there isn't really anything feasible for us to programmatically integrate with, unless I've missed something.
Xcode Instruments supports collecting counts from PMU events. How it does so is unclear.
PCM requires a custom kernel driver, like on Windows.
The Darwin kernel has a PMC subsystem called kpc
. It looks like this is exposed to userspace, but is only usable by one "blessed" process?
I suspect that there is nothing feasible here either, but I am less confident than with Windows.
FreeBSD has a complete libpmc
library for accessing hardware performance counters in counting or sampling mode (includes support for sending SIGPROF). Internally this uses system calls to a kernel module. I haven't looked into this in much detail, but clearly there is something feasible here.
Thanks @prattmic for the research, I don't know much about other OSes, I'll look into how to access the PMU programmatically on other OS's. I wonder if you guys @golang/windows can provide some hints for Windows?
Also I wonder if it is really difficult to access PMU on a OS, can we not support it on this OS? Just like per thread OS timer, we also only support it on Linux.
I should have added more context to my previous comment. The purpose of my research was to get a sense of how hard we should try to design a platform-independent API (even if only Linux is supported initially). My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.
In this proposal, you are proposing a new type PerfProfile
. However, your original CL introduced the previously approved (#42502) CPUProfile
type and extended it with perf event support.
I actually like that approach better than adding a new top-level perf-specific profile type for a few reasons:
cycles
event would be a CPU cycles profile. Similar with instructions
, branch mispredictions
. Even events like L1D cache misses are attributed to specific instructions, and thus I think could be reasonably considered CPU profiles.
StartCPUProfile
/CPUProfile
/-cpuprofile
if the PMU is available. I think this conceptually make the most sense if the PMU options are directly in CPUProfile
.
CPUProfile
could just provide options for timer
(ITIMER_PROF), cycles
(PMU), cycles-precise
(PMU+PEBS equivalent) and any more advanced usage is in another profile type.I also think that using Set methods rather than exported fields provides more flexibility on the struct internals plus the ability to return errors, though I don’t feel too strongly about this as it does make the API below more verbose. One such possible API:
type CPUProfile struct { ... }
func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error
type CPUProfileSource int
const (
CPUTimer CPUProfileSource = iota
CPUCycles
CPUCyclesPrecise
Custom // Configure via Sys().
)
// Returns *LinuxCPUProfile on Linux.
func (*CPUProfile) Sys() any
// LinuxCPUProfile contains Linux perf specific advanced configuration options.
type LinuxCPUProfile struct{ ... }
// Use one of the generic PERF_TYPE_HARDWARE events.
func (*LinuxCPUProfile) SetHardwareSource(LinuxCPUProfileHardwareSource) error
// Use a raw hardware event (PERF_TYPE_RAW). Format of config, config1, and config2 are microarchitecture dependent.
func (*LinuxCPUProfile) SetRawSource(config, config1, config2 uint64) error
// Additional options. e.g., setting the “precise” flag.
The idea of hiding OS-dependent details behind a Sys()
method comes from os.ProcessState.Sys()
and os/exec.Cmd.SysProcAttr
. I’m not the biggest fan of this pattern, but it is nice to be able to provide advanced options without needing to an OS-independent API for every feature, especially given https://github.com/golang/go/issues/53286#issuecomment-1194696910 indicating that we are unlikely to support Windows or Darwin anyways.
Another big unanswered question is how events should be described. The proposed PerfProfile
API says Event string // perf event names, such as cycles, instructions
. Are these names as described in the documentation for perf record -e
and shown in perf list
?
That would be great, but presents a big problem: these names aren’t understood directly by the kernel interface.
The simple generic events like cycles
and instructions
are simply stringified versions of PERF_TYPE_HARDWARE
, so those are simple to support. However, another event I’ve used recently for example is cycle_activity.stalls_mem_any
. This is defined here (for Skylake, other files for other microarches). This JSON file is used to generate code for the userspace perf tool that describes the raw event values for this named event. libpfm4 has a similar giant table that describes this event.
There is such a huge number of possible events here that I don’t think it is feasible for the Go standard library to be in the business of understanding the names of every single event. In the proposed API above, I’ve addressed this by directly supporting the generic PERF_TYPE_HARDWARE
events and for anything beyond that just providing SetRawSource
that accepts PERF_TYPE_RAW
config values (which would be derived from the EventCode
, UMask
, etc seen in the JSON file). SetRawSource
is not user-friendly, but it provides just enough API for a third-party libpfm4-style package to understand the full set of possible events to help users configure it.
I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above. i.e.,
type CPUProfile struct { ... }
func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error
type CPUProfileSource int
const (
CPUTimer CPUProfileSource = iota
CPUCycles
CPUCyclesPrecise
)
That is, we add support for using a hardware cycle counter for CPU profiles (perhaps exposing a ‘precise’ option, perhaps not). Nothing more. Under the hood this uses the perf interface, but we don’t expose an API for using any other event types.
I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above.
I don't have a particularly strong opinion on how to define this API, so I think your proposal is also good, not exposing the internal fields of CPUProfile gives more flexibility to our implementation. And I don't think it's necessary to support too much events, including raw hardware event?
Most of the events that you’d use are CPU profiles
Yes, I think we will only support a few common PERF_TYPE_HARDWARE
events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a complete perf
.
Are these names as described in the documentation for perf record -e and shown in perf list?
Yes, they are. In contrast to exported constants, we don't need to export these strings, we won't support a lot of perf events.
const ( CPUTimer CPUProfileSource = iota CPUCycles CPUCyclesPrecise ) Maybe we can support a little more events. CL 410799 supported 6 events:
cycles, instructions, cache-references, cache-misses, branch-misses, bus-cycles.
I think at least
cycles
,instructions
,cache-misses
andbranch-misses
are still quite important.My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.
So this is the basis of the above discussion. Does it make sense if we can't find a proper way to access the PMU in Darwin and Windows? As far as I know, there is currently no way to access the Arm PMU on Windows, Arm has some work in progress to provide low level profiling infrastructure and tool for Windows, but it also seems to be struggling to meet our needs. So can we support the above API only on Linux? If so then what's the behavior of SetSource
on unsupported OSes? Panic or use os timer by default?
Yes, I think we will only support a few common
PERF_TYPE_HARDWARE
events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a completeperf
.
I completely agree. I was trying to be flexible since I've seen desire for arbitrary events (such as in #36821), but I think focusing just on cycles
first makes the most sense.
Maybe we can support a little more events. I think at least
cycles
,instructions
,cache-misses
andbranch-misses
are still quite important.
I think it is hard to decide what is important. There was a lot of discussion on #36821 about paring down the initial proposal to focus narrowly on improving CPU profiles without adding tons of extra options, and I think that is pertinent here. If there is general agreement that just cycles
is valuable, then I think we start there.
So can we support the above API only on Linux? If so then what's the behavior of
SetSource
on unsupported OSes? Panic or use os timer by default?
Yes, it would only be supported on Linux.
The way I envisioned this API is that NewCPUProfile()
automatically selects the best available profile source. Users that know they want a specific profile source can call SetSource
, which returns an error if that source is not available (either due to unsupported OS or lack of HW support). The primary use-cases I imagine for this API are:
cycles
profile and would rather not profile at all if that isn't available.P.S. it's possible that (1) will prevent us from making CPUProfile
/StartCPUProfile
automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.
P.P.S. since we are introducing the first implementation of NewCPUProfile
with this proposal, a halfway point would be for NewCPUProfile
to automatically select the best available source, but StartCPUProfile
to always use the OS timer. Existing code wouldn't automatically get better profiles, but StartCPUProfile
would be effectively deprecated, so new code would at least get better profiles by default.
P.S. it's possible that (1) will prevent us from making
CPUProfile
/StartCPUProfile
automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.
I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds
, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)
One concern that I believe @rhysh raised on #36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.
I tested with an application that does some spinning in userspace as well as performs an expensive system call.
Here are the results:
pprof
(CPU timers):
sudo perf -e cycles
(kernel+user):
perf -e cycles:u
(user only):
My perf_event_paranoid == 2
, which prevents access to kernel sampling. Simply using perf record -e cycles
has the same behavior of auto-selecting user mode only.
Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof
behavior, but honestly I doubt it since I can imagine that even getting counts with small periods could provide interesting side-channel security insights into what the kernel is doing that they'd want to stop.
(Edit: I dug through the kernel code a bit and it doesn't look like this is possible. In fact, it turns out that at least for Intel chips the PMU hardware itself has config bits for ring 0 and ring 3 counting, so the kernel doesn't even have to turn counting off/on on entry/exit.)
I think not counting is kernel mode is perhaps OK for an opt-in API, but it may be too big of a change from pprof to automatically select a PMU profile over a CPU timer.
One concern that I believe @rhysh raised on https://github.com/golang/go/issues/36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.
I think that comment is https://github.com/golang/go/issues/36821#issuecomment-664646116 . The design of the perf_event support at the time used a file descriptor per thread, and to prevent the profiler from using many more FDs than GOMAXPROCS it took explicit action to detach the perf_event FD from the thread when making a syscall. So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.
Yes, a default profiler that reports syscalls as taking no time at all seems likely to confuse.
I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)
For what it's worth, the feedback on one of my early attempts to change the default profiler (https://go.dev/cl/204279) included:
Right now, in practice, everybody sets the profiling rate to 100, runtime/pprof.StartCPUProfile. [...] I'm concerned that this will confuse people used to the standard rate.
and
I expect that many people look at the sample count and mentally divide by 100 to get the time spent.
But, there's a much better reason this time.
If there is general agreement that just cycles is valuable, then I think we start there.
Ok, this can be a good start. If people feel that other events are needed in the future, it will be much simpler.
I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)
Yes, as you mentioned before, we can keep StartCPUProfile
still using the OS timer so that its behavior doesn't change. For users who don't care about the unit of the sample and only care about the profiling precision, we choose the best available source for them in NewCPUProfile
. For users who wish to use a definite source can use SetSource
to set the source.
One concern that I believe @rhysh raised on https://github.com/golang/go/issues/36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.
perf_event_open has an option exclude_kernel
, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.
Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof behavior
If exclude_kernel is not set and the PMU keeps counting in the kernel mode, then this is feasible. As for the callchain, we collect it by stack tracing in go, so I don't think there will be any difference. Unless we want to read the FD directly to get the callchain, but there also seems to be options (exclude_callchain_kernel
and exclude_callchain_user
) to control whether to include the user or kernel callchain.
In my latest code, if it's pure go code, I set this option because it helps the precision a little bit. To be honest, there is no right or wrong to count the execution of the kernel or not. For example, if a kernel operation affects multiple threads (such as allocating a new heap arena), which thread is appropriate to assign this time to? But considering that to be consistent with the behavior of the OS timer, it may be reasonable to let the PMU counting the event of the kernel.
So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.
Thank @rhysh for raising this problem, I think we have some discussions in CL 410798 and 36821, and there seems to be no good solution for this problem. I didn't deal with this in CL 410798 because I think such cases are really rare, especially for go, where OS thread is shared by multiple goroutines. I'd like to hear more people's thoughts on whether this is an issue that needs to be dealt with.
How does the perf_event_open profiler deal with threads that aren't created by the Go runtime?
@rhysh you asked this question in https://github.com/golang/go/issues/36821#issuecomment-664646116, and this question really stumped me. With the current framework, non-go threads can only receive profiling signals from the process signal source created by setitimer, so I installed a PMU counter for the process. However the non-go thread rarely receive the profiling signals, I thought the process signal would randomly choose a thread to receive, but it doesn't seem to be so random. I looked at the code ( complete_signal) of the kernel, it seems that the main thread always gets the signal first. Since our main thread does not block the SIGPROF signal, most of the signals are received by the main thread, and the main thread has a per-thread PMU counter, so it ignores the signal. I don't know how to fix this problem.
@rsc Would you mind adding this proposal to active queue ? Thanks.
perf_event_open has an option
exclude_kernel
, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.
Yes, that is correct. However, if /proc/sys/kernel/perf_event_paranoid
>= 2, then setting exclude_kernel=false
requires CAP_PERFMON
or CAP_SYS_ADMIN
(effectively, you must be root): https://elixir.bootlin.com/linux/v5.18.14/source/include/linux/perf_event.h#L1351
(N.B., the perf tool automatically detects this case and sets exclude_kernel=true
if you are unpriviledged)
As far as I know, must Linux distributions set /proc/sys/kernel/perf_event_paranoid
to 2 by default. In practice this means that very few Go programs will be able to enable kernel counting.
I was brainstorming with @cherrymui earlier and one idea that came up is that we could collect both the OS timer and PMU profiles by default. The pprof format supports multiple different sample types in one file, so putting both in one file would not be a problem. I think the main downside is that we'd approximately double the overhead because we'd have SIGPROFs for both profiles.
However, if /proc/sys/kernel/perf_event_paranoid >= 2, then setting exclude_kernel=false requires CAP_PERFMON or CAP_SYS_ADMIN (effectively, you must be root)
Well, that's a bit of a hassle. I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.
Or we don't count the execution of the kernel, which behaves a little differently from the OS timer source, but is actually more accurate for user programs. This is also the default mode of perf.
one idea that came up is that we could collect both the OS timer and PMU profiles by default.
This is a bit tricky because different types of samples have different units, so we need to differentiate between them.
I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.
Certainly we can check permissions ahead of time, but I am skeptical that most development environments will have perf_event_paranoid < 2
. Perhaps we should do a survey of the defaults across some common Linux distributions.
This is a bit tricky because different types of samples have different units, so we need to differentiate between them.
Yes, the pprof
tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:
$ perf record -e cycles,instructions ./set
$ pprof perf.data
File: set
perf-version:5.17.11
perf-command:/usr/bin/perf record -e cycles,instructions ./set
Type: instructions:u_event
Duration: 5.43s, Total samples = 18125609784
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
Note that this is displaying Type: instructions:u_event
.
(pprof) options
...
sample_index = 0 //: [cycles:u_sample | cycles:u_event | instructions:u_sample | instructions:u_event]
...
Here are all the included types. Select cycles:u_event
with pprof -sample_index=1
.
but I am skeptical that most development environments will have perf_event_paranoid < 2. Perhaps we should do a survey of the defaults across some common Linux distributions.
I think you are right that the default perf_event_paranoid
value is 2 and 3 or 4 on some distributions, I wonder if we can make PMU profiling a feature that requires users to give more permissions or privileged users, similar with perf
tool. Or don't count kernel events.
Yes, the pprof tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:
But I think what we want to have is one accurate event rather than two less accurate events.
Based on the discussion above, I make a summary.
The proposal (by @prattmic) is this, I removed CPUCyclesPrecise
, target for Linux only.
// 1, Default event type is os-timer.
// 2, Default period is provided.
type CPUProfile struct { ... }
// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
// for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a minimum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the minimum threshold.
// 4, If c is nil, report an error and return.
// 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling.
func (c *CPUProfile) SetPeriod(int64) error
// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the event by call SetEvent.
// 3, If the source is not available, report and error and return.
// 4, The second parameter indicates whether to count kernel events. For the os-timer event, this parameter is always true.
// 5, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required, if the condition is not met and the second parameter is also specified as true, an error will be reported.
// 6, If cgo is used, only the CPUTimer event can be used.
// 7, The precise_ip of the PMU event is set to the maximum value available.
func (c *CPUProfile) SetEvent(CPUProfileEvent, bool) error
// SetSampleKernel sets whether to count the kernel events, pass in “true” to count the kernel events, otherwise not count.
// This function has no effect on os-timer based events, because os-timer events always count kernel events.
// To count the kernel events, the CAP_PERFORM permission or the admin permission is required.
func (c *CPUProfile) SetSampleKernel(bool) error
// 1, If the method receiver c is nil, report an error.
// 2, If the profiling has been started, report an error.
func (c *CPUProfile) Start(w io.Writer) error
// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error
type CPUProfileEvent int
const (
CPUTimer CPUProfileEvent = iota
CPUCycles
CPUBranchMisses
CPUCacheMisses
CPUInstructions
)
There are three problems: First, FD may be depleted. But that's probably not a problem as this is very rare. Disposal method: Do not handle.
Second, for PMU to count the kernel events, perf_event_open
requires the CAP_PERFORM permission or the admin permission, otherwise the behavior of the PMU will be inconsistent with the behavior of the OS timer.
Disposal method:
Third, it is difficult for non-go threads to receive signals from the PMU. Disposal method: If CGO profiling is on, only os-timer events can be used.
Everyone is welcome to brainstorm to move this proposal forward, thanks.
Edit1: Add Start method.
Edit2: Added some problem and behavior descriptions that I'm not sure about.
Edit3: Change func (c *CPUProfile) SetRate(int) error
to func (c *CPUProfile) SetPeriod(int64) error
.
Rename CPUProfileSource
to CPUProfileEvent
.
Rename func (c *CPUProfile) SetSource(CPUProfileSource) error
to func (c *CPUProfile) SetEvent(CPUProfileEvent) error
.
Add func SetPMUEventPrecise(PMUEventPrecise ) error
and type PMUEventPrecise int
.
Add CPUBranchMisses
, CPUCacheMisses
and CPUInstructions
.
Add Disposal methods for the above three problems.
Edit4: Change func SetPMUEventPrecise(PMUEventPrecise) error
to func (c *CPUProfile) SetPMUEventSkid(PMUEventSkid) error
Add func (c *CPUProfile) Stop(w io.Writer) error
method
Modified some behavior descriptions.
Edit5: Modified some behavior descriptions.
Changes func (c *CPUProfile) SetEvent(CPUProfileEvent) error
to func (c *CPUProfile) SetEvent( CPUProfileEvent, bool) error
Edit6: Add method func (c *CPUProfile) SetSampleKernel(bool) error
Edit7: updated the behavior description of the Start
method
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
To add to the second issue in https://github.com/golang/go/issues/53286#issuecomment-1200714095 (no kernel events), I think that this difference is acceptable for an explicit API (i.e., users must explicitly call SetSource(CPUCycles)
), but that it is too big of a difference to automatically change StartCPUProfile
to using the PMU when available.
it is too big of a difference to automatically change StartCPUProfile to using the PMU when available.
If the second issue is not a problem, I personally think that this is not a big problem, it is just a behavioral rule. My opinion is the same as yours, StartCPUProfile uses os-timer by default, so we keep the original behavior. And leave the new API to users who need it.
Third, it is difficult for non-go threads to receive signals from the PMU. Currently I haven't found any possible solution.
For this problem I think we can do like this:
It sounds like maybe people are happy with https://github.com/golang/go/issues/53286#issuecomment-1200714095?
Are there any other likely CPUProfileSource values that we expect to support? Right now it's looking like a bool, and if we had a few more it might help us understand the shape of the API better.
I think we eventually want to support precise sampling (i.e. PEBS, CPUCyclesPrecise
mentioned earlier). Though that could be considered more of a 'modifier' that could be set with a different method. I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.
https://github.com/golang/go/issues/53286#issuecomment-1196497300 also mentions interest in instruction sampling, cache misses (which cache level?) and branch misprediction. I am not necessarily convinced one way or another whether we want these specific events, as there is a huge tail of potential events, but I do think we should keep things more open than a boolean.
I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.
Do you mean perf_event_open.precise_ip
? We can set four different precision values for it, so if we specify the precision by CPUCyclesPrecise
event, we need to set four different events, maybe we only care about one of them? Another way is to create a new function such as SetPresice(int)
or add a precise parameter to the SetRate
function.
https://github.com/golang/go/issues/53286#issuecomment-1196497300 also mentions interest in instruction sampling, cache misses (which cache level?) and branch misprediction.
I meant the pre-defined hardware event (that is the PMU events ?) of perf list
. I also don't have a strong opinion about adding or not adding these events, but if we don't object to adding a few more common PMU events, we might add them so that we don't need to raise another proposal when we need them later. If possible, I would like to add these three events branch-misses, cache-misses and instructions as they are the most commonly used ones.
At present, we define the event as a constant value. I think it is extensible. If we want to support more events later, we only need to add it after this list.
I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.
Do you mean
perf_event_open.precise_ip
? We can set four different precision values for it, so if we specify the precision byCPUCyclesPrecise
event, we need to set four different events, maybe we only care about one of them? Another way is to create a new function such asSetPresice(int)
or add a precise parameter to theSetRate
function.
Right, there are lots of ways we could encode precise_ip in the API. By "automatic", I meant that CPUCycles
could automatically set the highest support precise_ip level. That is what I think we should not do because then it is not clear how to interpret the results from CPUCycles
.
Ok, so maybe we should add another function SetPMUEventPrecise(int)
. I think it's better to use a separate function to set the precise instead of a parameter of SetRate(int)
, because then we don't have to worry about the name of this function. And SetRate
also works with os-timer, but we don't need to set precision for os-timer, so it seems more reasonable to separate and name it SetPMUEventPrecise
.
There is some discussion here and on #42502 about changing the name from CPUProfile to something slightly more general. But Profile is already taken for the long-running counters like heap and contention. All these profiles are being collected in response to some kind of interrupt based on a CPU event, so the name is not entirely inaccurate. So probably it's fine to leave as is. There's not an obvious better name, and it does match StartCPUProfile etc.
A few nit-picks on the API:
runtime.SetCPUProfileRate is Hz. If we use the word Rate, it should be Hz. But we could use Period, which in time is always nanoseconds and for other events is counting how many of those events go by between samples (including one of the samples).
I proposed always using a period rather than a frequency because I think it would apply to all events, while frequency (in Hz) is only natural for time-based events. For example, while runtime.SetCPUProfileRate is in Hz, the other runtime.Set*ProfileRate functions take periods, for exactly this reason.
If the period for time-based events is in nanoseconds, then SetPeriod can take an int
(perhaps int64
). In that case, we should probably document how to use SetPeriod with a time.Duration.
OK, I think Russ's idea is great.
If the period for time-based events is in nanoseconds, then SetPeriod can take an int (perhaps int64). In that case, we should probably document how to use SetPeriod with a time.Duration.
Yes, I think it is better to make the unit of time-based events default to nanoseconds. And we need to document this point.
Based on the discussion above, I updated issuecomment-1200714095, please check again, thanks.
Another question is should we set default values? What's the default event, period and precise for an uninitialized CPUProfile type variable?
Based on the discussion above, I updated https://github.com/golang/go/issues/53286#issuecomment-1200714095, please check again, thanks.
I have a quick question about the proposed API: Why is there no Stop()
method? I see that the issue description is talking about it:
It would be better if a corresponding Stop method was exported, but it is not necessary, we can correctly stop the profiling by the cached configuration.
But I don't understand what "stop the profiling by the cached configuration" means. Sorry if I'm missing something.
But I don't understand what "stop the profiling by the cached configuration" means
I meant we can stop PMU profiling correctly with the existing pprof.StopCPUProfile()
function with some internal modifications. But we can also stop PMU profiling via a new exported Stop
method, and I think both are fine.
I think it would be a little weird to call the Start()
method on a *CPUProfile
instance and then use a package function to stop it. So I'd be in favor of adding a Stop()
method to *CPUProfile
.
Another question: #42502 (comment) proposes a NewCPUProfile
constructor for *CPUProfile
, but I don't see it in this proposal. Was it omitted for brevity, or are you suggesting to not have a constructor? I feel less strongly about this one.
Perhaps it would help to add a full example to the proposal that demonstrates how to use the new API to produce a profile?
Was it omitted for brevity,
Yes, I think this is the case.
Perhaps it would help to add a full example to the proposal that demonstrates how to use the new API to produce a profile?
The following are two examples, one for CPUTimer
event and one for CPUCycles
event.
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
func main() {
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
defer f.Close()
var cpuprof pprof.CPUProfile
if err := cpuprof.SetEvent(pprof.CPUTimer); err != nil {
log.Fatal("could not set CPUTimer event: ", err)
}
if err := cpuprof.SetPeriod(int64(1e9 /100)); err != nil { // assume profiling at 100HZ frequency
log.Fatal("Failed to set the profiling period: ", err)
}
if err := cpuprof.Start(); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile() // or defer cpuprof.Stop()
}
// ... rest of the program
}
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
func main() {
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
defer f.Close()
var cpuprof pprof.CPUProfile
if err := cpuprof.SetEvent(pprof.CPUCycles); err != nil {
log.Fatal("could not set CPUTimer event: ", err)
}
if err := cpuprof.SetPeriod(10000); err != nil { // assume sampling every 10000 cycles
log.Fatal("Failed to set the profiling period: ", err)
}
if err := cpuprof.SetPMUEventPrecise(pprof.REQUEST0); err != nil {
log.Fatal("Failed to set the requested precision: ", err)
}
if err := cpuprof.Start(); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile() // or defer cpuprof.Stop()
}
// ... rest of the program
}
Given that a program could reasonably have multiple CPU profiles going at once, either for different events or even just for different rates on the same event (e.g., for background profiling), I think the API should definitely have a Stop
method.
It sounds like everyone is in favor of:
// 1, No default event type or default to os-timer or cycles or the best available one ?
// 2, No default period or based on the default event ?
type CPUProfile struct { ... }
// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
// for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a maximum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the maximum threshold.
// 4, The event type needs to be set first, otherwise an error is reported and returned.
// 5, If c is nil, report an error and return.
// 6, If the profiling has been started, can we reset the profiling rate by call SetRate ?
func (c *CPUProfile) SetPeriod(int64) error
// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the source by call SetSource.
// 3, If the source is not available, report and error and return.
// 4, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required,
// 5, If cgo is used, only the CPUTimer event can be used.
func (c *CPUProfile) SetEvent(CPUProfileEvent) error
// 1, If the method receiver c is nil, report an error, stop.
// 2, If the profiling has been started, return nil directly.
func (c *CPUProfile) Start(w io.Writer) error
type CPUProfileEvent int
const (
CPUTimer CPUProfileEvent = iota
CPUCycles
CPUBranchMisses
CPUCacheMisses
CPUInstructions
)
What about the PMU API?
type PMUEventPrecise int
const (
ARBITRARY PMUEventPrecise = iota // 0 SAMPLE_IP can have arbitrary skid.
CONSTANT // 1 SAMPLE_IP must have constant skid.
REQUEST0 // 2 SAMPLE_IP requested to have 0 skid.
MUST0 // 3 SAMPLE_IP must have 0 skid.
)
// SetPMUEventPrecise sets the profiling precise of PMU events, which controls the amount of skid.
// 1, For other events, the function is unavailable and returns nil.
// 2, Returns an error if the requested precision is unreachable.
// 3, If this function is not explicitly called, the default precision is ARBITRARY.
func SetPMUEventPrecise(PMUEventPrecise) error
These names are not very Go-like, and I'm not sure I understand why this is a global function instead of a method on CPUProfile. Is this API still part of the proposal?
Overall that looks pretty good to me. A few comments:
// 3, The profiling period has a maximum threshold to prevent the profiling rate from being too high. // When the parameter value exceeds the threshold, it will be adjusted to the maximum threshold.
Since this is a period, this should be a minimum threshold.
What about the PMU API?
I'm not convinced precision needs to be part of an initial API, though maybe it's worth agreeing on an API now and separately deciding "do we want precision at all?"
If we have API for precision, it should definitely be a method of CPUProfile
because it controls per-profile configuration, and obviously it should use more Go-like names. I'd also be inclined to phrase it directly in terms of "skid" because that's what you're controlling.
Finally, I think the API should also include a Stop
method:
// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error
Even if we don't initially support multiple simultaneous CPU profiles, this makes it possible to support them in the future. Furthermore, it makes the new API entirely self-contained, rather than depending on a package-level function for ultimately stopping the profile. To that end, I think StopCPUProfile
should not stop any profiles started by the CPUProfile
API. Probably, StartCPUProfile
and StopCPUProfile
would become trivial wrappers around a global (unexported) CPUProfile
.
// 1, No default event type or default to os-timer or cycles or the best available one ?
I think we should default to os-timer since os-timer and cycle will have different behavior for time spent in the kernel. But given this is a new API, I could go either way as long as StartCPUProfile always uses os-timer.
// 2, No default period or based on the default event ?
IMO, providing a default period for each event would be nice. (SetEvent thus overrides previous calls to SetPeriod).
// 6, If the profiling has been started, can we reset the profiling rate by call SetRate ?
IMO, changing Rate or Event during profiling should be an error. The pprof proto can only list 1 period per sample type, plus I'm not sure why you'd want to change it?
I also agree with @aclements that there should be a Stop method and precision should be a method.
The pprof proto can only list 1 period per sample type, plus I'm not sure why you'd want to change it?
I agree that it should be an error to change Rate or Event on a running profile, but wanted to point out that you can, in effect, change the rate of a running profile by choosing a common denominator period for the profile. For example, set Profile.sample_type
to a small unit like nanoseocnds and set Profile.period
to a small value like 1. The account for the actual period by varying the Sample.value
of each sample.
If we support starting multiple profiles for the same event but with different periods (e.g., to support background profiling), then we may need to do a trick like this internally.
I'm not sure I understand why this is a global function instead of a method on CPUProfile. Is this API still part of the proposal?
Sorry, this is a mistake, it definitely should be a method of CPUProfile.
Since this is a period, this should be a minimum threshold.
Yes, updated.
Finally, I think the API should also include a Stop method:
Ok, I have added it. And yes StartCPUProfile
and StopCPUProfile
would become trivial wrappers around a global (unexported) CPUProfile.
In addition, I also updated some behavior descriptions according to @prattmic 's comments.
Finally let me discuss whether to add the SetPMUEventSkid
methods, and how to name it and the related constants. Personally I tend not to add this method and choose the smallest skid available in the implementation. Of course we will document this in the API. issuecomment-1200714095
Finally let me discuss whether to add the SetPMUEventSkid methods, and how to name it and the related constants. Personally I tend not to add this method and choose the smallest skid available in the implementation.
I agree with that. We can add it in the future if we want.
For choosing the smallest skid available, is that equivalent to SKIDRequest0? Or does that fail if it can't get at least SKIDConstant? (This is a property of perf, and we can create our own behavior either way, just curious.)
Or does that fail if it can't get at least SKIDConstant?
The kernel code for this is, unsurprisingly, confusing, but there are a lot of conditions that just look for precise_ip != 0
, and I didn't find any obvious "downgrading" code, so I think SKIDRequest0 will fail rather than downgrade to SKIDArbitrary.
A few nits:
// 4, The event type needs to be set first, otherwise an error is reported and returned.
func (c *CPUProfile) SetPeriod(int64) error
If this is the case, it seems SetEvent should explicitly reset the period, in case you call SetEvent, then SetPeriod, then SetEvent.
// 5, If c is nil, report an error and return.
func (c *CPUProfile) SetPeriod(int64) error
We don't typically specify that a nil receiver results in an error.
// 4, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required,
func (c *CPUProfile) SetEvent(CPUProfileEvent) error
This should probably be specified as part of the events.
42502 Proposes configurable CPU profiling via NewCPUProfile + cpuProfile.Start and the proposal has been accepted. But the proposal does not appear to approve any exported fields, nor does it support any perf event extensions. So I propose to extend this proposal to support configuring perf events.
I propose to add the following exported struct and method to the pprof package.
It would be better if a corresponding Stop method was exported, but it is not necessary, we can correctly stop the profiling by the cached configuration.
If we want to use PMU profiling in
http/pprof
andtesting
packages, then we need to make a few more changes.For
http/pprof
, we are not necessary to add a new interface, we can add some parameters to the profiling URL, and then do the corresponding profiling according to the parameters. for example:go tool pprof http://localhost:6060/debug/pprof/profile?event=cycles&&rate=10000
For the
testing
package, we need to add two options to thego test
command, one is-perfprofile perf.out
to specify the output file, and the other is-perfprofileconfig "conf"
to specify the configuration. These two options need to exist at the same time to work. The value of-perfprofileconfig
must meet the specified format:"{event=EEE rate=RRR}"
Example:go test -perfprofile perf.out -perfprofileconfig="{event=EEE rate=RRR}" -run .
The introduction of perf event helps to improve the precision of profiling, which is also beneficial for PGO in the future. I have made a preliminary implementation, see the series of patches https://go-review.googlesource.com/c/go/+/410796/5, the implementation does not make any changes to the
http/pprof
andtesting
packages.This proposal is related to #42502 and #36821.