erifan commented 2 years ago

42502 Proposes configurable CPU profiling via NewCPUProfile + cpuProfile.Start and the proposal has been accepted. But the proposal does not appear to approve any exported fields, nor does it support any perf event extensions. So I propose to extend this proposal to support configuring perf events.

I propose to add the following exported struct and method to the pprof package.

type PerfProfile {  // We already have struct Profile, name it as PerfProfile
    Event string  // perf event names, such as cycles, instructions
    Rate int  // perf event period or os-timer frequency
}

func (p *PerfProfile) Start(w io.Writer) error {}

It would be better if a corresponding Stop method was exported, but it is not necessary, we can correctly stop the profiling by the cached configuration.

If we want to use PMU profiling in http/pprofand testingpackages, then we need to make a few more changes.

For http/pprof, we are not necessary to add a new interface, we can add some parameters to the profiling URL, and then do the corresponding profiling according to the parameters. for example: go tool pprof http://localhost:6060/debug/pprof/profile?event=cycles&&rate=10000

For the testingpackage, we need to add two options to the go test command, one is -perfprofile perf.out to specify the output file, and the other is -perfprofileconfig "conf" to specify the configuration. These two options need to exist at the same time to work. The value of -perfprofileconfig must meet the specified format: "{event=EEE rate=RRR}" Example: go test -perfprofile perf.out -perfprofileconfig="{event=EEE rate=RRR}" -run .

The introduction of perf event helps to improve the precision of profiling, which is also beneficial for PGO in the future. I have made a preliminary implementation, see the series of patches https://go-review.googlesource.com/c/go/+/410796/5, the implementation does not make any changes to the http/pprof and testing packages.

This proposal is related to #42502 and #36821.

prattmic commented 2 years ago

I have a few minor questions:

What is the output format of these profiles? Still pprof, I assume (rather than perf.data)? In that case, are there special considerations we need to take for particular perf events that may not map nicely to the existing pprof format, or should everything fit nicely?
What is the plan for non-Linux OSes? https://github.com/golang/go/issues/36821#issuecomment-598361234 mentions that Darwin and Windows don't have clear PMU interfaces. Is PerfProfile only exported on Linux? Or Start returns an error on non-Linux?

erifan commented 2 years ago

What is the output format of these profiles? Still pprof, I assume (rather than perf.data)? In that case, are there special considerations we need to take for particular perf events that may not map nicely to the existing pprof format, or should everything fit nicely?

Yes, still pprof. I didn't change the output format, just replace the original signal source (os-timer) with the PMU counter. The signal handler is also basically unchanged. So pprof is fitting well. The only change is to change the original sample type from nanosecondsto count. I'd like to convert count to nanoseconds because sometimes count doesn't seem so intuitive, but there doesn't seem to be a proper way to do this conversion.

What is the plan for non-Linux OSes? proposal: runtime/pprof: add PMU-based profiles #36821 (comment) mentions that Darwin and Windows don't have clear PMU interfaces. Is PerfProfile only exported on Linux? Or Start returns an error on non-Linux?

Yes, my plan is to only support linux, if we find a appropriate method to support PMU on Windows and Darwin in the future, we will add support for them then. It may be appropriate to panic in Start before it is implemented.

nsrip-dd commented 2 years ago

Would it be possible to measure multiple events simultaneously by creating a PerfProfile for every event of interest? Alternatively, by giving PerfProfile a method like func (*PerfProfile) MeasureEvent(name string, period int) that can be called multiple times, and having all events go in the same profile?

erifan commented 2 years ago

Would it be possible to measure multiple events simultaneously by creating a PerfProfile for every event of interest? Alternatively, by giving PerfProfile a method like func (*PerfProfile) MeasureEvent(name string, period int) that can be called multiple times, and having all events go in the same profile?

My prototype implementation doesn't account for multiple events, but it's not hard to support multiple events. I can do it however I'm not a proposal approver. Thanks.

prattmic commented 2 years ago

What is the plan for non-Linux OSes?

I did some research today on other OSes today to get a sense of the capabilities.

Windows

The Windows Performance Recorder CLI supports collecting some PMU events. There is an API to control profile collection, though I believe this communicates with a centralized service that actually does profile collection. As far as I can tell, there is no direct kernel interface. Additionally, it is not possible to limit collection to a specific process as far as I can tell.

Unanswered questions:

Is it possible to collect stack traces / auxiliary information with events, or only counts?
What permissions are required to use this interface? (Given the system-wide collection, I imagine elevated privileges are required.)

Open source libraries like PCM use a custom Windows kernel driver to access PMUs.

Visual Studio has a command capable of collecting PMU events for a specific process. It is unclear to me how this works. Does it use the WPR API?

My takeaway is that there isn't really anything feasible for us to programmatically integrate with, unless I've missed something.

Darwin

Xcode Instruments supports collecting counts from PMU events. How it does so is unclear.

PCM requires a custom kernel driver, like on Windows.

The Darwin kernel has a PMC subsystem called kpc. It looks like this is exposed to userspace, but is only usable by one "blessed" process?

I suspect that there is nothing feasible here either, but I am less confident than with Windows.

FreeBSD

FreeBSD has a complete libpmc library for accessing hardware performance counters in counting or sampling mode (includes support for sending SIGPROF). Internally this uses system calls to a kernel module. I haven't looked into this in much detail, but clearly there is something feasible here.

erifan commented 2 years ago

Thanks @prattmic for the research, I don't know much about other OSes, I'll look into how to access the PMU programmatically on other OS's. I wonder if you guys @golang/windows can provide some hints for Windows?

Also I wonder if it is really difficult to access PMU on a OS, can we not support it on this OS? Just like per thread OS timer, we also only support it on Linux.

prattmic commented 2 years ago

I should have added more context to my previous comment. The purpose of my research was to get a sense of how hard we should try to design a platform-independent API (even if only Linux is supported initially). My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.

prattmic commented 2 years ago

In this proposal, you are proposing a new type PerfProfile. However, your original CL introduced the previously approved (#42502) CPUProfile type and extended it with perf event support.

I actually like that approach better than adding a new top-level perf-specific profile type for a few reasons:

Most of the events that you’d use are CPU profiles. The existing StartCPUProfile is a CPU time profile. The standard perf cycles event would be a CPU cycles profile. Similar with instructions, branch mispredictions. Even events like L1D cache misses are attributed to specific instructions, and thus I think could be reasonably considered CPU profiles.
- The main exceptions are much more obscure events that measure things like DRAM read/write bandwidth. I am not sure these would make much sense to use in a sampling mode anyways.
I think one of the most exciting parts of this proposal is the potential to automatically provide more accurate CPU profiles to everyone using StartCPUProfile/CPUProfile/-cpuprofile if the PMU is available. I think this conceptually make the most sense if the PMU options are directly in CPUProfile.
- That said, perhaps CPUProfile could just provide options for timer (ITIMER_PROF), cycles (PMU), cycles-precise (PMU+PEBS equivalent) and any more advanced usage is in another profile type.

I also think that using Set methods rather than exported fields provides more flexibility on the struct internals plus the ability to return errors, though I don’t feel too strongly about this as it does make the API below more verbose. One such possible API:

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
    CPUTimer CPUProfileSource = iota
    CPUCycles
    CPUCyclesPrecise
    Custom // Configure via Sys().
)

// Returns *LinuxCPUProfile on Linux.
func (*CPUProfile) Sys() any

// LinuxCPUProfile contains Linux perf specific advanced configuration options.
type LinuxCPUProfile struct{ ... }

// Use one of the generic PERF_TYPE_HARDWARE events.
func (*LinuxCPUProfile) SetHardwareSource(LinuxCPUProfileHardwareSource) error

// Use a raw hardware event (PERF_TYPE_RAW). Format of config, config1, and config2 are microarchitecture dependent.
func (*LinuxCPUProfile) SetRawSource(config, config1, config2 uint64) error

// Additional options. e.g., setting the “precise” flag.

The idea of hiding OS-dependent details behind a Sys() method comes from os.ProcessState.Sys() and os/exec.Cmd.SysProcAttr. I’m not the biggest fan of this pattern, but it is nice to be able to provide advanced options without needing to an OS-independent API for every feature, especially given https://github.com/golang/go/issues/53286#issuecomment-1194696910 indicating that we are unlikely to support Windows or Darwin anyways.

Another big unanswered question is how events should be described. The proposed PerfProfile API says Event string // perf event names, such as cycles, instructions. Are these names as described in the documentation for perf record -e and shown in perf list?

That would be great, but presents a big problem: these names aren’t understood directly by the kernel interface.

The simple generic events like cycles and instructions are simply stringified versions of PERF_TYPE_HARDWARE, so those are simple to support. However, another event I’ve used recently for example is cycle_activity.stalls_mem_any. This is defined here (for Skylake, other files for other microarches). This JSON file is used to generate code for the userspace perf tool that describes the raw event values for this named event. libpfm4 has a similar giant table that describes this event.

There is such a huge number of possible events here that I don’t think it is feasible for the Go standard library to be in the business of understanding the names of every single event. In the proposed API above, I’ve addressed this by directly supporting the generic PERF_TYPE_HARDWARE events and for anything beyond that just providing SetRawSource that accepts PERF_TYPE_RAW config values (which would be derived from the EventCode, UMask, etc seen in the JSON file). SetRawSource is not user-friendly, but it provides just enough API for a third-party libpfm4-style package to understand the full set of possible events to help users configure it.

prattmic commented 2 years ago

I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above. i.e.,

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
    CPUTimer CPUProfileSource = iota
    CPUCycles
    CPUCyclesPrecise
)

That is, we add support for using a hardware cycle counter for CPU profiles (perhaps exposing a ‘precise’ option, perhaps not). Nothing more. Under the hood this uses the perf interface, but we don’t expose an API for using any other event types.

erifan commented 2 years ago

I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above.

I don't have a particularly strong opinion on how to define this API, so I think your proposal is also good, not exposing the internal fields of CPUProfile gives more flexibility to our implementation. And I don't think it's necessary to support too much events, including raw hardware event?

Most of the events that you’d use are CPU profiles

Yes, I think we will only support a few common PERF_TYPE_HARDWARE events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a complete perf.

Are these names as described in the documentation for perf record -e and shown in perf list?

Yes, they are. In contrast to exported constants, we don't need to export these strings, we won't support a lot of perf events.

const ( CPUTimer CPUProfileSource = iota CPUCycles CPUCyclesPrecise ) Maybe we can support a little more events. CL 410799 supported 6 events:
cycles,
instructions,
cache-references,
cache-misses,
branch-misses,
bus-cycles.
I think at least cycles, instructions, cache-misses and branch-misses are still quite important.

My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.

So this is the basis of the above discussion. Does it make sense if we can't find a proper way to access the PMU in Darwin and Windows? As far as I know, there is currently no way to access the Arm PMU on Windows, Arm has some work in progress to provide low level profiling infrastructure and tool for Windows, but it also seems to be struggling to meet our needs. So can we support the above API only on Linux? If so then what's the behavior of SetSourceon unsupported OSes? Panic or use os timer by default?

prattmic commented 2 years ago

Yes, I think we will only support a few common PERF_TYPE_HARDWARE events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a complete perf.

I completely agree. I was trying to be flexible since I've seen desire for arbitrary events (such as in #36821), but I think focusing just on cycles first makes the most sense.

Maybe we can support a little more events. I think at least cycles, instructions, cache-misses and branch-misses are still quite important.

I think it is hard to decide what is important. There was a lot of discussion on #36821 about paring down the initial proposal to focus narrowly on improving CPU profiles without adding tons of extra options, and I think that is pertinent here. If there is general agreement that just cycles is valuable, then I think we start there.

So can we support the above API only on Linux? If so then what's the behavior of SetSourceon unsupported OSes? Panic or use os timer by default?

Yes, it would only be supported on Linux.

The way I envisioned this API is that NewCPUProfile() automatically selects the best available profile source. Users that know they want a specific profile source can call SetSource, which returns an error if that source is not available (either due to unsupported OS or lack of HW support). The primary use-cases I imagine for this API are:

Folks that always want the timer-based profile either for more consistency across all profiles in their fleet or they have tooling that expects a cpu-s profile.
Folks that definitely want a precise cycles profile and would rather not profile at all if that isn't available.

P.S. it's possible that (1) will prevent us from making CPUProfile/StartCPUProfile automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.

P.P.S. since we are introducing the first implementation of NewCPUProfile with this proposal, a halfway point would be for NewCPUProfile to automatically select the best available source, but StartCPUProfile to always use the OS timer. Existing code wouldn't automatically get better profiles, but StartCPUProfile would be effectively deprecated, so new code would at least get better profiles by default.

prattmic commented 2 years ago

P.S. it's possible that (1) will prevent us from making CPUProfile/StartCPUProfile automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.

I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

prattmic commented 2 years ago

One concern that I believe @rhysh raised on #36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

I tested with an application that does some spinning in userspace as well as performs an expensive system call.

Here are the results:

pprof (CPU timers):

sudo perf -e cycles (kernel+user):

perf -e cycles:u (user only):

My perf_event_paranoid == 2, which prevents access to kernel sampling. Simply using perf record -e cycles has the same behavior of auto-selecting user mode only.

Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof behavior, but honestly I doubt it since I can imagine that even getting counts with small periods could provide interesting side-channel security insights into what the kernel is doing that they'd want to stop.

(Edit: I dug through the kernel code a bit and it doesn't look like this is possible. In fact, it turns out that at least for Intel chips the PMU hardware itself has config bits for ring 0 and ring 3 counting, so the kernel doesn't even have to turn counting off/on on entry/exit.)

I think not counting is kernel mode is perhaps OK for an opt-in API, but it may be too big of a change from pprof to automatically select a PMU profile over a CPU timer.

rhysh commented 2 years ago

One concern that I believe @rhysh raised on https://github.com/golang/go/issues/36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

I think that comment is https://github.com/golang/go/issues/36821#issuecomment-664646116 . The design of the perf_event support at the time used a file descriptor per thread, and to prevent the profiler from using many more FDs than GOMAXPROCS it took explicit action to detach the perf_event FD from the thread when making a syscall. So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.

Yes, a default profiler that reports syscalls as taking no time at all seems likely to confuse.

I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

For what it's worth, the feedback on one of my early attempts to change the default profiler (https://go.dev/cl/204279) included:

Right now, in practice, everybody sets the profiling rate to 100, runtime/pprof.StartCPUProfile. [...] I'm concerned that this will confuse people used to the standard rate.

and

I expect that many people look at the sample count and mentally divide by 100 to get the time spent.

But, there's a much better reason this time.

erifan commented 2 years ago

If there is general agreement that just cycles is valuable, then I think we start there.

Ok, this can be a good start. If people feel that other events are needed in the future, it will be much simpler.

I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

Yes, as you mentioned before, we can keep StartCPUProfile still using the OS timer so that its behavior doesn't change. For users who don't care about the unit of the sample and only care about the profiling precision, we choose the best available source for them in NewCPUProfile. For users who wish to use a definite source can use SetSourceto set the source.

One concern that I believe @rhysh raised on https://github.com/golang/go/issues/36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

perf_event_open has an option exclude_kernel, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.

Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof behavior

If exclude_kernel is not set and the PMU keeps counting in the kernel mode, then this is feasible. As for the callchain, we collect it by stack tracing in go, so I don't think there will be any difference. Unless we want to read the FD directly to get the callchain, but there also seems to be options (exclude_callchain_kerneland exclude_callchain_user) to control whether to include the user or kernel callchain.

In my latest code, if it's pure go code, I set this option because it helps the precision a little bit. To be honest, there is no right or wrong to count the execution of the kernel or not. For example, if a kernel operation affects multiple threads (such as allocating a new heap arena), which thread is appropriate to assign this time to? But considering that to be consistent with the behavior of the OS timer, it may be reasonable to let the PMU counting the event of the kernel.

So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.

Thank @rhysh for raising this problem, I think we have some discussions in CL 410798 and 36821, and there seems to be no good solution for this problem. I didn't deal with this in CL 410798 because I think such cases are really rare, especially for go, where OS thread is shared by multiple goroutines. I'd like to hear more people's thoughts on whether this is an issue that needs to be dealt with.

erifan commented 2 years ago

How does the perf_event_open profiler deal with threads that aren't created by the Go runtime?

@rhysh you asked this question in https://github.com/golang/go/issues/36821#issuecomment-664646116, and this question really stumped me. With the current framework, non-go threads can only receive profiling signals from the process signal source created by setitimer, so I installed a PMU counter for the process. However the non-go thread rarely receive the profiling signals, I thought the process signal would randomly choose a thread to receive, but it doesn't seem to be so random. I looked at the code ( complete_signal) of the kernel, it seems that the main thread always gets the signal first. Since our main thread does not block the SIGPROF signal, most of the signals are received by the main thread, and the main thread has a per-thread PMU counter, so it ignores the signal. I don't know how to fix this problem.

erifan commented 2 years ago

@rsc Would you mind adding this proposal to active queue ? Thanks.

prattmic commented 2 years ago

perf_event_open has an option exclude_kernel, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.

Yes, that is correct. However, if /proc/sys/kernel/perf_event_paranoid >= 2, then setting exclude_kernel=false requires CAP_PERFMON or CAP_SYS_ADMIN (effectively, you must be root): https://elixir.bootlin.com/linux/v5.18.14/source/include/linux/perf_event.h#L1351

(N.B., the perf tool automatically detects this case and sets exclude_kernel=true if you are unpriviledged)

As far as I know, must Linux distributions set /proc/sys/kernel/perf_event_paranoid to 2 by default. In practice this means that very few Go programs will be able to enable kernel counting.

I was brainstorming with @cherrymui earlier and one idea that came up is that we could collect both the OS timer and PMU profiles by default. The pprof format supports multiple different sample types in one file, so putting both in one file would not be a problem. I think the main downside is that we'd approximately double the overhead because we'd have SIGPROFs for both profiles.

erifan commented 2 years ago

However, if /proc/sys/kernel/perf_event_paranoid >= 2, then setting exclude_kernel=false requires CAP_PERFMON or CAP_SYS_ADMIN (effectively, you must be root)

Well, that's a bit of a hassle. I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.

Or we don't count the execution of the kernel, which behaves a little differently from the OS timer source, but is actually more accurate for user programs. This is also the default mode of perf.

one idea that came up is that we could collect both the OS timer and PMU profiles by default.

This is a bit tricky because different types of samples have different units, so we need to differentiate between them.

prattmic commented 2 years ago

I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.

Certainly we can check permissions ahead of time, but I am skeptical that most development environments will have perf_event_paranoid < 2. Perhaps we should do a survey of the defaults across some common Linux distributions.

This is a bit tricky because different types of samples have different units, so we need to differentiate between them.

Yes, the pprof tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:

$ perf record -e cycles,instructions ./set
$ pprof perf.data                                                                                                                           
File: set
perf-version:5.17.11
perf-command:/usr/bin/perf record -e cycles,instructions ./set
Type: instructions:u_event
Duration: 5.43s, Total samples = 18125609784
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Note that this is displaying Type: instructions:u_event.

(pprof) options
...
  sample_index              = 0                    //: [cycles:u_sample | cycles:u_event | instructions:u_sample | instructions:u_event]
...

Here are all the included types. Select cycles:u_event with pprof -sample_index=1.

erifan commented 2 years ago

but I am skeptical that most development environments will have perf_event_paranoid < 2. Perhaps we should do a survey of the defaults across some common Linux distributions.

I think you are right that the default perf_event_paranoid value is 2 and 3 or 4 on some distributions, I wonder if we can make PMU profiling a feature that requires users to give more permissions or privileged users, similar with perftool. Or don't count kernel events.

Yes, the pprof tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:

But I think what we want to have is one accurate event rather than two less accurate events.

erifan commented 2 years ago

Based on the discussion above, I make a summary. The proposal (by @prattmic) is this, I removed CPUCyclesPrecise, target for Linux only.

// 1, Default event type is os-timer.
// 2, Default period is provided.
type CPUProfile struct { ... }

// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
//  for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a minimum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the minimum threshold.
// 4, If c is nil, report an error and return.
// 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling.
func (c *CPUProfile) SetPeriod(int64) error

// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the event by call SetEvent.
// 3, If the source is not available, report and error and return.
// 4, The second parameter indicates whether to count kernel events. For the os-timer event, this parameter is always true.
// 5, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required, if the condition is not met and the second parameter is also specified as true, an error will be reported.
// 6, If cgo is used, only the CPUTimer event can be used.
// 7, The precise_ip of the PMU event is set to the maximum value available.
func (c *CPUProfile) SetEvent(CPUProfileEvent, bool) error

// SetSampleKernel sets whether to count the kernel events, pass in “true” to count the kernel events, otherwise not count.
// This function has no effect on os-timer based events, because os-timer events always count kernel events.
// To count the kernel events, the CAP_PERFORM permission or the admin permission is required.
func (c *CPUProfile) SetSampleKernel(bool) error

// 1, If the method receiver c is nil, report an error.
// 2, If the profiling has been started, report an error.
func (c *CPUProfile) Start(w io.Writer) error

// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error

type CPUProfileEvent int

const (
    CPUTimer CPUProfileEvent = iota
    CPUCycles
        CPUBranchMisses
        CPUCacheMisses
        CPUInstructions
)

There are three problems: First, FD may be depleted. But that's probably not a problem as this is very rare. Disposal method: Do not handle.

Second, for PMU to count the kernel events, perf_event_open requires the CAP_PERFORM permission or the admin permission, otherwise the behavior of the PMU will be inconsistent with the behavior of the OS timer.

Disposal method:

Ask the user to give the appropriate permissions.
Kernel events are not counted if the permissions are not satisfied.

Third, it is difficult for non-go threads to receive signals from the PMU. Disposal method: If CGO profiling is on, only os-timer events can be used.

Everyone is welcome to brainstorm to move this proposal forward, thanks.

Edit1: Add Start method. Edit2: Added some problem and behavior descriptions that I'm not sure about. Edit3: Change func (c *CPUProfile) SetRate(int) error to func (c *CPUProfile) SetPeriod(int64) error. Rename CPUProfileSource to CPUProfileEvent. Rename func (c *CPUProfile) SetSource(CPUProfileSource) error to func (c *CPUProfile) SetEvent(CPUProfileEvent) error. Add func SetPMUEventPrecise(PMUEventPrecise ) error and type PMUEventPrecise int. Add CPUBranchMisses, CPUCacheMisses and CPUInstructions. Add Disposal methods for the above three problems. Edit4: Change func SetPMUEventPrecise(PMUEventPrecise) error to func (c *CPUProfile) SetPMUEventSkid(PMUEventSkid) error Add func (c *CPUProfile) Stop(w io.Writer) error method Modified some behavior descriptions. Edit5: Modified some behavior descriptions. Changes func (c *CPUProfile) SetEvent(CPUProfileEvent) error to func (c *CPUProfile) SetEvent( CPUProfileEvent, bool) error Edit6: Add method func (c *CPUProfile) SetSampleKernel(bool) error Edit7: updated the behavior description of the Startmethod

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

prattmic commented 2 years ago

To add to the second issue in https://github.com/golang/go/issues/53286#issuecomment-1200714095 (no kernel events), I think that this difference is acceptable for an explicit API (i.e., users must explicitly call SetSource(CPUCycles)), but that it is too big of a difference to automatically change StartCPUProfile to using the PMU when available.

erifan commented 2 years ago

it is too big of a difference to automatically change StartCPUProfile to using the PMU when available.

If the second issue is not a problem, I personally think that this is not a big problem, it is just a behavioral rule. My opinion is the same as yours, StartCPUProfile uses os-timer by default, so we keep the original behavior. And leave the new API to users who need it.

erifan commented 2 years ago

Third, it is difficult for non-go threads to receive signals from the PMU. Currently I haven't found any possible solution.

For this problem I think we can do like this:

If cgo is not used or cgo profiling is not turned on, then this is not a problem.
If cgo profiling is enabled, only os-timer events can be used, treat PMU as unavailable.

rsc commented 2 years ago

It sounds like maybe people are happy with https://github.com/golang/go/issues/53286#issuecomment-1200714095?

Are there any other likely CPUProfileSource values that we expect to support? Right now it's looking like a bool, and if we had a few more it might help us understand the shape of the API better.

prattmic commented 2 years ago

I think we eventually want to support precise sampling (i.e. PEBS, CPUCyclesPrecise mentioned earlier). Though that could be considered more of a 'modifier' that could be set with a different method. I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.

https://github.com/golang/go/issues/53286#issuecomment-1196497300 also mentions interest in instruction sampling, cache misses (which cache level?) and branch misprediction. I am not necessarily convinced one way or another whether we want these specific events, as there is a huge tail of potential events, but I do think we should keep things more open than a boolean.

erifan commented 2 years ago

I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.

Do you mean perf_event_open.precise_ip? We can set four different precision values for it, so if we specify the precision by CPUCyclesPreciseevent, we need to set four different events, maybe we only care about one of them? Another way is to create a new function such as SetPresice(int) or add a precise parameter to the SetRatefunction.

https://github.com/golang/go/issues/53286#issuecomment-1196497300 also mentions interest in instruction sampling, cache misses (which cache level?) and branch misprediction.

I meant the pre-defined hardware event (that is the PMU events ?) of perf list. I also don't have a strong opinion about adding or not adding these events, but if we don't object to adding a few more common PMU events, we might add them so that we don't need to raise another proposal when we need them later. If possible, I would like to add these three events branch-misses, cache-misses and instructions as they are the most commonly used ones.

At present, we define the event as a constant value. I think it is extensible. If we want to support more events later, we only need to add it after this list.

prattmic commented 2 years ago

I don't think it should be completely automatic, as I think users should have a way to require precision, or at least know that they aren't getting it.

Do you mean perf_event_open.precise_ip? We can set four different precision values for it, so if we specify the precision by CPUCyclesPreciseevent, we need to set four different events, maybe we only care about one of them? Another way is to create a new function such as SetPresice(int) or add a precise parameter to the SetRatefunction.

Right, there are lots of ways we could encode precise_ip in the API. By "automatic", I meant that CPUCycles could automatically set the highest support precise_ip level. That is what I think we should not do because then it is not clear how to interpret the results from CPUCycles.

erifan commented 2 years ago

Ok, so maybe we should add another function SetPMUEventPrecise(int). I think it's better to use a separate function to set the precise instead of a parameter of SetRate(int), because then we don't have to worry about the name of this function. And SetRatealso works with os-timer, but we don't need to set precision for os-timer, so it seems more reasonable to separate and name it SetPMUEventPrecise.

rsc commented 2 years ago

There is some discussion here and on #42502 about changing the name from CPUProfile to something slightly more general. But Profile is already taken for the long-running counters like heap and contention. All these profiles are being collected in response to some kind of interrupt based on a CPU event, so the name is not entirely inaccurate. So probably it's fine to leave as is. There's not an obvious better name, and it does match StartCPUProfile etc.

aclements commented 2 years ago

A few nit-picks on the API:

"The meaning of Rate is related to the event type. When the event type is os-timer, it means frequency, and when the event type is cycles, it means period." I'm not sure about overloading one method to mean frequency in some cases and period in the other. I think I would rather this always take a period, possibly as a float64. For any events tied to real-time, you need to specify the base unit of the period. I assume "frequency" is intended to be in Hz, so the period for os-timer could be in "seconds" or maybe in "nanoseconds". This would also allow a frequency less than 1 Hz for very low-rate background profiling.
SetSource: The word "Source" can mean a lot of things. I think a more common term for this generally is "event", and that would match the Linux perf API.

rsc commented 2 years ago

runtime.SetCPUProfileRate is Hz. If we use the word Rate, it should be Hz. But we could use Period, which in time is always nanoseconds and for other events is counting how many of those events go by between samples (including one of the samples).

aclements commented 2 years ago

I proposed always using a period rather than a frequency because I think it would apply to all events, while frequency (in Hz) is only natural for time-based events. For example, while runtime.SetCPUProfileRate is in Hz, the other runtime.Set*ProfileRate functions take periods, for exactly this reason.

If the period for time-based events is in nanoseconds, then SetPeriod can take an int (perhaps int64). In that case, we should probably document how to use SetPeriod with a time.Duration.

erifan commented 2 years ago

OK, I think Russ's idea is great.

If the period for time-based events is in nanoseconds, then SetPeriod can take an int (perhaps int64). In that case, we should probably document how to use SetPeriod with a time.Duration.

Yes, I think it is better to make the unit of time-based events default to nanoseconds. And we need to document this point.

erifan commented 2 years ago

Based on the discussion above, I updated issuecomment-1200714095, please check again, thanks.

Another question is should we set default values? What's the default event, period and precise for an uninitialized CPUProfile type variable?

felixge commented 2 years ago

Based on the discussion above, I updated https://github.com/golang/go/issues/53286#issuecomment-1200714095, please check again, thanks.

I have a quick question about the proposed API: Why is there no Stop() method? I see that the issue description is talking about it:

It would be better if a corresponding Stop method was exported, but it is not necessary, we can correctly stop the profiling by the cached configuration.

But I don't understand what "stop the profiling by the cached configuration" means. Sorry if I'm missing something.

erifan commented 2 years ago

But I don't understand what "stop the profiling by the cached configuration" means

I meant we can stop PMU profiling correctly with the existing pprof.StopCPUProfile() function with some internal modifications. But we can also stop PMU profiling via a new exported Stop method, and I think both are fine.

felixge commented 2 years ago

I think it would be a little weird to call the Start() method on a *CPUProfile instance and then use a package function to stop it. So I'd be in favor of adding a Stop() method to *CPUProfile.

Another question: #42502 (comment) proposes a NewCPUProfile constructor for *CPUProfile, but I don't see it in this proposal. Was it omitted for brevity, or are you suggesting to not have a constructor? I feel less strongly about this one.

Perhaps it would help to add a full example to the proposal that demonstrates how to use the new API to produce a profile?

erifan commented 2 years ago

Was it omitted for brevity,

Yes, I think this is the case.

Perhaps it would help to add a full example to the proposal that demonstrates how to use the new API to produce a profile?

The following are two examples, one for CPUTimerevent and one for CPUCycles event.

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal("could not create CPU profile: ", err)
        }
        defer f.Close()
        var cpuprof pprof.CPUProfile
        if err := cpuprof.SetEvent(pprof.CPUTimer); err != nil {
            log.Fatal("could not set CPUTimer event: ", err)
        }
        if err := cpuprof.SetPeriod(int64(1e9 /100)); err != nil { // assume profiling at 100HZ frequency
            log.Fatal("Failed to set the profiling period: ", err)
        }
        if err := cpuprof.Start(); err != nil {
            log.Fatal("could not start CPU profile: ", err)
        }
        defer pprof.StopCPUProfile()  // or defer cpuprof.Stop()
    }
    // ... rest of the program
}

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal("could not create CPU profile: ", err)
        }
        defer f.Close()
        var cpuprof pprof.CPUProfile
        if err := cpuprof.SetEvent(pprof.CPUCycles); err != nil {
            log.Fatal("could not set CPUTimer event: ", err)
        }
        if err := cpuprof.SetPeriod(10000); err != nil { // assume sampling every 10000 cycles
            log.Fatal("Failed to set the profiling period: ", err)
        }
        if err := cpuprof.SetPMUEventPrecise(pprof.REQUEST0); err != nil {
            log.Fatal("Failed to set the requested precision: ", err)
        }
        if err := cpuprof.Start(); err != nil {
            log.Fatal("could not start CPU profile: ", err)
        }
        defer pprof.StopCPUProfile() // or defer cpuprof.Stop()
    }
    // ... rest of the program
}

aclements commented 2 years ago

Given that a program could reasonably have multiple CPU profiles going at once, either for different events or even just for different rates on the same event (e.g., for background profiling), I think the API should definitely have a Stop method.

rsc commented 2 years ago

It sounds like everyone is in favor of:

// 1, No default event type or default to os-timer or cycles or the best available one ?
// 2, No default period or based on the default event ?
type CPUProfile struct { ... }

// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
//  for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a maximum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the maximum threshold.
// 4, The event type needs to be set first, otherwise an error is reported and returned.
// 5, If c is nil, report an error and return.
// 6, If the profiling has been started, can we reset the profiling rate by call SetRate ?
func (c *CPUProfile) SetPeriod(int64) error

// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the source by call SetSource.
// 3, If the source is not available, report and error and return.
// 4, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required,
// 5, If cgo is used, only the CPUTimer event can be used.
func (c *CPUProfile) SetEvent(CPUProfileEvent) error

// 1, If the method receiver c is nil, report an error, stop.
// 2, If the profiling has been started, return nil directly.
func (c *CPUProfile) Start(w io.Writer) error

type CPUProfileEvent int

const (
    CPUTimer CPUProfileEvent = iota
    CPUCycles
        CPUBranchMisses
        CPUCacheMisses
        CPUInstructions
)

What about the PMU API?

type PMUEventPrecise int

const (
    ARBITRARY PMUEventPrecise = iota // 0  SAMPLE_IP can have arbitrary skid.
    CONSTANT // 1  SAMPLE_IP must have constant skid.
        REQUEST0 // 2  SAMPLE_IP requested to have 0 skid.
        MUST0 // 3  SAMPLE_IP must have 0 skid.
)

// SetPMUEventPrecise sets the profiling precise of PMU events, which controls the amount of skid.
// 1, For other events, the function is unavailable and returns nil.
// 2, Returns an error if the requested precision is unreachable.
// 3, If this function is not explicitly called, the default precision is ARBITRARY.
func SetPMUEventPrecise(PMUEventPrecise) error

These names are not very Go-like, and I'm not sure I understand why this is a global function instead of a method on CPUProfile. Is this API still part of the proposal?

aclements commented 2 years ago

Overall that looks pretty good to me. A few comments:

// 3, The profiling period has a maximum threshold to prevent the profiling rate from being too high. // When the parameter value exceeds the threshold, it will be adjusted to the maximum threshold.

Since this is a period, this should be a minimum threshold.

What about the PMU API?

I'm not convinced precision needs to be part of an initial API, though maybe it's worth agreeing on an API now and separately deciding "do we want precision at all?"

If we have API for precision, it should definitely be a method of CPUProfile because it controls per-profile configuration, and obviously it should use more Go-like names. I'd also be inclined to phrase it directly in terms of "skid" because that's what you're controlling.

Finally, I think the API should also include a Stop method:

// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error

Even if we don't initially support multiple simultaneous CPU profiles, this makes it possible to support them in the future. Furthermore, it makes the new API entirely self-contained, rather than depending on a package-level function for ultimately stopping the profile. To that end, I think StopCPUProfile should not stop any profiles started by the CPUProfile API. Probably, StartCPUProfile and StopCPUProfile would become trivial wrappers around a global (unexported) CPUProfile.

prattmic commented 2 years ago

// 1, No default event type or default to os-timer or cycles or the best available one ?

I think we should default to os-timer since os-timer and cycle will have different behavior for time spent in the kernel. But given this is a new API, I could go either way as long as StartCPUProfile always uses os-timer.

// 2, No default period or based on the default event ?

IMO, providing a default period for each event would be nice. (SetEvent thus overrides previous calls to SetPeriod).

// 6, If the profiling has been started, can we reset the profiling rate by call SetRate ?

IMO, changing Rate or Event during profiling should be an error. The pprof proto can only list 1 period per sample type, plus I'm not sure why you'd want to change it?

I also agree with @aclements that there should be a Stop method and precision should be a method.

aclements commented 2 years ago

The pprof proto can only list 1 period per sample type, plus I'm not sure why you'd want to change it?

I agree that it should be an error to change Rate or Event on a running profile, but wanted to point out that you can, in effect, change the rate of a running profile by choosing a common denominator period for the profile. For example, set Profile.sample_type to a small unit like nanoseocnds and set Profile.period to a small value like 1. The account for the actual period by varying the Sample.value of each sample.

If we support starting multiple profiles for the same event but with different periods (e.g., to support background profiling), then we may need to do a trick like this internally.

erifan commented 2 years ago

I'm not sure I understand why this is a global function instead of a method on CPUProfile. Is this API still part of the proposal?

Sorry, this is a mistake, it definitely should be a method of CPUProfile.

Since this is a period, this should be a minimum threshold.

Yes, updated.

Finally, I think the API should also include a Stop method:

Ok, I have added it. And yes StartCPUProfileand StopCPUProfilewould become trivial wrappers around a global (unexported) CPUProfile.

In addition, I also updated some behavior descriptions according to @prattmic 's comments.

Finally let me discuss whether to add the SetPMUEventSkidmethods, and how to name it and the related constants. Personally I tend not to add this method and choose the smallest skid available in the implementation. Of course we will document this in the API. issuecomment-1200714095

aclements commented 2 years ago

Finally let me discuss whether to add the SetPMUEventSkid methods, and how to name it and the related constants. Personally I tend not to add this method and choose the smallest skid available in the implementation.

I agree with that. We can add it in the future if we want.

For choosing the smallest skid available, is that equivalent to SKIDRequest0? Or does that fail if it can't get at least SKIDConstant? (This is a property of perf, and we can create our own behavior either way, just curious.)

aclements commented 2 years ago

Or does that fail if it can't get at least SKIDConstant?

The kernel code for this is, unsurprisingly, confusing, but there are a lot of conditions that just look for precise_ip != 0, and I didn't find any obvious "downgrading" code, so I think SKIDRequest0 will fail rather than downgrade to SKIDArbitrary.

aclements commented 2 years ago

A few nits:

// 4, The event type needs to be set first, otherwise an error is reported and returned.
func (c *CPUProfile) SetPeriod(int64) error

If this is the case, it seems SetEvent should explicitly reset the period, in case you call SetEvent, then SetPeriod, then SetEvent.

// 5, If c is nil, report an error and return.
func (c *CPUProfile) SetPeriod(int64) error

We don't typically specify that a nil receiver results in an error.

// 4, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required,
func (c *CPUProfile) SetEvent(CPUProfileEvent) error

This should probably be specified as part of the events.

golang / go

proposal: runtime/pprof: extend profile configuration to support perf events #53286

42502 Proposes configurable CPU profiling via NewCPUProfile + cpuProfile.Start and the proposal has been accepted. But the proposal does not appear to approve any exported fields, nor does it support any perf event extensions. So I propose to extend this proposal to support configuring perf events.

Windows

Darwin

FreeBSD