Open ivberg opened 3 years ago
It actually does if you combine the callstack sampler (Traced_perf) with tracepoints.
It's an extremely new feature (the all "linux.perf" data source is undocumented and still experimental, works only on the latest Android 12 version). I will leave @rsavitski to comment more, but I think you can do what you want with something like this on Linux/Android:
data_sources: {
config {
name: "linux.perf"
perf_event_config {
timebase {
tracepoint {
name: "sched:sched_switch"
}
frequency: 50
}
callstack_sampling {
kernel_frames: true
}
}
}
}
The UI support for this is still very primitive (it just aggregates everything, you have to be on the autopush channel via https://ui.perfetto.dev/#!/flags) but we are investing a lot into that so more stuff will happen over time.
Very nice @primiano and thx for the quick reply! We will have to try this out; and of course it would be great if this were officially documented further. Can even make a stretch wish to be supported in the Perfetto UI (Record New Trace).
Questions:
From https://perfetto.dev/docs/reference/trace-config-proto#PerfEvents.Timebase PerfEvents.Timebase - frequency - Per-cpu sampling frequency in Hz, as requested from the kernel.
When you say extremely new feature, w.r.t trace capture does that imply a certain version of Android or Linux?
The feature itself sits on top of man perf_event_open
which is a quite well established Linux kernel API (the same that the perf
tool and many other use).
On Android, this depends on a daemon (traced_perf) which we introduced only in Android 12. On Linux, assuming root or elevated capabilities, it works with a recent version of perfetto (see CHANGELOG)
There are a bunch of rough edges. For instance for now the symbolizer can do only dwarf-based unwinding, so requires some debugging symbols to be in the binary.
How does the frequency fit into this?
Yeah I know where the doubt comes from.
There are actually two field you can use here, period
and frequency
. For my example above, "period" would have been better (Which is really the classic overflow count).
The comments here say it all.
TL;DR
@primiano Once callstack tracepoint on sched:sched_switch is configured, do you know where the stacks would show up in the trace processor SQL tables?
I ask because in theory I took an Android 12 trace both with CPU sampling configured and sched_switch callstacks. I am not where it should show up. I looked in perf_sample but either I was missing the events or they were being hidden a bit by the cpu sampling events
From https://perfetto.dev/docs/analysis/sql-tables#callstack-profilers should show up in the perf_sample
table.
How big is the trace? Can you paste the config used somewhere? Any relevant error in logcat?
Note: if it was a 'user' build then the process you are trying to profile needs to be marked as debbugged or profileable in its manifest.
On 'userdebug' images instead you can profile pretty much everything (% sensitive processes like keystore or the like)
Actually they are showing up in perf_sample which I can see once walking the stack. They seem to have the "__traceiter_sched_switch" function but there is ONLY this function. However, I cannot seem to get CPU Sampling AND sched_switch working at the same time which would be quite useful. Here is the snippet of the config
data_sources: { config { name: "linux.perf" perf_event_config { timebase { tracepoint { name: "sched:sched_switch" } frequency: 50 } callstack_sampling { kernel_frames: true } } } }
BTW, Ideally we would want to be able to have a config that says capture EVERY sched_switch event AND also CPU sample every ~1ms or some frequency.
They seem to have the "__traceiter_sched_switch" function but there is ONLY this function
Do you mean that the callstack has only one frame? Seems weird.
However, I cannot seem to get CPU Sampling AND sched_switch working at the same time
I am not sure that the underlying syscall (perf_event_open) would support that. @rsavitski would know
They seem to have the "__traceiter_sched_switch" function but there is ONLY this function
Do you mean that the callstack has only one frame? Seems weird.
No I just mean that's the leaf function / instruction pointer symbol when enabling tracepoint sched:sched_switch. By ONLY this function I mean there is no other different leaf functions that you get when performing CPU callstack sampling that show up in the trace.
However, I cannot seem to get CPU Sampling AND sched_switch working at the same time
I am not sure that the underlying syscall (perf_event_open) would support that. @rsavitski would know
That would be awesome if you can check @rsavitski . At the very least I hope perf_event_open can support both cpu sampling and scheduling callstacks with the same Timebase frequency or period settings. But again ideally with different Timebase settings which is much more practical
No I just mean that's the leaf function / instruction pointer symbol when enabling tracepoint sched:sched_switch. By ONLY this function I mean there is no other different leaf functions that you get when performing CPU callstack sampling that show up in the trace.
Hmm I still don't understand sorry. Can you show me a screenshot?
When you sample on sched_switch I expect to see callstack that start from the process main(), go down into whatever the process was doing at the time of the sched switch and eventually reach all the wait into the kernel until they reach schedule() and trace_sched_switch().
Yes, that is what we see with the last function being similar to what you mentioned (line 21)
What I am saying is that the CPU Sampling events are missing. As evidenced if you change the UI to group by Function (leaf node of the stack) then there is only 1 function present in the trace
Here is what a normal CPU Sampling trace might look like if you grouped by function first, there would obviously be a lot of different functions (corresponding to callstack_sampling)
Again, the problem is that I am trying to capture both CPU Sampling AND scheduling callstacks in the same trace and it isn't working well yet.
Ooooh I see. IIUC what you are saying is:
"If I stack sample on Time AND sched_switch, all the samples seem to be only coming from sched_switch. I can tell that by observing that the leafmost frame is always __traceiter_sched_switch".
Again, the problem is that I am trying to capture both CPU Sampling AND scheduling callstacks in the same trace and it isn't working well yet.
Yep I think that-s the issue.
Yep I think that-s the issue.
Yep - that's it! Sorry if it was confusing. At least it's possible to gather sched_switch stacks at all, so we are halfway there ;) We would eventually require these stacks for meaningful critical path analysis. If the critical path (of a thread executing a scenario) includes CPU usage, which it often does - then the CPU Sampling stacks are also usually required to also provide useful data and root cause to a perf analyst.
Also - I looked several times but could not find a way to easily distinguish CPU Sampling events in perf_sample from the sched_switch events which also show in perf_sample. There should be a way to distinguish these maybe based on the type column (or some other column).
Otherwise right now the only way to distinguish them seems to be by walking the stack via the linked stack_profile_callsite and stack_profile_frame and getting the function name for the first (leaf) frame. This would only be a heuristic though, and I would propose making the SQL API easier to distinguish between the different perf_sample types of events.
FYI @yashsharma-msft, @vibhorbhattMS
To summarize the issues we have found so far or are proposing:
Would like to be able to have trace capture config to: sample CPU Sampling stacks at some interval (say 1ms) and separately capture stack on every other type of events like sched_switch
Note that I believe today you can do this by instantiating two separate instances of "linux.perf" data source in the same trace (I didn't try but should work)..
Both CPU sample callstacks and sched_switch callstacks show up in perf_sample SQL API and we need a way to be able to distinguish them from eachother
Yeah I agree I think there might be some missing column here.
@rsavitski ping (2)
I tried the following but it didn't seem to work as I only received CPU sampling events/stacks. I think the sched_switch config is good by itself as it worked separately standalone.
I didn't see any errors when sending the trace config to perfetto
data_sources: {
config {
name: "linux.perf"
perf_event_config {
timebase {
tracepoint {
name: "sched:sched_switch"
}
period: 1
}
}
}
}
data_sources: {
config {
name: "linux.perf"
perf_event_config {
timebase {
frequency: 1
}
callstack_sampling {
kernel_frames: true
}
}
}
}
Was there ever a resolution to recording both scheduling events and time based sampling at the same time?
Just wanted to add a comment here: for local usage, you should now be able to use simpleperf + the recently added Gecko import support to import perf profiles with this feature. We sill don't support this through Perfetto natively but maybe it will unblock some usecases.
It would be really nice for Perfetto to support optional call stack capturing for cpu scheduling events (sched_switch, sched_wakeup, sched_waking). This would be used to get more detail about what code is waiting or alerting threads. This should allow deeper root causing of issues, rather than just relying on process/thread & other flow / slice events being logged for the threads in question.
It appears Perfetto already has callstack support in the data format, at least for CPU profiling / callstack sampling. So likely the callstacks could be saved in the same table. https://perfetto.dev/docs/reference/trace-config-proto#PerfEventConfig.CallstackSampling https://perfetto.dev/docs/analysis/sql-tables#callstack-profilers
I don't see any reference to this callstack on cpu scheduling events currently supported for Linux FTrace: https://perfetto.dev/docs/reference/trace-config-proto#FtraceConfig
Nor this feature request is not mentioned here in this link; although it does outline there is already support for showing the process & thread for the wakeup which is nice: https://perfetto.dev/docs/data-sources/cpu-scheduling
For reference: It appears this capability exists in Linux as it already exists in LTTng both user-mode and kerne-mode, so in theory Perfetto could support this as well. See https://lttng.org/docs/v2.13/#doc-adding-context https://lttng.org/man/1/lttng-add-context/v2.13/ https://archive.eclipse.org/tracecompass/doc/stable/org.eclipse.tracecompass.doc.user/LTTng-Kernel-Analysis.html (Process Wait Analysis)
Callstack capture on scheduling events is widely/commonly used in detailed Windows perf analysis workflows. Windows supports callstack being configured on individual kernel events; and it would be nice to have similar functionality in a generic way in order to pick and choose stackwalk individual events. https://docs.microsoft.com/en-us/windows-hardware/test/wpt/stackwalk