google / perfetto

Performance instrumentation and tracing for Android, Linux and Chrome (read-only mirror of https://android.googlesource.com/platform/external/perfetto/)
https://www.perfetto.dev
Apache License 2.0
2.75k stars 343 forks source link

[Feature] Support callstack capturing for cpu scheduling events #177

Open ivberg opened 2 years ago

ivberg commented 2 years ago

It would be really nice for Perfetto to support optional call stack capturing for cpu scheduling events (sched_switch, sched_wakeup, sched_waking). This would be used to get more detail about what code is waiting or alerting threads. This should allow deeper root causing of issues, rather than just relying on process/thread & other flow / slice events being logged for the threads in question.

It appears Perfetto already has callstack support in the data format, at least for CPU profiling / callstack sampling. So likely the callstacks could be saved in the same table. https://perfetto.dev/docs/reference/trace-config-proto#PerfEventConfig.CallstackSampling https://perfetto.dev/docs/analysis/sql-tables#callstack-profilers

I don't see any reference to this callstack on cpu scheduling events currently supported for Linux FTrace: https://perfetto.dev/docs/reference/trace-config-proto#FtraceConfig

Nor this feature request is not mentioned here in this link; although it does outline there is already support for showing the process & thread for the wakeup which is nice: https://perfetto.dev/docs/data-sources/cpu-scheduling

For reference: It appears this capability exists in Linux as it already exists in LTTng both user-mode and kerne-mode, so in theory Perfetto could support this as well. See https://lttng.org/docs/v2.13/#doc-adding-context https://lttng.org/man/1/lttng-add-context/v2.13/ https://archive.eclipse.org/tracecompass/doc/stable/org.eclipse.tracecompass.doc.user/LTTng-Kernel-Analysis.html (Process Wait Analysis)

Callstack capture on scheduling events is widely/commonly used in detailed Windows perf analysis workflows. Windows supports callstack being configured on individual kernel events; and it would be nice to have similar functionality in a generic way in order to pick and choose stackwalk individual events. https://docs.microsoft.com/en-us/windows-hardware/test/wpt/stackwalk

primiano commented 2 years ago

It actually does if you combine the callstack sampler (Traced_perf) with tracepoints.

It's an extremely new feature (the all "linux.perf" data source is undocumented and still experimental, works only on the latest Android 12 version). I will leave @rsavitski to comment more, but I think you can do what you want with something like this on Linux/Android:

      data_sources: {
        config {
          name: "linux.perf"
          perf_event_config {
            timebase {
              tracepoint {
                name: "sched:sched_switch"
              }
              frequency: 50
            }
            callstack_sampling {
              kernel_frames: true
            }
          }
        }
      }

The UI support for this is still very primitive (it just aggregates everything, you have to be on the autopush channel via https://ui.perfetto.dev/#!/flags) but we are investing a lot into that so more stuff will happen over time.

ivberg commented 2 years ago

Very nice @primiano and thx for the quick reply! We will have to try this out; and of course it would be great if this were officially documented further. Can even make a stretch wish to be supported in the Perfetto UI (Record New Trace).

Questions:

  1. When you say extremely new feature, w.r.t trace capture does that imply a certain version of Android or Linux?
  2. How does the frequency fit into this? If I am understanding your example and the docs correctly, is this saying sample sched_switch stacks every ~50Hz? If so, this is ok - but we would actually want to be able to configure a callstack on for every certain type of sched event, and leave the frequency sampling for CPU Sampling.

From https://perfetto.dev/docs/reference/trace-config-proto#PerfEvents.Timebase PerfEvents.Timebase - frequency - Per-cpu sampling frequency in Hz, as requested from the kernel.

primiano commented 2 years ago

When you say extremely new feature, w.r.t trace capture does that imply a certain version of Android or Linux?

The feature itself sits on top of man perf_event_open which is a quite well established Linux kernel API (the same that the perf tool and many other use).

On Android, this depends on a daemon (traced_perf) which we introduced only in Android 12. On Linux, assuming root or elevated capabilities, it works with a recent version of perfetto (see CHANGELOG)

There are a bunch of rough edges. For instance for now the symbolizer can do only dwarf-based unwinding, so requires some debugging symbols to be in the binary.

How does the frequency fit into this?

Yeah I know where the doubt comes from. There are actually two field you can use here, period and frequency. For my example above, "period" would have been better (Which is really the classic overflow count).

The comments here say it all.

TL;DR

ivberg commented 2 years ago

@primiano Once callstack tracepoint on sched:sched_switch is configured, do you know where the stacks would show up in the trace processor SQL tables?

I ask because in theory I took an Android 12 trace both with CPU sampling configured and sched_switch callstacks. I am not where it should show up. I looked in perf_sample but either I was missing the events or they were being hidden a bit by the cpu sampling events

primiano commented 2 years ago

From https://perfetto.dev/docs/analysis/sql-tables#callstack-profilers should show up in the perf_sample table.

How big is the trace? Can you paste the config used somewhere? Any relevant error in logcat?

Note: if it was a 'user' build then the process you are trying to profile needs to be marked as debbugged or profileable in its manifest.

On 'userdebug' images instead you can profile pretty much everything (% sensitive processes like keystore or the like)

ivberg commented 2 years ago

Actually they are showing up in perf_sample which I can see once walking the stack. They seem to have the "__traceiter_sched_switch" function but there is ONLY this function. However, I cannot seem to get CPU Sampling AND sched_switch working at the same time which would be quite useful. Here is the snippet of the config

data_sources: { config { name: "linux.perf" perf_event_config { timebase { tracepoint { name: "sched:sched_switch" } frequency: 50 } callstack_sampling { kernel_frames: true } } } }

ivberg commented 2 years ago

BTW, Ideally we would want to be able to have a config that says capture EVERY sched_switch event AND also CPU sample every ~1ms or some frequency.

primiano commented 2 years ago

They seem to have the "__traceiter_sched_switch" function but there is ONLY this function

Do you mean that the callstack has only one frame? Seems weird.

However, I cannot seem to get CPU Sampling AND sched_switch working at the same time

I am not sure that the underlying syscall (perf_event_open) would support that. @rsavitski would know

ivberg commented 2 years ago

They seem to have the "__traceiter_sched_switch" function but there is ONLY this function

Do you mean that the callstack has only one frame? Seems weird.

No I just mean that's the leaf function / instruction pointer symbol when enabling tracepoint sched:sched_switch. By ONLY this function I mean there is no other different leaf functions that you get when performing CPU callstack sampling that show up in the trace.

However, I cannot seem to get CPU Sampling AND sched_switch working at the same time

I am not sure that the underlying syscall (perf_event_open) would support that. @rsavitski would know

That would be awesome if you can check @rsavitski . At the very least I hope perf_event_open can support both cpu sampling and scheduling callstacks with the same Timebase frequency or period settings. But again ideally with different Timebase settings which is much more practical

primiano commented 2 years ago

No I just mean that's the leaf function / instruction pointer symbol when enabling tracepoint sched:sched_switch. By ONLY this function I mean there is no other different leaf functions that you get when performing CPU callstack sampling that show up in the trace.

Hmm I still don't understand sorry. Can you show me a screenshot?

When you sample on sched_switch I expect to see callstack that start from the process main(), go down into whatever the process was doing at the time of the sched switch and eventually reach all the wait into the kernel until they reach schedule() and trace_sched_switch().

ivberg commented 2 years ago

Yes, that is what we see with the last function being similar to what you mentioned (line 21) image

What I am saying is that the CPU Sampling events are missing. As evidenced if you change the UI to group by Function (leaf node of the stack) then there is only 1 function present in the trace image

Here is what a normal CPU Sampling trace might look like if you grouped by function first, there would obviously be a lot of different functions (corresponding to callstack_sampling) image

Again, the problem is that I am trying to capture both CPU Sampling AND scheduling callstacks in the same trace and it isn't working well yet.

primiano commented 2 years ago

Ooooh I see. IIUC what you are saying is:

"If I stack sample on Time AND sched_switch, all the samples seem to be only coming from sched_switch. I can tell that by observing that the leafmost frame is always __traceiter_sched_switch".

Again, the problem is that I am trying to capture both CPU Sampling AND scheduling callstacks in the same trace and it isn't working well yet.

Yep I think that-s the issue.

ivberg commented 2 years ago

Yep I think that-s the issue.

Yep - that's it! Sorry if it was confusing. At least it's possible to gather sched_switch stacks at all, so we are halfway there ;) We would eventually require these stacks for meaningful critical path analysis. If the critical path (of a thread executing a scenario) includes CPU usage, which it often does - then the CPU Sampling stacks are also usually required to also provide useful data and root cause to a perf analyst.

ivberg commented 2 years ago

Also - I looked several times but could not find a way to easily distinguish CPU Sampling events in perf_sample from the sched_switch events which also show in perf_sample. There should be a way to distinguish these maybe based on the type column (or some other column).

Otherwise right now the only way to distinguish them seems to be by walking the stack via the linked stack_profile_callsite and stack_profile_frame and getting the function name for the first (leaf) frame. This would only be a heuristic though, and I would propose making the SQL API easier to distinguish between the different perf_sample types of events.

FYI @yashsharma-msft, @vibhorbhattMS

To summarize the issues we have found so far or are proposing:

  1. Cannot seem to capture both CPU Sampling stacks and sched_switch stacks in the same trace
  2. Would like to be able to have trace capture config to: sample CPU Sampling stacks at some interval (say 1ms) and separately capture stack on every other type of events like sched_switch
  3. Both CPU sample callstacks and sched_switch callstacks show up in perf_sample SQL API and we need a way to be able to distinguish them from eachother
primiano commented 2 years ago

Would like to be able to have trace capture config to: sample CPU Sampling stacks at some interval (say 1ms) and separately capture stack on every other type of events like sched_switch

Note that I believe today you can do this by instantiating two separate instances of "linux.perf" data source in the same trace (I didn't try but should work)..

Both CPU sample callstacks and sched_switch callstacks show up in perf_sample SQL API and we need a way to be able to distinguish them from eachother

Yeah I agree I think there might be some missing column here.

@rsavitski ping (2)

ivberg commented 2 years ago

I tried the following but it didn't seem to work as I only received CPU sampling events/stacks. I think the sched_switch config is good by itself as it worked separately standalone.

I didn't see any errors when sending the trace config to perfetto

data_sources: {
    config {
        name: "linux.perf"
        perf_event_config {
        timebase {
            tracepoint {
              name: "sched:sched_switch"
            }
            period: 1
        }
        }
    }
}
data_sources: {
    config {
        name: "linux.perf"
        perf_event_config {
        timebase {
            frequency: 1
        }
        callstack_sampling {
            kernel_frames: true
        }
        }
    }
}
benteitler commented 1 year ago

Was there ever a resolution to recording both scheduling events and time based sampling at the same time?