pytorch / kineto

A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.

[RFC] Profiler differentiate runs/traces feature request #342

Open guotuofeng opened 3 years ago

guotuofeng commented 3 years ago

Profiler differentiate runs/traces feature request

Scenarios

Goals

Here are a couple of typical scenarios in which data scientists would like to compare runs after tweaking a baseline model, to see whether the new model shows obvious changes.

Non-Goals

Design

The design covers the six major scenarios listed in the section above. The plugin UI will have two modes: normal mode and diff (comparison) mode. In diff mode, the UI would look like the following, allowing the user to select the runs to compare.

image

After users select both baseline and experimental runs and click the diff button, the diff UI will be loaded.

Overview

The overview UI will show summary information such as device and memory, GPU utilization, memory usage, step time, etc.

| Category | Device | Memory | GPU Utilization | Avg Memory Usage | Peak Memory Usage | Avg Step Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | Tesla V100-DGXS-32GB | 31.74 GB | 30% | 1.50 GB | 5.00 GB | 9.3 |
| Experimental | Tesla V100-DGXS-32GB | 31.74 GB | 17% | 1.70 GB | 4.90 GB | 11.2 |
| Delta | N/A | N/A | -13% | 200.00 MB | -100.00 MB | 1.9 |
| Delta (%) | N/A | N/A | -43.33% | 13.33% | -2.00% | 20.43% |
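The Delta and Delta (%) rows can be reproduced with a small helper. This is a minimal sketch, not the plugin's implementation: the function name and metric keys are hypothetical, and it assumes Delta (%) is computed relative to the baseline value. Note that the GPU utilization delta is in percentage points, while its Delta (%) is the relative change.

```python
def delta_row(baseline, experimental):
    """Return (delta, delta_percent) for one numeric overview metric."""
    delta = experimental - baseline
    # Assumption: Delta(%) is relative to the baseline; undefined if baseline is 0.
    percent = None if baseline == 0 else delta / baseline * 100.0
    return delta, percent


# (baseline, experimental) values copied from the overview table.
metrics = {
    "gpu_utilization_percent": (30.0, 17.0),
    "avg_memory_gb": (1.50, 1.70),
    "peak_memory_gb": (5.00, 4.90),
    "avg_step_time_ms": (9.3, 11.2),
}

overview = {name: delta_row(b, e) for name, (b, e) in metrics.items()}
for name, (delta, percent) in overview.items():
    print(f"{name}: delta={delta:+.2f}, delta%={percent:+.2f}%")
```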

Diff View

In the diff view, we need to split each run into comparable pieces, with each piece aligned on a logical timeline. For example,

image

We can split the two example runs in the diagram above. Missing parts are left alone during the comparison. Note: functional.relu is for illustration purposes only; it is possible that all functional calls will be missing. After we align the run execution timelines logically, we can compare the absolute execution time of each logical part, which gives the following chart. Execution time matching uses the critical path time, which means that in most cases we use CPU time for CPU tasks and GPU time for GPU tasks.

image

For each part, we can get the following difference line in execution order.

image
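One way to picture the alignment step is to match the two runs by module name in execution order and compute a per-part difference. The sketch below is only an illustration: the RFC does not specify the alignment algorithm, so using the standard library's `difflib.SequenceMatcher` is an assumption, and the module names and durations are made up. Parts present in only one run are skipped, as described above.

```python
from difflib import SequenceMatcher


def align_runs(baseline, experimental):
    """Align two runs given as lists of (module_name, duration) in execution order.

    Returns (name, baseline_duration, experimental_duration, diff) tuples for
    the parts found in both runs; unmatched parts are left out.
    """
    base_names = [name for name, _ in baseline]
    exp_names = [name for name, _ in experimental]
    matcher = SequenceMatcher(a=base_names, b=exp_names, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            name, base_time = baseline[block.a + offset]
            _, exp_time = experimental[block.b + offset]
            aligned.append((name, base_time, exp_time, exp_time - base_time))
    return aligned


# Hypothetical runs: functional.relu is missing from the experimental run.
baseline_run = [("conv1", 3.0), ("functional.relu", 0.5), ("fc1", 2.0), ("fc2", 1.0)]
experimental_run = [("conv1", 3.5), ("fc1", 1.5), ("fc2", 1.2)]

aligned_parts = align_runs(baseline_run, experimental_run)
for name, base_time, exp_time, diff in aligned_parts:
    print(f"{name}: baseline={base_time}, experimental={exp_time}, diff={diff:+.2f}")
```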

Users can zoom in on a specific aligned part by clicking it (exit the zoom by clicking a blank region?). For example, the top module's forward can be zoomed into its submodule view, recursively. When the user selects a block, e.g. top module.forward, the detailed comparison view for the selected blocks will be shown.

If there are gaps between the aligned blocks (e.g. unknown code such as functional calls, or pure CPU code such as time.sleep), a blank block should be inserted with a name like “unknown”, meaning that the time does not belong to any module.

Note: for simplicity, we only show the diff view at the nn.Module level rather than for the underlying operators, because the enormous number of operators would distract users. The diff view covers scenarios 1, 3, 4, and 6.
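The gap handling could look roughly like the following. This is a sketch under assumptions: the block representation `(name, start, end)`, the function name, and the `min_gap` threshold are all hypothetical, and blocks are assumed to be sorted by start time and non-overlapping.

```python
def fill_gaps(blocks, run_start, run_end, min_gap=0.0):
    """Insert "unknown" blocks wherever no module covers the timeline.

    blocks: list of (name, start, end), sorted by start and non-overlapping.
    """
    filled = []
    cursor = run_start
    for name, start, end in blocks:
        if start - cursor > min_gap:
            # Time between the previous block and this one belongs to no module.
            filled.append(("unknown", cursor, start))
        filled.append((name, start, end))
        cursor = max(cursor, end)
    if run_end - cursor > min_gap:
        filled.append(("unknown", cursor, run_end))
    return filled


# Hypothetical module blocks with a gap between them and one at the end.
timeline = fill_gaps([("conv1", 0, 3), ("fc1", 5, 7)], run_start=0, run_end=8)
for block in timeline:
    print(block)
```

A non-zero `min_gap` would suppress tiny gaps that only add noise to the view; whether that is desirable is a design choice the RFC leaves open.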

Operator/Kernel view

The operator/kernel view will show a summary of operators/kernels for the baseline and experimental runs. Each column is sortable and filterable. If the user selects specific blocks, only the related stats will be shown.

| Operator | Baseline Calls | Exp Calls | Delta Calls | Delta Calls % | Baseline Self Duration | Exp Self Duration | Delta Self Duration | Delta Self Duration % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aten::empty | 100 | 150 | 50 | 50.00% | 138 | 140 | 2 | 1.45% |
| aten::zeros | 120 | 141 | 21 | 17.50% | 72 | 100 | 28 | 38.89% |
| aten::zero_ | 411 | 531 | 120 | 29.20% | 53 | 59 | 6 | 11.32% |
| aten::view | 31 | 14 | -17 | -54.84% | 46 | 84 | 38 | 82.61% |

We can extend the operator view with the following columns; each column will have four sub-columns:

The kernel view follows the same pattern. Scenarios 2 and 5 are covered by the operator and kernel views.
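The comparison rows in the operator table can be derived from per-run summaries as sketched below. This is an illustration only: the input shape `{operator: (calls, self_duration)}`, the function names, and the row keys are hypothetical, and an operator missing from one run is assumed to count as zero calls and zero duration.

```python
def percent_change(delta, baseline):
    """Relative change in percent, or None when the baseline is zero."""
    return None if baseline == 0 else delta / baseline * 100.0


def diff_operator_stats(baseline, experimental):
    """Build comparison rows from {operator: (calls, self_duration)} mappings."""
    rows = []
    for op in sorted(set(baseline) | set(experimental)):
        # Assumption: an operator absent from a run contributes zeros.
        base_calls, base_duration = baseline.get(op, (0, 0))
        exp_calls, exp_duration = experimental.get(op, (0, 0))
        delta_calls = exp_calls - base_calls
        delta_duration = exp_duration - base_duration
        rows.append({
            "operator": op,
            "delta_calls": delta_calls,
            "delta_calls_pct": percent_change(delta_calls, base_calls),
            "delta_self_duration": delta_duration,
            "delta_self_duration_pct": percent_change(delta_duration, base_duration),
        })
    return rows


# Values copied from the operator table.
baseline_ops = {"aten::empty": (100, 138), "aten::zeros": (120, 72),
                "aten::zero_": (411, 53), "aten::view": (31, 46)}
experimental_ops = {"aten::empty": (150, 140), "aten::zeros": (141, 100),
                    "aten::zero_": (531, 59), "aten::view": (14, 84)}

operator_rows = diff_operator_stats(baseline_ops, experimental_ops)
for row in operator_rows:
    print(row)
```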

Work Items

The following changes or requirements are needed for the diff view features to align the logical timeline.

Open Issues

leitian commented 3 years ago

I feel the design is quite comprehensive and it looks very good to me. I have a few questions and suggestions:

(1) The "Diff View" section gives a diagram of the timeline views of two example runs. Is there a plan to highlight the modules/operators on the diagram that differ significantly between the baseline run and the experimental run?

(2) I didn't find the details of how to show the difference of kernel execution in the "Diff View" section, especially for the timeline view. I feel it will be great to have a way to locate and highlight the differences of kernel executions of two runs, since in many cases ML training and inference performance is mainly determined by how kernels are executed on GPUs.

(3) On the "Execution Comparison" chart, what do "Baseline Trend" and "Experimental Trend" mean? How do you expect users to use this feature?

(4) For the "Operator/Kernel View" table, can we sort the operators/kernels by their differences between the baseline and experimental runs, and put the ones that have big deltas on the top of the table?

guotuofeng commented 3 years ago

I feel the design is quite comprehensive and it looks very good to me. I have a few questions and suggestions: (1) The "Diff View" section gives a diagram of the timeline views of two example runs. Is there a plan to highlight the modules/operators on the diagram, which have significant differences comparing the baseline run and the experimental run?

In the diff view, we only show the timeline view at the module/submodule level. We don't show differences at the operator level because there is a huge number of operators to compare, which would make the diff view much less useful for seeing the big picture.

(2) I didn't find the details of how to show the difference of kernel execution in the "Diff View" section, especially for the timeline view. I feel it will be great to have a way to locate and highlight the differences of kernel executions of two runs, since in many cases ML training and inference performance is mainly determined by how kernels are executed on GPUs.

For the kernel view, the comparison is shown in table format, like the operator view. There is no plan to show kernel differences in the timeline view, since operator differences are not supported there either.

(3) On the "Execution Comparison" chart, what do "Baseline Trend" and "Experimental Trend" mean? How do you expect users to use this feature?

The two trend lines show the accumulated execution time at each step. Using these lines, users can easily find the most time-consuming part (the one with the maximum slope in the trend line).

(4) For the "Operator/Kernel View" table, can we sort the operators/kernels by their differences between the baseline and experimental runs, and put the ones that have big deltas on the top of the table?

Yes, each column in the operator/kernel view is sortable.
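The trend lines from the answer to (3) are running totals, so they can be sketched with `itertools.accumulate`. This is an illustration with made-up part names and durations, and it assumes the chart's x-axis is the sequence of aligned parts. Since the points are evenly spaced, the steepest segment of a trend line is simply the part with the largest duration.

```python
from itertools import accumulate


def trend_line(durations):
    """Accumulated execution time after each aligned part."""
    return list(accumulate(durations))


def steepest_part(names, durations):
    """Name of the part where the trend line rises the most."""
    index = max(range(len(durations)), key=durations.__getitem__)
    return names[index]


# Hypothetical aligned parts and their durations in each run.
part_names = ["conv1", "fc1", "fc2"]
baseline_durations = [3.0, 2.0, 1.0]
experimental_durations = [3.5, 1.5, 1.2]

baseline_trend = trend_line(baseline_durations)
experimental_trend = trend_line(experimental_durations)
print("Baseline Trend:", baseline_trend)
print("Experimental Trend:", experimental_trend)
print("Steepest baseline part:", steepest_part(part_names, baseline_durations))
```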

leitian commented 3 years ago

Thank you for your explanations!

Sorry, I didn't make myself clear about item (4): my suggestion is to let "Operator/Kernel View" put the ones that have big deltas on the top of the table by default.

guotuofeng commented 3 years ago

Do you mean that we pin the maximum deltas on top of each column?

leitian commented 3 years ago

Yes.

guotuofeng commented 3 years ago

So, we can sort by the delta column in descending order by default, so that the largest deltas appear on top.
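As a rough sketch of that default ordering (the row shape and the choice of `delta_self_duration` as the default column are assumptions, not something decided in this thread):

```python
def sort_by_delta(rows, column="delta_self_duration", descending=True):
    """Return rows ordered by the chosen delta column, largest first by default."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


# Delta Self Duration values copied from the operator table.
rows = [
    {"operator": "aten::empty", "delta_self_duration": 2},
    {"operator": "aten::zeros", "delta_self_duration": 28},
    {"operator": "aten::zero_", "delta_self_duration": 6},
    {"operator": "aten::view", "delta_self_duration": 38},
]

default_order = [row["operator"] for row in sort_by_delta(rows)]
print(default_order)
```

With a plain descending sort, large negative deltas end up at the bottom; sorting on `abs(row[column])` would be an alternative if regressions and improvements should both surface first.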

chaekit commented 3 years ago

In diff view, we need split run into comparable pieces, during which each piece is align in logical timeline. For example,

I am curious how exactly you will extract the comparable pieces. My suggestion is to use markers (something we don't have yet). You can log marker start/end in your workload, and all the events in between will be analyzed. os_signpost is an example of this on iOS: https://developer.apple.com/documentation/os/3019241-os_signpost.
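To make the marker idea concrete, here is a minimal pure-Python sketch. It is not an existing kineto or PyTorch API: `marker`, `marker_log`, and `events_in_marker` are hypothetical names, and events are assumed to be `(name, timestamp)` pairs on the same clock as the markers.

```python
import time
from contextlib import contextmanager

marker_log = []  # collected (marker_name, start, end) entries


@contextmanager
def marker(name, clock=time.perf_counter):
    """Log the start and end time of a named region of the workload."""
    start = clock()
    try:
        yield
    finally:
        marker_log.append((name, start, clock()))


def events_in_marker(events, marker_entry):
    """Select the (event_name, timestamp) pairs that fall inside a marker."""
    _, start, end = marker_entry
    return [event for event in events if start <= event[1] <= end]


# A fake clock keeps the example deterministic: marker spans 10.0 to 20.0.
ticks = iter([10.0, 20.0])
with marker("forward", clock=lambda: next(ticks)):
    pass  # the workload to analyze would run here

events = [("op_a", 5.0), ("op_b", 12.0), ("op_c", 25.0)]
selected = events_in_marker(events, marker_log[0])
print(selected)
```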

guotuofeng commented 3 years ago

We will add module-level traces and use the module name for comparison. We plan to support comparison of modules only. Operators are not compared because their huge number would make users lose focus.

skyline75489 commented 2 years ago

A sneak peek of this feature, implemented in #369

image

profPlum commented 2 months ago

@skyline75489 Hey, what does "execution diff" mean? Also what units is it in? I really think we should have axis labels and/or tool tip.