guotuofeng commented 3 years ago

Profiler differentiate runs/traces feature request

Scenarios

Goals

Here are some typical scenarios in which data scientists tweak a baseline model and then want to compare the new run against the baseline to see whether anything changed noticeably.

1. Data scientists would like to compare model performance after changing the dimensions of a module, e.g. changing an nn.Linear from 128x100 to 128000x100000.
2. Data scientists would like to compare model performance between apex fp32 and fp16.
3. Data scientists would like to compare model performance between ResNet 50 and ResNet 152.
4. Data scientists would like to compare model performance after changing a loss function, e.g. replacing a hand-written softmax/cross-entropy with a built-in PyTorch loss function.
5. Data scientists would like to compare model performance after switching to optimized operators in PyTorch, e.g. when several operators are fused into a single optimized one.
6. Data scientists have two different nn.Module implementations of the same model and would like to see which one performs better during training.

Non-Goals

Unusual cases such as comparing a ResNet model with a BERT model. In such cases, the plugin only shows the top-level diff view.

Design

The design covers the six major scenarios listed in the section above. The plugin UI will have two modes: normal mode and diff (comparison) mode. In diff mode, the UI will look like the following, allowing the user to select the runs to compare.

After users select both baseline and experimental runs and click the diff button, the diff UI will be loaded.

Overview

The overview UI will show summary information such as device, device memory, GPU utilization, memory usage, and step time.

Category	Device	Device Memory	GPU Utilization	Avg Memory Usage	Peak Memory Usage	Avg Step Time (ms)
Baseline	Tesla V100-DGXS-32GB	31.74 GB	30%	1.50 GB	5.00 GB	9.3
Experimental	Tesla V100-DGXS-32GB	31.74 GB	17%	1.70 GB	4.90 GB	11.2
Delta	N/A	N/A	-13%	200.00 MB	-100.00 MB	1.9
Delta(%)	N/A	N/A	-43.33%	13.33%	-2.00%	20.43%
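
As a rough illustration of how the Delta and Delta(%) rows above can be derived, here is a minimal Python sketch. The helper name `delta` and the dictionary layout are assumptions made for this example; they are not part of the plugin.

```python
from typing import Tuple


def delta(baseline: float, experimental: float) -> Tuple[float, float]:
    """Return (absolute delta, delta in percent of the baseline)."""
    diff = experimental - baseline
    # A zero baseline has no meaningful relative change; report NaN in that case.
    pct = diff / baseline * 100.0 if baseline != 0 else float("nan")
    return diff, pct


# (baseline, experimental) values copied from the example table above.
rows = {
    "GPU utilization (%)": (30.0, 17.0),
    "Avg memory usage (GB)": (1.50, 1.70),
    "Peak memory usage (GB)": (5.00, 4.90),
    "Avg step time (ms)": (9.3, 11.2),
}

for name, (base, exp) in rows.items():
    diff, pct = delta(base, exp)
    print(f"{name}: delta={diff:+.2f}, delta%={pct:+.2f}%")
```

Both rows are computed as experimental minus baseline, so a negative value means the experimental run is lower than the baseline.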

Diff View

In the diff view, we need to split each run into comparable pieces and align those pieces on a logical timeline. For example,

We can split the two example runs as shown in the diagrams above. Parts that are missing from one of the runs are left out of the comparison. Note: functional.relu is shown for illustration purposes only; it is possible that all functional calls will be missing from the trace.

After the two execution timelines are logically aligned, we can compare the absolute execution time of each logical part and obtain the following chart. Execution time is measured along the critical path, which in most cases means CPU time for CPU tasks and GPU time for GPU tasks.

For each part, we can get the following difference line in execution order.

Users can zoom into a specific aligned part by clicking it (exit the zoom by clicking a blank region?). For example, the top module's forward can be zoomed into its submodule view, recursively. When the user selects a block, e.g. the top module.forward, a detailed comparison view for the selected block will be shown. If there are gaps between aligned blocks (e.g. untraced code such as functional calls, or pure CPU code such as time.sleep), a blank block with a name like “unknown” should be inserted, meaning that the time does not belong to any module.

Note: for simplicity, we only show the diff view at the nn.Module level rather than at the underlying operator level, because the enormous number of operators would distract users. The diff view covers scenarios 1, 3, 4, and 6.
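
The “unknown” gap handling described above could look roughly like the following Python sketch. `Block`, `fill_gaps`, and `min_gap` are hypothetical names introduced for illustration, and the sketch assumes that sibling blocks on the same level do not overlap.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    name: str
    start: float  # start timestamp, e.g. in microseconds
    end: float    # end timestamp, same unit as start


def fill_gaps(blocks: List[Block], min_gap: float = 0.0) -> List[Block]:
    """Insert an "unknown" block wherever consecutive blocks leave a gap larger than min_gap."""
    result: List[Block] = []
    for block in sorted(blocks, key=lambda b: b.start):
        if result and block.start - result[-1].end > min_gap:
            # Time between two traced modules that belongs to no module.
            result.append(Block("unknown", result[-1].end, block.start))
        result.append(block)
    return result


timeline = [
    Block("conv1.forward", 0.0, 40.0),
    Block("fc.forward", 55.0, 90.0),  # 15 units of untraced time precede this block
]
filled = fill_gaps(timeline)
for block in filled:
    print(block)
```

A non-zero `min_gap` would let the UI ignore tiny gaps that are only measurement noise; whether that is desirable is left open here.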

Operator/Kernel view

The operator/kernel view will show a summary of operators/kernels for the baseline and experimental runs. Each column is sortable and filterable. If the user selects specific blocks, only the related stats will be shown.

Operator	Baseline Calls	Exp Calls	Delta Calls	Delta Calls%	Baseline Self Duration	Exp Self Duration	Delta Self Duration	Delta Self Duration %
aten::empty	100	150	50	50.00%	138	140	2	1.45%
aten::zeros	120	141	21	17.50%	72	100	28	38.89%
aten::zero_	411	531	120	29.20%	53	59	6	11.32%
aten::view	31	14	-17	-54.84%	46	84	38	82.61%
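
For illustration, the delta columns of such a table could be computed along these lines. This is only a sketch: `OpStats`, `OpDiff`, `percent`, and `diff_operators` are hypothetical names, and treating an operator that is missing from one run as having zero calls and zero duration is an assumption, not something the design specifies.

```python
from typing import Dict, List, NamedTuple, Optional


class OpStats(NamedTuple):
    calls: int
    self_duration: float  # same time unit in both runs


class OpDiff(NamedTuple):
    name: str
    delta_calls: int
    delta_calls_pct: Optional[float]
    delta_self_duration: float
    delta_self_duration_pct: Optional[float]


def percent(delta: float, base: float) -> Optional[float]:
    """Relative change in percent, or None when the baseline value is zero."""
    return None if base == 0 else delta / base * 100.0


def diff_operators(baseline: Dict[str, OpStats],
                   experimental: Dict[str, OpStats]) -> List[OpDiff]:
    """Compute one diff row per operator that appears in either run."""
    missing = OpStats(calls=0, self_duration=0.0)  # operator absent from a run
    rows = []
    for name in sorted(set(baseline) | set(experimental)):
        base = baseline.get(name, missing)
        exp = experimental.get(name, missing)
        d_calls = exp.calls - base.calls
        d_dur = exp.self_duration - base.self_duration
        rows.append(OpDiff(name, d_calls, percent(d_calls, base.calls),
                           d_dur, percent(d_dur, base.self_duration)))
    return rows


# Numbers taken from two rows of the example table above.
baseline = {"aten::empty": OpStats(100, 138), "aten::view": OpStats(31, 46)}
experimental = {"aten::empty": OpStats(150, 140), "aten::view": OpStats(14, 84)}
for row in diff_operators(baseline, experimental):
    print(row)
```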

The operator view can be extended with the following columns, each of which will have four sub-columns:

device self-duration
device total duration
host self-duration
host total duration

The kernel view follows the same pattern. Scenarios 2 and 5 can be covered by the operator/kernel view.

Work Items

The following changes or requirements are needed for the diff view features to align the logical timeline.

Extend torch.profiler.record_function to support customized metadata. (torch/csrc/autograd/record_function_ops.cpp::record_function_enter).
Capture module parameters and sizes by leveraging record_function.
Add a top-level module trace using a global hook, as in PR 55354.
Trace each module in the above hook. Another approach is to trace every module in nn.Module._call_impl. For this, we need to add trace_module to nn.Module, which will be set through torch.profiler.profile. nn.Module._call_impl then calls record_function for the current nn.Module.
Add an nn.Module name for every module in the graph when it is added via nn.Module.add_module or nn.Module.__setattr__.

Open Issues

The algorithm for aligning the logical timeline has not been decided yet. The simplest way is to use each module's hierarchical name to match modules across the two runs/traces.
How to exit the zoom? By clicking on blank region or something else?
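
As one possible starting point for the first open issue, the name-based matching could be sketched with Python's standard difflib.SequenceMatcher. This is a swapped-in generic sequence-alignment technique, not a decided design; `align_by_name` and the module names below are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_by_name(baseline: List[str], experimental: List[str]) -> List[Pair]:
    """Align two module-name sequences (in execution order).

    Matching names are paired; a name present in only one run is paired with None.
    """
    pairs: List[Pair] = []
    matcher = SequenceMatcher(None, baseline, experimental, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(baseline[i1:i2], experimental[j1:j2]))
        else:
            # "delete", "insert" or "replace": keep both sides, but unmatched.
            pairs.extend((name, None) for name in baseline[i1:i2])
            pairs.extend((None, name) for name in experimental[j1:j2])
    return pairs


base_run = ["model.conv1", "model.relu", "model.fc"]
exp_run = ["model.conv1", "model.fc"]  # the relu module is missing in this run
aligned = align_by_name(base_run, exp_run)
for pair in aligned:
    print(pair)
```

Pairs with None on one side correspond to the missing parts that are left out of the comparison.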

leitian commented 3 years ago

I feel the design is quite comprehensive and it looks very good to me. I have a few questions and suggestions:

(1) The "Diff View" section gives a diagram of the timeline views of two example runs. Is there a plan to highlight the modules/operators on the diagram that differ significantly between the baseline run and the experimental run?

(2) I didn't find the details of how to show the difference of kernel execution in the "Diff View" section, especially for the timeline view. I feel it would be great to have a way to locate and highlight the differences in kernel execution between two runs, since in many cases ML training and inference performance is mainly determined by how kernels are executed on GPUs.

(3) On the "Execution Comparison" chart, what do "Baseline Trend" and "Experimental Trend" mean? How do you expect users to use this feature?

(4) For the "Operator/Kernel View" table, can we sort the operators/kernels by their differences between the baseline and experimental runs, and put the ones that have big deltas at the top of the table?

guotuofeng commented 3 years ago

I feel the design is quite comprehensive and it looks very good to me. I have a few questions and suggestions: (1) The "Diff View" section gives a diagram of the timeline views of two example runs. Is there a plan to highlight the modules/operators on the diagram, which have significant differences comparing the baseline run and the experimental run?

In the diff view, we only show the timeline at the module/submodule level. We don't show differences at the operator level because there is a huge number of operators to compare, which would make the diff view much less useful as a big-picture view.

(2) I didn't find the details of how to show the difference of kernel execution in the "Diff View" section, especially for the timeline view. I feel it will be great to have a way to locate and highlight the differences of kernel executions of two runs, since in many cases ML training and inference performance is mainly determined by how kernels are executed on GPUs.

For the kernel view, the comparison is shown in table format, like the operator view. There is no plan to show kernel differences in the timeline view, since the timeline view is not supported for operators either.

(3) On the "Execution Comparison" chart, what do "Baseline Trend" and "Experimental Trend" mean? How do you expect users to use this feature?

The two trend lines show the accumulated execution time at each step. Using these lines, users can easily find the most time-consuming part (the one with the maximum slope in the trend line).

(4) For the "Operator/Kernel View" table, can we sort the operators/kernels by their differences between the baseline and experimental runs, and put the ones that has big deltas on the top of the table?

Yes, each column in the operator/kernel view is sortable.
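
To make the trend lines from the answer to (3) concrete, here is a small Python sketch of the idea. The block names and durations are made up, and `trend` and `steepest_block` are hypothetical helpers rather than plugin functions.

```python
from itertools import accumulate
from typing import List, Sequence


def trend(durations: Sequence[float]) -> List[float]:
    """Accumulated execution time after each block, in execution order."""
    return list(accumulate(durations))


def steepest_block(names: Sequence[str], durations: Sequence[float]) -> str:
    """Block where the trend line rises most, i.e. the block with the largest duration."""
    index = max(range(len(durations)), key=lambda i: durations[i])
    return names[index]


names = ["conv1.forward", "bn1.forward", "fc.forward"]
baseline_ms = [40.0, 5.0, 35.0]      # made-up durations
experimental_ms = [42.0, 5.0, 60.0]  # made-up durations

baseline_trend = trend(baseline_ms)
experimental_trend = trend(experimental_ms)
print(baseline_trend, steepest_block(names, baseline_ms))
print(experimental_trend, steepest_block(names, experimental_ms))
```

With one point per block on the x-axis, the slope between two consecutive points equals the duration of that block, so the steepest segment marks the most time-consuming part.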

leitian commented 3 years ago

Thank you for your explanations!

Sorry, I didn't make myself clear about item (4): my suggestion is to let the "Operator/Kernel View" put the ones that have big deltas at the top of the table by default.

guotuofeng commented 3 years ago

Do you mean that we pin the maximum deltas on top of each column?

leitian commented 3 years ago

Yes.

guotuofeng commented 3 years ago

So we can sort by the delta column in descending order by default, which puts the largest deltas on top.
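
A default sort like this could be as simple as the following Python sketch. The row layout and the key name `delta_self_duration` are assumptions for illustration; the delta values are taken from the example operator table in the RFC.

```python
# One dict per operator row; only the columns needed for sorting are shown.
rows = [
    {"operator": "aten::empty", "delta_self_duration": 2},
    {"operator": "aten::zeros", "delta_self_duration": 28},
    {"operator": "aten::zero_", "delta_self_duration": 6},
    {"operator": "aten::view", "delta_self_duration": 38},
]

# Default ordering: largest delta first.
rows_sorted = sorted(rows, key=lambda r: r["delta_self_duration"], reverse=True)

for row in rows_sorted:
    print(row["operator"], row["delta_self_duration"])
```

Sorting by the signed delta puts the largest slowdowns on top. Sorting by `abs(delta)` instead would also surface large improvements; which of the two is the better default is not settled in this thread.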

chaekit commented 3 years ago

In diff view, we need split run into comparable pieces, during which each piece is align in logical timeline. For example,

I am curious how exactly you will extract the comparable pieces. My suggestion is to use markers (something we don't have yet): you log a marker start/end in your workload, and all the events in between are analyzed. os_signpost is an example of this on iOS: https://developer.apple.com/documentation/os/3019241-os_signpost.

guotuofeng commented 3 years ago

We will add module-level tracing and use the module name for comparison. We plan to support comparison of modules only. Operators are not compared because their sheer number would make users lose focus.

skyline75489 commented 2 years ago

A sneak peek of this feature, implemented in #369

profPlum commented 2 months ago

@skyline75489 Hey, what does "execution diff" mean? Also what units is it in? I really think we should have axis labels and/or tool tip.

pytorch / kineto

[RFC] Profiler differentiate runs/traces feature request #342
