
PyTorch Profiler — PyTorch Tutorials 2.2.1+cu121 documentation #690

Open irthomasthomas opened 5 months ago

irthomasthomas commented 5 months ago

PyTorch Profiler — PyTorch Tutorials 2.2.1+cu121 documentation

DESCRIPTION:
PyTorch Profiler

This recipe explains how to use the PyTorch profiler to measure the time and memory consumption of a model's operators.

Introduction

PyTorch includes a simple profiler API that is useful when a user needs to determine the most expensive operators in a model.

In this recipe, we will use a simple Resnet model to demonstrate how to use the profiler to analyze model performance.

Setup

To install torch and torchvision use the following command:

pip install torch torchvision

Steps

  1. Import all necessary libraries

In this recipe we will use the torch, torchvision.models, and torch.profiler modules:

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

  2. Instantiate a simple Resnet model

Let’s create an instance of a Resnet model and prepare an input for it:

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
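
One hedged aside that is not part of the original recipe: when profiling inference only, it is common to switch the model to evaluation mode and wrap calls in torch.no_grad() so that dropout, batch-norm updates, and autograd bookkeeping do not skew the measurements:

model.eval()           # use running batch-norm statistics and disable dropout
with torch.no_grad():  # skip gradient tracking for pure inference timing
    model(inputs)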

  3. Using profiler to analyze execution time

PyTorch profiler is enabled through the context manager and accepts a number of parameters; some of the most useful are:

- activities - a list of activities to profile (ProfilerActivity.CPU covers PyTorch operators and user-defined code labels; ProfilerActivity.CUDA covers on-device CUDA kernels);
- record_shapes - whether to record the shapes of operator inputs;
- profile_memory - whether to report the amount of memory consumed by the model's tensors.

Note: when using CUDA, the profiler also shows the runtime CUDA events occurring on the host.

Let's see how we can use the profiler to analyze the execution time:

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(inputs)

Note that we can use the record_function context manager to label arbitrary code ranges with user-provided names (model_inference is used as a label in the example above).
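
For instance, a minimal sketch of labeling two separate ranges in one profile (the "preprocess" label and the doubling step are illustrative, not from the recipe):

with profile(activities=[ProfilerActivity.CPU]) as prof_labeled:
    with record_function("preprocess"):       # user-chosen label for input preparation
        batch = inputs * 2.0
    with record_function("model_inference"):  # user-chosen label for the forward pass
        model(batch)

Each labeled range then appears as its own row in prof_labeled.key_averages().table().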

The profiler allows one to check which operators were called during the execution of a code range wrapped with a profiler context manager. If multiple profiler ranges are active at the same time (e.g. in parallel PyTorch threads), each profiling context manager tracks only the operators of its corresponding range. The profiler also automatically profiles asynchronous tasks launched with torch.jit._fork and, in the case of a backward pass, the operators launched with the backward() call.
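
As a side sketch, separate from the model_inference run above (the scalar loss here is illustrative, not from the recipe), the backward-pass operators land in the same profile as the forward ones:

with profile(activities=[ProfilerActivity.CPU]) as prof_bwd:
    out = model(inputs)
    out.sum().backward()  # backward operators are captured automatically

print(prof_bwd.key_averages().table(sort_by="cpu_time_total", row_limit=10))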

Returning to the model_inference run, let's print out its stats:

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

The output will look like (omitting some columns):

# ---------------------------------  ------------  ------------  ------------  ------------
#                              Name      Self CPU     CPU total  CPU time avg    # of Calls
# ---------------------------------  ------------  ------------  ------------  ------------
#                   model_inference       5.509ms      57.503ms      57.503ms             1
#                      aten::conv2d     231.000us      31.931ms       1.597ms            20
#                 aten::convolution     250.000us      31.700ms       1.585ms            20
#                aten::_convolution     336.000us      31.450ms       1.573ms            20
#          aten::mkldnn_convolution      30.838ms      31.114ms       1.556ms            20
#                  aten::batch_norm     211.000us      14.693ms     734.650us            20
#      aten::_batch_norm_impl_index     319.000us      14.482ms     724.100us            20
#           aten::native_batch_norm       9.229ms      14.109ms     705.450us            20
#                        aten::mean     332.000us       2.631ms     125.286us            21
#                      aten::select       1.668ms       2.292ms       8.988us           255
# ---------------------------------  ------------  ------------  ------------  ------------
# Self CPU time total: 57.549ms

Here we see that, as expected, most of the time is spent in convolution (specifically in mkldnn_convolution for PyTorch compiled with MKL-DNN support). Note the difference between self CPU time and CPU time: operators can call other operators, and self CPU time excludes time spent in child operator calls, while total CPU time includes it. You can choose to sort by self CPU time by passing sort_by="self_cpu_time_total" into the table call.
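
For example:

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))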

To get a finer granularity of results and include operator input shapes, pass group_by_input_shape=True (note: this requires running the profiler with record_shapes=True):

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))
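
A sketch of the CUDA case from the note above, assuming a CUDA-capable GPU is available (nothing else in this recipe requires one): adding ProfilerActivity.CUDA records on-device kernel times alongside the CPU events, and the table can then be sorted by them:

if torch.cuda.is_available():
    model_cuda = models.resnet18().cuda()              # fresh model on the GPU
    inputs_cuda = torch.randn(5, 3, 224, 224).cuda()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof_cuda:
        model_cuda(inputs_cuda)
    print(prof_cuda.key_averages().table(sort_by="cuda_time_total", row_limit=10))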

URL: PyTorch Profiler Recipe

Suggested labels

irthomasthomas commented 5 months ago

Related content

#690 - Similarity score: 1.0

#649 - Similarity score: 0.9

#498 - Similarity score: 0.88

#324 - Similarity score: 0.88

#625 - Similarity score: 0.88

#499 - Similarity score: 0.88