torch.utils.benchmark just runs the code as-is and measures the end-to-end latency, while the profiler collects additional internal performance metrics/counters/events. The more information the profiler collects, the higher the overhead it adds.
Could you turn off with_stack and record_shapes and give it another try? with_stack usually introduces some overhead, and it is sometimes non-trivial for small workloads.
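For reference, a minimal profiling setup with both options turned off might look like the sketch below (not code from this thread; run_model, model, and data are assumed to be defined as in the toy examples later in the thread, and the schedule values are illustrative only):

import torch

# Minimal sketch: run the profiler with stack and shape recording disabled to
# reduce instrumentation overhead.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=1, active=5),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/no_stack'),
    record_shapes=False,  # don't record input shapes
    with_stack=False,     # don't record call stacks
) as prof:
    for _ in range(1 + 5):
        run_model(model, data)
        prof.step()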
So would you say that this problem is a result of the profiling overhead? I tried turning off with_stack and record_shapes, which helps with the toy example but not with another model that has additional operators in between the linear layers (the behavior is still consistent with the original toy example). Is this a limitation of torch.profiler overhead on certain models?
Unrelated, but I have been noticing a different issue where the first one or two active trials of the same code have timings that are inconsistent with later trials. The toy example without stack and input shape recording looks like this:
As in the original trace of the toy example, sometimes the first active trial also has very little CPU Exec. Why are the first active trials so different from the others? I always do one warmup run; do I need to do more than one?
Glad to know disabling with_stack helps the toy example. Do you mind sharing the other model so we can dig into it?
Unrelated, but I have been noticing a different issue where the first one or two active trials of the same code have timings that are inconsistent with later trials. The toy example without stack and input shape recording looks like this:
As in the original trace of the toy example, sometimes the first active trial also has very little CPU Exec. Why are the first active trials so different from the others? I always do one warmup run; do I need to do more than one?
I can't reproduce it. I tried your toy code on a V100, and all steps' times are nearly the same. What GPU card are you using? Would you please also share a screenshot of the trace view with all steps?
You can also try more warmup steps and see whether that changes anything.
BTW, the current step time calculation logic is here. It's described below, and it's a little complex (a small Python sketch of the same logic follows the description):
Mark the beginning of each step "ProfilerStep#" as steps_cpu[i_step][0]; mark the end of each step "ProfilerStep#" as steps_cpu[i_step][1].
If run without a GPU, each step interval is [steps_cpu[i_step][0], steps_cpu[i_step][1]].
If run on a GPU:
1. For each step, gather all GPU activities (kernel/memcpy/memset) launched by this step. Mark the earliest start time as steps_device[i_step][0] and the latest end time as steps_device[i_step][1].
2. Loop over the steps from smallest id to largest id:
Mark each step's end time steps[i_step][1] as max(steps_cpu[i_step][1], steps_device[i_step][1]).
If i_step > 0:
Mark each step's start time steps[i_step][0] as max(steps[i_step-1][1], steps_cpu[i_step][0]).
else: # i_step == 0
steps[i_step][0] = max(prev_step_end_time, steps_cpu[i_step][0]), where prev_step_end_time is defined as max(latest end time of kernel/memcpy/memset activities launched before the first step, steps_cpu[i_step][0]).
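A small Python sketch of that logic (an illustration of the description above, not the actual plugin source; steps_cpu and steps_device are assumed to be lists of (start, end) timestamps per step):

def compute_step_intervals(steps_cpu, steps_device=None, prev_step_end_time=None):
    # CPU-only run: each step interval is just its ProfilerStep# span.
    if steps_device is None:
        return list(steps_cpu)
    steps = [None] * len(steps_cpu)
    for i in range(len(steps_cpu)):
        # End: the later of the CPU-side step end and the latest GPU activity end.
        end = max(steps_cpu[i][1], steps_device[i][1])
        if i > 0:
            # Start: no earlier than the previous step's adjusted end.
            start = max(steps[i - 1][1], steps_cpu[i][0])
        else:
            # First step: prev_step_end_time is the latest end of kernels/memcpys/
            # memsets launched before the first step, clamped to the CPU start.
            start = max(prev_step_end_time, steps_cpu[i][0])
        steps[i] = (start, end)
    return steps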
Sure, I ran the two versions of the toy code again and confirmed that they're reproducible for me with a Tesla T4. I'm running this on an EC2 g4dn.4xlarge instance: https://aws.amazon.com/ec2/instance-types/g4/ (scroll down to the chart for details).
To be clear, I'm using PyTorch 1.9.0 and Kineto built from plugin/0.2 to fix https://github.com/pytorch/kineto/issues/299. I don't always see this behavior with every model / run, but it's not uncommon.
Toy example with stack and input shape recording:
Toy example without stack and input shape recording:
Glad to know disabling with_stack helps the toy example. Do you mind sharing the other model so we can dig into it?
Unfortunately, I can't share the other model because it's confidential, so I'm trying to replicate its profiling issues with these toy examples. Here is another toy example that exhibits the original issue again. This time, I added a GRU layer in between each linear layer. The dimensions and data are once again arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models
import torch.utils.benchmark as benchmark

class ProblematicLayers(nn.Module):
    def __init__(self, in_features, nb_classes):
        super(ProblematicLayers, self).__init__()
        self.fc1 = nn.Linear(in_features, nb_classes)
        self.gru = nn.GRU(input_size=in_features, hidden_size=in_features)
        self.h_0 = torch.zeros(1, 32, in_features).cuda()

    def forward(self, x):
        for i in range(512):
            out = self.fc1(x)
            input = x.unsqueeze(0)           # [1 x batch_size x in_features]
            input = input.expand(1, -1, -1)  # [length x batch_size x in_features]
            output, h_n = self.gru(input, self.h_0)
        return F.softmax(out, dim=1)

device = torch.device("cuda:0")
model = torch.nn.Sequential(
    torchvision.models.resnet18(pretrained=True),
    ProblematicLayers(1000, 1000),
)
model = model.to(device)
data = torch.zeros((32, 3, 224, 224), device=device)

def run_model(model, data):
    outputs = model(data)

num_threads = torch.get_num_threads()
t = benchmark.Timer(
    stmt='run_model(model, data)',
    setup='from __main__ import run_model',
    globals={'model': model, 'data': data},
    num_threads=num_threads,
    label="Average Inference Duration",
)
print(t.timeit(100))

model.eval()
with torch.no_grad():
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=1, active=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/minimal_issue'),
        # record_shapes=True,
        # with_stack=True
    ) as prof:
        for iter in range(1 + 5):
            run_model(model, data)
            prof.step()  # Need to call this at the end of each step to notify the profiler of the step boundary.
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb414411828>
Average Inference Duration
setup: from __main__ import run_model
163.65 ms
1 measurement, 100 runs , 8 threads
@minhoolee I found that the "Est. Achieved Occupancy" is a negative number. However, I can't reproduce this bug on my side. Would you please share your traced JSON file with me? Thanks a lot! My email is tegao@microsoft.com
@minhoolee Thanks for your valuable feedback! Let's analyze this interesting phenomenon of why the profiler's step time is slower than the time measured with benchmark.timeit. I found it is due to the overhead of the profiler's recording before and after each op. The prerequisite for understanding this problem is knowing that PyTorch ops on the CPU side and kernels on the GPU side execute asynchronously. That is, a CPU-side op launches a kernel, and the kernel is queued to the GPU. Whether that kernel is scheduled to run on the GPU or is waiting for a previous kernel to finish, the CPU side immediately goes on executing the next ops and launching the following kernels.
When we set the batch size as small as 32, most kernels finish quickly on the GPU, and the dominant bottleneck is the CPU side's launching of ops, so the overhead of instrumenting all ops is clearly visible. When we increase the batch size to 256, most kernels' execution time on the GPU is longer than the duration of the CPU-side op that launched them, so GPU-side kernel time becomes the dominant bottleneck, and the overhead of instrumenting all ops is "hidden behind" the overlapped kernel execution.
My experiment reusing your two copies of the toy code:
Let's analyze it further by checking the trace, taking the first toy model as an example. When the batch size is 32, we can see below that most red lines are "vertical", which means the kernels execute fast enough that they don't overlap with the following CPU-side ops. When the batch size is 256, we can see below that most red lines are "angled", which means the kernels execute longer than the CPU-side ops that launched them and overlap with the following CPU-side ops.
When the batch size is 32, we can see below that GPU utilization is low most of the time, around 10%~20%. When the batch size is 256, we can see below that GPU utilization is high most of the time, around 99%.
In the batch-32 case, when we zoom in, we can see the kernel duration is so "short" that it is completely hidden inside its launching CPU op. This makes the CPU-side op durations dominate the whole program's time cost, which makes the profiler's CPU-side overhead clearly visible. In the batch-256 case, when we zoom in, we can see the kernel duration is "long" enough that its execution is delayed until after its launching op and it overlaps with the following CPU-side ops.
In a word, if the GPU-side kernel duration is longer than the CPU-side op duration (by increasing the batch size, using ops with "fat" kernel workloads, or using a less powerful GPU card), the step time will be dominated by kernel time, and the CPU-side op time will be "hidden inside" the overlapped kernel time. The overhead brought by profiling will then also be hidden along with the CPU-side op time.
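A rough way to check which side dominates for a given model and batch size is sketched below (a hypothetical check, not part of the discussion above): time the launch loop on the CPU and the total GPU span with CUDA events, and compare.

import time
import torch

def launch_vs_gpu_time(fn, iters=100):
    # fn() should enqueue the GPU work, e.g. lambda: model(data).
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start_evt.record()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()  # asynchronous: only queues kernels and returns immediately
    cpu_launch_ms = (time.perf_counter() - t0) * 1e3
    end_evt.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    gpu_ms = start_evt.elapsed_time(end_evt)
    # If the per-iteration CPU launch time is close to the per-iteration GPU time,
    # the workload is launch-bound and profiler overhead shows up directly in the
    # step time; if the GPU time is much larger, the overhead is mostly hidden.
    return cpu_launch_ms / iters, gpu_ms / iters

Running something like launch_vs_gpu_time(lambda: model(data)) for the toy model at batch size 32 and again at 256 should show the crossover described above.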
@gaoteng-git Thanks for the in-depth analysis! I suspected that profiling overhead was throttling the GPU, and it's great to see that there is a workaround by increasing the batch size. I sent you the two traces for the timing inconsistency between steps.
@minhoolee Yes, profiling overhead is unavoidable for most profilers, especially instrumentation-based ones. Thank you very much for your trace files! We have identified a bug from them.
@gdankel Hi, Gisle. I found a bug in the tracing JSON files minhoolee provided. I suspect it's due to a failure of getSMCount(): it may return "-1", in which case the "blocks per SM" becomes -125. Here is the tracing file link. But the device info dumped at the head of the file is correct. I think maybe the device info obtained from CUDA is more reliable than that obtained from CUPTI for some devices? Should we replace it with the SM count from the CUDA device info?
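As a side note (a sanity check, not part of the proposed fix), the SM count the CUDA runtime reports can be read from Python and compared against the numSms value in the trace's deviceProperties:

import torch

# The CUDA runtime's view of the device; multi_processor_count is the SM count
# that the trace's deviceProperties "numSms" field should match (a Tesla T4
# reports 40).
props = torch.cuda.get_device_properties(0)
print(props.name, props.multi_processor_count)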
@gdankel, it seems that getSMCount returns -1 when there is no CUPTI?
https://github.com/pytorch/kineto/blob/master/libkineto/src/CuptiActivityInterface.cpp#L93
I see, that would make sense. I'll look into it tomorrow am.
Reducing profiler overhead is a focus area for us and also, I believe, for Nvidia. CUDA 11.3 is meant to improve it on their side, and we have ongoing work to improve the PyTorch + Kineto side.
@gaoteng-git, @guotuofeng It may just be a matter of moving the code outside of the HAS_CUPTI guard, since it doesn't actually depend on CUPTI.
Sorry to interject, does the trace indicate that CUPTI is not installed? Is there a way to manually test this?
@minhoolee I don't think you can get those traces with the legacy profiler fallback, so I am assuming that cupti is enabled. Looking into this.
The puzzling thing here is that numSms in the deviceProperties entry for device 0 at the top of the file gets its data from the same CUDA API as we're using for the per-kernel stats. And as we can see from the kernel, the kernel's device id is 0. Either I'm missing something basic or there is some more subtle aspect to this.
Unless this is on a version that still uses the old code that @guotuofeng pointed to above: https://github.com/pytorch/kineto/blob/master/libkineto/src/CuptiActivityInterface.cpp#L66
In which case we can test by upgrading to the latest version. Maybe we need to do a submodule update?
The use of CUPTI for the SM count was removed in commit 39681830745276a893d6b8deb3b2b5adea10c825. @minhoolee can you check that you're on the latest version / 1.9?
@gdankel My Kineto is on https://github.com/pytorch/kineto/commit/d51f60717e4eb7746e370cff09fecbc6616be25f (plugin/0.2) and I installed it from source. PyTorch is 1.9.0:
>>> import torch
>>> torch.__version__
'1.9.0+cu102'
@minhoolee Should be fine. I'll see if I can repro locally.
@minhoolee Did you install PyTorch 1.9.0 through pip or conda? @gdankel I guess the PyTorch 1.9.0 package doesn't include your fix, and minhoolee installed PyTorch 1.9.0 from the prebuilt package but the TensorBoard plugin from source.
@gdankel, it seems that PyTorch 1.9.0 uses the old logic, which still pins kineto@a631215.
If cuptiDeviceGetNumEventDomains in getSMCount returns an error code, then domainCount would be 0? In that case, the default -1 will be returned as the SM count.
I am guessing this is the root cause of returning -1 even though HAS_CUPTI is defined? Is that possible?
@gaoteng-git I installed the latest PyTorch through pip a few weeks ago.
@gaoteng-git To be clear, do you think that fixing getSMCount and Est. Achieved Occupancy will fix the issue I was observing earlier? I could try producing the traces again with the updated Kineto submodule.
Yes. The fix is to call smCount from CudaDeviceProperties.cpp instead of from CuptiActivityInterface.cpp. This fix went in after the PyTorch 1.9 release, so a PyTorch nightly build should work, but the PyTorch 1.9 release will not.
Hi @gaoteng-git, I had the same problem. When analyzing model performance, the profiler's overhead on the CPU is large with batch size = 1, which makes analysis tricky. I found the exact overhead number of 4 us per op on CUDA; is there any number for the overhead on CPU?
CPU overhead is a few hundred nanoseconds to a microsecond, depending on what you're measuring. We've put a lot of work into reducing the overhead recently, but for small dispatch-limited models some slowdown is expected. We can't do much about the GPU overhead; that comes from CUPTI.
Thank you @robieta. If we only measure op time without shapes, stacks, etc., what's the overhead? Will the CPU overhead be counted into the operator's execution time on the CPU?
If we only measure op time without shapes, stacks, etc., what's the overhead?
Last time I benchmarked, it was ~250-350ns for op profiling (something like 100ns of that is observer machinery that you'd get just from registering a no-op observer) and ~200ns for Python functions. (The latter is due to a recent known regression; once that's fixed it is more like 100ns.)
Will the CPU overhead be counted into the operator's execution time on the CPU?
Times are measured late in the start callback, and the end callback does almost no work, so we attribute as little of the profiling overhead to the op as possible. (Although of course that means that if foo() calls bar(), the time to profile bar is counted in foo's time.)
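For a very rough do-it-yourself estimate of the per-op CPU overhead, the sketch below times a small dispatch-bound loop with and without the profiler and divides the difference by the op count (a hypothetical measurement, not how the numbers above were produced; results vary by machine and build).

import time
import torch

x = torch.ones(1)
N = 100_000

def loop():
    y = x
    for _ in range(N):
        y = y + 1  # one tiny aten::add per iteration; dispatch-bound on CPU
    return y

loop()  # warmup
t0 = time.perf_counter()
loop()
base = time.perf_counter() - t0

# Same loop while the profiler records CPU ops only.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]):
    t0 = time.perf_counter()
    loop()
    profiled = time.perf_counter() - t0

print(f"~{(profiled - base) / N * 1e9:.0f} ns extra per op (very rough)")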
@robieta Thanks for your detailed explanation.
250-350ns for op profiling
But as described in https://github.com/pytorch/pytorch/issues/82793, the profiler overhead on CPU seems to be very large.
Let's continue this discussion there.
Closing as the discussion is moving to pytorch#82793.
I'm trying to use torch.profiler to profile the inference time of a trained model. In order to assess the accuracy of torch.profiler, I also profile the inference time using torch.utils.benchmark.
However, the torch.utils.benchmark time is much lower than the reported torch.profiler step time, seemingly due to abnormally high "CPU Exec" time reported by the TensorBoard plugin. The torch.utils.benchmark time is consistent with manual timing using time.time() start/end timestamps and torch.cuda.synchronize().
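The manual timing mentioned above presumably looks something like this (a sketch of the approach, not the author's exact code): synchronize before and after so the measured wall-clock time includes all queued GPU work.

import time
import torch

def manual_time(model, data, iters=100):
    torch.cuda.synchronize()  # make sure no earlier GPU work is still running
    t0 = time.time()
    for _ in range(iters):
        model(data)
    torch.cuda.synchronize()  # wait for all launched kernels to finish
    return (time.time() - t0) / iters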
I have produced a reproducible toy example to illustrate this behavior. The high CPU Exec time is introduced by a large number of consecutive linear layers. I run the profiler with 1 warmup, 5 active trials, and shape and stack recording.
torch.utils.benchmark timing over 100 trials:
Overview
Zoomed in trace of step 4
Zoomed in trace of step 4 (with dependencies marked)
Note: the second row of green CPU operators is aten::linear, and stream 7 of orange GPU operators is volta_sgemm_128x32_tn.