prabhuteja12 closed this issue 2 years ago.
Hi @prabhuteja12 ,
To perform robust benchmarking you can check the timm repository, in particular the following file:
https://github.com/rwightman/pytorch-image-models/blob/master/benchmark.py#L244
Two things I do not see in your snippet but can spot in timm are:
I hope this helps!
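The timm benchmark linked above warms the model up and synchronizes CUDA around the timed section before reading the clock. A minimal sketch of that pattern (the function and parameter names here are my own, not timm's):

```python
import time

import torch

def time_model(model, x, warmup_iters=3, num_iters=30):
    """Return mean seconds per forward pass, with warmup and CUDA sync."""
    with torch.no_grad():
        # warmup: exclude one-off costs (kernel selection, caching) from timing
        for _ in range(warmup_iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # drain queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(num_iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # GPU kernels run asynchronously; sync before stopping
        elapsed = time.perf_counter() - start
    return elapsed / num_iters
```

Without the synchronize calls, the timer can stop before the GPU has actually finished, which makes the model look much faster than it is.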
Hi @rstrudel ,
Thank you for the quick reply and the pointers to timm. I already have `torch.set_grad_enabled(False)` in my code. Additionally, timm uses `perf_counter`, and so does PyTorch.
So I'm unable to figure out what is causing this massive drop in speed. If you have access to your specific benchmarking script, can you share that?
The snippet of code is based on timm benchmarking and we do not plan to release a clean version of it as of now. Table 3 refers to Segmenter models with a linear decoder, did you make sure you measured timings with this model and not the mask decoder (which is more expensive)? I will try to check this in the following weeks.
@rstrudel I think I was able to get the numbers (in that range). The data and the model need to be in float16. Please confirm that this is what you did as well.
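A minimal sketch of casting both model and input to float16 (using a stand-in `Linear` module, not the actual Segmenter model):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# stand-in module; with Segmenter you would cast the loaded model the same way
model = torch.nn.Linear(16, 8).to(device).half()
x = torch.randn(4, 16, device=device, dtype=torch.float16)

assert model.weight.dtype == torch.float16
assert x.dtype == torch.float16
if device == "cuda":  # run the half-precision forward pass on GPU
    y = model(x)
    assert y.dtype == torch.float16
```

Note that plain `.half()` is full float16 everywhere, which is not quite the same as the `autocast` mixed precision used in the snippet below; both are much faster than float32 on a V100.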
Thanks!
Here is the snippet of code I used to check throughput. Indeed, good call: I was using mixed precision, well spotted!
Your use of `Timer` is probably a cleaner way of doing it; I'm not sure it existed (or at least I did not know about it) at the time of the experiments.
```python
import time

import torch

device = "cuda"

@torch.no_grad()
def compute_throughput(model, batch_size, resolution):
    torch.cuda.empty_cache()
    warmup_iters = 3
    num_iters = 30
    timing = []
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)

    # warmup: exclude one-off costs (kernel selection, caching) from the timing
    for _ in range(warmup_iters):
        with torch.cuda.amp.autocast():
            y = model(x)

    MB = 1024.0 * 1024.0
    torch.cuda.synchronize()
    for i in range(num_iters):
        memory = int(torch.cuda.max_memory_allocated() / MB)
        if i == 0:
            print(f"memory: {memory} MB")
        start = time.time()
        with torch.cuda.amp.autocast():
            y = model(x)
        # CUDA kernels run asynchronously: synchronize before stopping the clock
        torch.cuda.synchronize()
        timing.append(time.time() - start)

    timing = torch.as_tensor(timing, dtype=torch.float32)
    return batch_size / timing.mean()
```
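For comparison, the `Timer` from `torch.utils.benchmark` mentioned above handles warmup and CUDA synchronization itself. A sketch of the same measurement using it (this wrapper is my own, not from the thread):

```python
import torch
from torch.utils.benchmark import Timer

def throughput_with_timer(model, batch_size, resolution, device="cuda"):
    """Images/sec via torch.utils.benchmark.Timer, which warms up and syncs CUDA."""
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    t = Timer(
        stmt="with torch.no_grad(): model(x)",
        globals={"model": model, "x": x, "torch": torch},
    )
    measurement = t.timeit(30)  # returns a Measurement with per-run statistics
    return batch_size / measurement.mean
```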
Great! Thank you!
Hello,
I'm trying to benchmark the speed of the ViT-Tiny backbone with the mask decoder. However, I'm only able to reach a batch size of 54, even though there seems to be memory available for more. While benchmarking on a Tesla V100 32 GB with torch versions 2.0.1 and 1.13.1, I encountered the following error:

```
RuntimeError: upsample_bilinear2d_nhwc only supports output tensors with less than INT_MAX elements
```

Did you run into this error, and if so, were you able to overcome it? Thank you!
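A back-of-the-envelope check may explain why 54 is the ceiling: the error says the upsampling output must have fewer than INT_MAX elements, and assuming the decoder upsamples logits for 150 classes (ADE20K) at 512×512 resolution (both assumptions on my part, not stated in the thread), the element count crosses INT_MAX exactly between batch sizes 54 and 55:

```python
INT_MAX = 2**31 - 1  # limit from the upsample_bilinear2d_nhwc error message

def output_elements(batch, channels, height, width):
    """Total elements in the upsampled output tensor."""
    return batch * channels * height * width

# assumed shapes: 150 ADE20K classes, 512x512 output resolution
print(output_elements(54, 150, 512, 512) < INT_MAX)  # True: fits
print(output_elements(55, 150, 512, 512) < INT_MAX)  # False: exceeds the limit
```

If that is the cause, the limit comes from the kernel rather than from GPU memory, so a larger batch would require splitting the forward pass or upsampling in chunks.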
Hi,
Thank you for the cool work!
I see that you report images/sec, and mention the following in the paper:
I'm trying to do the same; however, I'm unable to reproduce the images/sec numbers from the paper.
I'm using the code snippet from PyTorch as follows:
The batch size that fits on a V100 for the ViT-T backbone is about 140, and the above code shows a timing of 0.62 seconds, so I'm computing images/sec = 140 / 0.62 ≈ 225.8. This is almost half the number in Table 3. Can you please help me understand what I need to do to get the reported result?
Thank you!
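For completeness, the throughput arithmetic from the post above in code form:

```python
batch_size = 140          # largest batch that fits on the V100 for ViT-T
seconds_per_batch = 0.62  # measured forward-pass time

images_per_sec = batch_size / seconds_per_batch
print(round(images_per_sec, 1))  # 225.8
```

As the rest of the thread shows, the missing factor of ~2 came from float32 vs. float16/mixed-precision inference, not from the arithmetic.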