pytorch / torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
BSD 3-Clause "New" or "Revised" License

Compiled model runs slower on V100 GPU #2008

Closed · stephen-youn closed this 1 year ago

stephen-youn commented 1 year ago

🐛 Describe the bug

Hi, I tried the BERT and ResNet examples from the tutorial https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/, but they ran slower with "torch.compile" on a V100 under my Ubuntu environment (Linux GCRHYP3C148 4.15.0-193-generic #204-Ubuntu SMP). Isn't it supposed to be faster? Thanks.

Error logs

No response

Minified repro

""" resnet """

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")

# time a single eager forward pass
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

# time a single compiled forward pass
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
opt_model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

This produces the following output, where the compiled model runs ~74x slower:

~/project/sandbox$ python hello_torchdynamo4.py
Using cache found in /home/styoun/.cache/torch/hub/pytorch_vision_v0.10.0
estimated_ms=223.81260681152344
estimated_ms=16573.572265625

It's similar for the following BERT example from the tutorial: it runs 14.7x slower with the extra line "model = torch.compile(model)".

import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model) # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
output = model(**encoded_input)
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")
williamwen42 commented 1 year ago

Different torch.compile modes can yield different performance (e.g. torch.compile(model, mode="max-autotune")).

Also, torch.compile will generally take longer on the first call since it needs to compile, but subsequent calls are expected to be faster than the eager baseline.
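For illustration, here is a minimal sketch of the built-in modes (the tiny nn.Linear model is just a placeholder assumption, not the model from this issue):

import torch
import torch.nn as nn

model = nn.Linear(64, 64).cuda()  # placeholder model for illustration

# The built-in modes trade compile time for steady-state speed:
default_mode = torch.compile(model)                          # balanced default
low_overhead = torch.compile(model, mode="reduce-overhead")  # uses CUDA graphs; helps small, overhead-bound models
max_tuned    = torch.compile(model, mode="max-autotune")     # longest compile; autotunes the generated (Triton) kernels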

stephen-youn commented 1 year ago

I tried running it twice, but it was still slower. Are there any suggestions for debugging this (e.g., passing a particular option to compile, or enabling trace or verbose output)?

anijain2305 commented 1 year ago

@stephen-youn Thanks for trying out torch.compile. PyTorch 2.0 compilers are JIT compilers, i.e., they compile the model on the first iteration. Your script measures that first-iteration latency, hence the high numbers you are observing. I modified your script and am observing better numbers on an A100 GPU (the numbers are not stable, probably because we are measuring just one iteration, but the speedup is evident).

Script

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")

# warmup
for _ in range(3):
    model(torch.randn(1,3,64,64))

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

# warmup
for _ in range(3):
    opt_model(torch.randn(1,3,64,64))

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
opt_model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

Output

estimated_ms=1222.14990234375
estimated_ms=326.3006286621094
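
Since single-iteration numbers like these are noisy, a steadier measurement could use torch.utils.benchmark, e.g. (a sketch reusing opt_model from the script above):

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1, 3, 64, 64)
timer = benchmark.Timer(
    stmt="opt_model(x)",
    globals={"opt_model": opt_model, "x": x},
)
print(timer.timeit(100))  # mean time per call over 100 runs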

Please let me know if you have any other questions, and feel free to close the issue if this answers your question.
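
Regarding your earlier question about trace/verbose output: the exact knobs have moved between releases, so treat the following as a version-dependent sketch rather than a stable API (it reuses model from the script above):

import torch
import torch._dynamo as dynamo

dynamo.config.verbose = True  # 2.0-era flag for more detail on graph breaks and recompiles

# dynamo.explain summarizes the captured graphs and graph-break reasons for a callable;
# its call convention changed in later releases, so check your installed version
print(dynamo.explain(model, torch.randn(1, 3, 64, 64)))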

stephen-youn commented 1 year ago

Yes, I also modified the code similarly and got a perf gain on V100 too. One follow-up question: what is the difference between torch.compile(model, passes={"triton-autotune": True}) and torch.compile(model, backend="inductor")? Does one use Triton for matmuls and the other not? What is the default matmul kernel in Inductor? Isn't it Triton? It seems the default mm is set to "aten", not "triton" (link). How can I make sure Triton is used for matmuls?

anijain2305 commented 1 year ago

@stephen-youn

Reading between the lines, it seems you are interested in the mm operators. For those, Inductor currently defaults to the ATen kernels (as the config you linked shows); the Triton matmul templates are opt-in, e.g. via that config or via mode="max-autotune".
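
A sketch of what that opt-in might look like, assuming the triton.mm knob from the Inductor config linked above (the knob's name and values are repo-version assumptions):

import torch
import torch._inductor.config as inductor_config

# assumed knob from the linked config; "aten" was the default,
# "triton" selects the Triton matmul templates
inductor_config.triton.mm = "triton"

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder whose forward is a matmul
opt_model = torch.compile(model, backend="inductor")
out = opt_model(torch.randn(8, 1024, device="cuda"))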

stephen-youn commented 1 year ago

I tried "opt_model = torch.compile(model, passes={'triton-mm': "triton", 'triton-bmm': True})" but it crashed. so i opened an issue here (link)