Open TheaperDeng opened 1 year ago
@MeouSker77 @rnwang04 @yangw1234 @jason-dai Please provide any comment you have.
What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just call that code in __enter__ when the user first calls __call__?
E.g.:

class OptimizedModel:
    def __call__(self, x):
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...
Won't this code be called every time model(x) is called?
self.setup() will be called only the first time model(x) is called.

For torch.no_grad() and torch.autocast, I don't think they have very large overhead, since they are often used inside a module.
Unfortunately, that's not the case. torch.no_grad() and torch.autocast bring huge (>30%) overhead, especially when we are testing online latency/throughput (batch size = 1). Here is an example:
import torch
from torchvision.models import resnet18
import time

if __name__ == "__main__":
    model_ft = resnet18(pretrained=True)
    model_ft.eval()
    x = torch.rand(1, 3, 224, 224)

    # inside the iter
    st = time.time()
    for _ in range(100):
        with torch.no_grad():
            with torch.cpu.amp.autocast():
                model_ft(x)
    print(time.time() - st)

    # inside the iter, but the initialization is outside the iter
    st = time.time()
    autocast = torch.cpu.amp.autocast()
    for _ in range(100):
        with torch.no_grad():
            with autocast:
                model_ft(x)
    print(time.time() - st)

    # outside the iter
    st = time.time()
    with torch.no_grad():
        with torch.cpu.amp.autocast():
            for _ in range(100):
                model_ft(x)
    print(time.time() - st)
The times for the three runs on a server (limited to 8 cores, but the trend remains the same no matter how many cores are used) are:
1.2232913970947266
1.200247049331665
0.7643022537231445
For __exit__, we will just do some recovery work, such as setting the thread num control back, and of course calling the original __exit__ of the context managers we used.
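A minimal sketch of that enter/exit idea (the class name InferenceContextManager and its fields are hypothetical, not from the actual code base):

import torch

class InferenceContextManager:
    def __init__(self, thread_num=None, enable_bf16=False):
        self.thread_num = thread_num
        self.enable_bf16 = enable_bf16

    def __enter__(self):
        # remember the current thread setting so __exit__ can restore it
        self._original_thread_num = torch.get_num_threads()
        if self.thread_num is not None:
            torch.set_num_threads(self.thread_num)
        # enter the expensive context managers once, instead of on every forward
        self._no_grad = torch.no_grad()
        self._no_grad.__enter__()
        self._autocast = torch.cpu.amp.autocast(enabled=self.enable_bf16)
        self._autocast.__enter__()
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        # recovery work: exit the wrapped context managers and restore the thread num
        self._autocast.__exit__(exc_type, exc_value, exc_tb)
        self._no_grad.__exit__(exc_type, exc_value, exc_tb)
        torch.set_num_threads(self._original_thread_num)
        return False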
Calling autocast inside the inference loop has ~30% overhead vs. outside the loop.
In that case, I think the context manager makes sense.
Can the context manager and the model mix and match, or does each model have its own dedicated context manager?
InferenceOptimizer context manager proposal
This is only a preview idea for discussion; we will experimentally implement it once we agree this is a good way to move forward.
Why?
We found many BKCs (torch.no_grad, autocast, set_num_threads, instruction check, framework version check...) that users should apply or set correctly, but that we could not apply automatically inside our optimized model's forward call such as model(x), mostly due to the performance degradation if we keep doing them again and again when users call the model iteratively (e.g. model(x) inside a loop).

Still, we try our best not to ask our users to deal with too many trivial and error-prone settings or checks. That's why we propose to have a context manager generated by InferenceOptimizer together with the best optimized model.

What does it look like?
Users will get a context manager (this manager does a lot of things, which will be illustrated in detail later) and put all the inference code in which they would like to use the nano optimized model under it.
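A hypothetical usage sketch, assuming the context_manager=True parameter discussed below and an assumed import path bigdl.nano.pytorch; none of this is implemented yet:

import torch
from torchvision.models import resnet18
from bigdl.nano.pytorch import InferenceOptimizer  # assumed import path

model_ft = resnet18(pretrained=True)
x = torch.rand(1, 3, 224, 224)

# context_manager=True is the proposed (not yet implemented) parameter
opt_model, context_mgr = InferenceOptimizer.trace(model_ft,
                                                  input_sample=x,
                                                  context_manager=True)

with context_mgr:      # no_grad / autocast / thread control applied once here
    for _ in range(100):
        opt_model(x)   # plain forward calls, no per-call overhead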
I understand that APIs like InferenceOptimizer.trace/quantize may then not have a signature consistent and compatible with before; we may have some discussion, such as adding a parameter like model, context_mgr = InferenceOptimizer.trace(..., context_manager=True).

What will the context manager do?
- apply torch.no_grad for pytorch models
- apply autocast for bf16 models
- apply thread_control for pytorch framework models
- check if the required instruction set is provided on the server (like bf16)
- check if the required library version is provided on the server (like we need torch>=1.12 for bf16; see the sketch after this list)
- (possibly) apply good thread control for openvino/onnxruntime by opening a new process for the code under this context manager
- ...
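A rough sketch of the two check items above; the helper names cpu_supports_bf16 and check_bf16_requirements are made up for illustration, and the avx512_bf16 flag check assumes a Linux host:

import torch

def cpu_supports_bf16():
    # crude instruction-set check: look for the avx512_bf16 flag on Linux
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_bf16" in f.read()
    except OSError:
        return False

def check_bf16_requirements():
    # library version check: bf16 autocast needs torch>=1.12
    major, minor = (int(v) for v in torch.__version__.split(".")[:2])
    if (major, minor) < (1, 12):
        raise RuntimeError("bf16 inference needs torch>=1.12, got " + torch.__version__)
    # instruction-set check on the current server
    if not cpu_supports_bf16():
        raise RuntimeError("this CPU does not report the avx512_bf16 instruction set")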
Since the context manager's __enter__ is only called once, it should be OK if its latency is a little bit high.