intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Nano: InferenceOptimizer context manager proposal #6675

Open TheaperDeng opened 1 year ago

TheaperDeng commented 1 year ago

InferenceOptimizer context manager proposal

This is only a preliminary idea for discussion; we will implement it experimentally once we agree it is a good way forward.

Why?

We have found many BKCs (best-known configurations: torch.no_grad, autocast, set_num_threads, instruction checks, framework version checks, ...) that users must apply or set correctly, but that we cannot apply automatically inside the optimized model's forward call (model(x)) without degrading performance, because they would be re-applied again and again in an iterative calling pattern such as:

for data in dataloader:
    model(data)
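
For concreteness, here is roughly the boilerplate a user has to get right by hand around such a loop today (a minimal sketch only; the thread count, dataloader, and BF16 autocast choice are illustrative assumptions, not Nano API):

import torch

torch.set_num_threads(8)            # thread-count control (illustrative value)
with torch.no_grad():               # disable autograd bookkeeping
    with torch.cpu.amp.autocast():  # BF16 autocast on CPU
        for data in dataloader:     # `dataloader` assumed to be defined by the user
            model(data)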

Still, we want to spare our users from having to apply so many trivial and error-prone settings and checks themselves. That is why we propose to have InferenceOptimizer generate a context manager together with the best optimized model.

What does it look like?

The user gets a context manager (this manager does a number of things, illustrated in detail later) and wraps it around all of the inference code in which they want to use the Nano-optimized model.

inference_opt = InferenceOptimizer()
inference_opt.optimize(model, ...)
model_nano = inference_opt.get_model("bf16_ipex")
context_manager_nano = inference_opt.get_context_manager("bf16_ipex")

with context_manager_nano:  # <--- The only thing changed
    # all of the inference code the user ran before

I understand that APIs such as InferenceOptimizer.trace/quantize may then not be fully consistent and compatible with what we have now; we could discuss options such as adding a parameter, e.g. model, context_mgr = InferenceOptimizer.trace(..., context_manager=True).
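
Purely as an illustration, usage of that variant might look like the following (a sketch only, not a settled API; "..." stands for the usual trace() arguments):

model_nano, context_manager_nano = InferenceOptimizer.trace(
    model, ..., context_manager=True)  # "..." is a placeholder, not real arguments

with context_manager_nano:
    for data in dataloader:  # `dataloader` assumed to be defined by the user
        model_nano(data)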

What will the context manager do?

Since the context manager's __enter__ is only called once, it is acceptable if its latency is a little high.
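
A minimal sketch of what such a generated context manager could do (hypothetical class and parameter names, not an agreed-on implementation): apply the long-lived BKCs once in __enter__ and undo them in __exit__.

import torch

class NanoInferenceContext:
    # Hypothetical sketch: enter the BKCs once, restore the previous state on exit.

    def __init__(self, thread_num=None, enable_bf16=False):
        self.thread_num = thread_num
        self.enable_bf16 = enable_bf16

    def __enter__(self):
        # remember the current thread setting so __exit__ can restore it
        self._prev_threads = torch.get_num_threads()
        if self.thread_num is not None:
            torch.set_num_threads(self.thread_num)
        # enter no_grad and autocast once instead of once per forward call
        self._no_grad = torch.no_grad()
        self._autocast = torch.cpu.amp.autocast(enabled=self.enable_bf16)
        self._no_grad.__enter__()
        self._autocast.__enter__()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # recovery work: leave autocast/no_grad and restore the thread count
        self._autocast.__exit__(exc_type, exc_val, exc_tb)
        self._no_grad.__exit__(exc_type, exc_val, exc_tb)
        torch.set_num_threads(self._prev_threads)
        return False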

TheaperDeng commented 1 year ago

@MeouSker77 @rnwang04 @yangw1234 @jason-dai Please provide any comments you have.

yangw1234 commented 1 year ago

What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just run that code when the user first calls __call__? E.g.:


class OptimizedModel:

    def __call__(self, x):
        # one-time setup (thread settings, checks, ...) on the first call
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...

jason-dai commented 1 year ago

What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just run that code when the user first calls __call__? E.g.:

class OptimizedModel:

    def __call__(self, x):
        # one-time setup (thread settings, checks, ...) on the first call
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...

Won't this code be called every time model(x) is called?

yangw1234 commented 1 year ago

What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just run that code when the user first calls __call__? E.g.:

class OptimizedModel:

    def __call__(self, x):
        # one-time setup (thread settings, checks, ...) on the first call
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...

Won't this code be called every time model(x) is called?

self.setup() will be called only the first time model(x) is called.

For torch.no_grad() and torch.autocast, I don't think they have very large overhead, since they are often used inside a module.

TheaperDeng commented 1 year ago

For torch.no_grad() and torch.autocast, I don't think they have very large overhead, since they are often used inside a module.

Unfortunately, that's not the case. torch.no_grad() and torch.autocast bring a large (>30%) overhead, especially when we are testing online latency/throughput (batch size = 1).

Here is an example

import torch
from torchvision.models import resnet18
import time

if __name__ == "__main__":
    model_ft = resnet18(pretrained=True)
    model_ft.eval()

    x = torch.rand(1, 3, 224, 224)

    # inside the iter
    st = time.time()
    for _ in range(100):
        with torch.no_grad():
            with torch.cpu.amp.autocast():
                model_ft(x)
    print(time.time() - st)

    # inside the iter, but the initialization is outside the iter
    st = time.time()
    autocast = torch.cpu.amp.autocast()
    for _ in range(100):
        with torch.no_grad():
            with autocast:
                model_ft(x)
    print(time.time() - st)

    # outside the iter
    st = time.time()
    with torch.no_grad():
        with torch.cpu.amp.autocast():
            for _ in range(100):
                model_ft(x)
    print(time.time() - st)

The timings for the three variants, measured on a server (limited to 8 cores; the trend is the same no matter how many cores are used), are:

1.2232913970947266  (context managers inside the loop)
1.200247049331665   (inside the loop, but initialized outside)
0.7643022537231445  (context managers outside the loop)

TheaperDeng commented 1 year ago

What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just run that code when the user first calls __call__? E.g.:

class OptimizedModel:

    def __call__(self, x):
        # one-time setup (thread settings, checks, ...) on the first call
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...

For __exit__ we will just do some recovery work, such as restoring the thread-number setting and, of course, calling the original __exit__ of the context managers we used.

jason-dai commented 1 year ago

What will be called in __exit__? If the code in __exit__ is not necessary, maybe we can just run that code when the user first calls __call__? E.g.:

class OptimizedModel:

    def __call__(self, x):
        # one-time setup (thread settings, checks, ...) on the first call
        if not self.initialized:
            self.setup()
            self.initialized = True
        with torch.no_grad():
            with torch.autocast("cpu", enabled=self.bf16_enabled):
                # regular forward code
                ...

Won't this code be called every time model(x) is called?

self.setup() will be called only the first time model(x) is called.

For torch.no_grad() and torch.autocast, I don't think they have very large overhead, since they are often used inside a module.

Calling autocast inside the inference loop has ~30% overhead vs. calling it outside the loop.

yangw1234 commented 1 year ago

Calling autocast inside the inference loop has ~30% overhead vs. calling it outside the loop.

In that case, I think context manager makes sense.

yangw1234 commented 1 year ago

Can the context manager and the model mix and match, or does each model have its own dedicated context manager?