Is there an existing issue for this bug?

🐛 Describe the bug
Git commit: 2f583c1549 (current master branch)

Running the example code from the ColossalAI inference README returns an empty response list (pprint prints []) instead of generated text.

Code (the example from the ColossalAI inference README):
import torch
import transformers
import colossalai
from colossalai.inference import InferenceEngine, InferenceConfig
from pprint import pprint
colossalai.launch_from_torch()
model_path = "lmsys/vicuna-7b-v1.3"
model = transformers.LlamaForCausalLM.from_pretrained(model_path).cuda()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
inference_config = InferenceConfig(
    dtype=torch.float16,
    max_batch_size=4,
    max_input_len=1024,
    max_output_len=512,
    use_cuda_kernel=True,
)
engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
prompts = ['Who is the best player in the history of NBA?']
response = engine.generate(prompts)
pprint(response)
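For reference, a minimal variant of the last two calls that passes an explicit GenerationConfig is sketched below. It is only a diagnostic sketch: it assumes the installed InferenceEngine.generate accepts prompts and generation_config keyword arguments (not verified against this commit), and is meant to check whether the empty response depends on the default generation settings.

from transformers import GenerationConfig

# Diagnostic sketch (assumption: engine.generate accepts `prompts=` and
# `generation_config=` keywords in this version of ColossalAI).
generation_config = GenerationConfig(max_new_tokens=128, do_sample=False)
response = engine.generate(prompts=prompts, generation_config=generation_config)
pprint(response)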
Run command:
colossalai run --nproc_per_node 1 speed.py
Output:
/data/miniconda/envs/torch/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/data/coding/ColossalAI/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
[11/04/24 11:04:32] INFO colossalai - colossalai - INFO:
/data/coding/ColossalAI/colossalai/initialize.py:75
launch
INFO colossalai - colossalai - INFO: Distributed
environment is initialized, world size: 1
/data/miniconda/envs/torch/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00, 8.83s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
/data/miniconda/envs/torch/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[extension] Time taken to load inference_ops_cuda op: 0.16129255294799805 seconds
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.001485586166381836 seconds
[11/04/24 11:05:06] WARNING colossalai - colossalai.inference.utils - WARNING:
/data/coding/ColossalAI/colossalai/inference/utils.
py:162 can_use_flash_attn2
WARNING colossalai - colossalai.inference.utils - WARNING:
flash_attn2 has not been installed yet, we will use
triton flash attn instead.
[11/04/24 11:05:06] INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:158 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: the device is cuda:0
INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:163 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: Before the shard, Rank: [0], model size:
12.551277160644531 GB, model's device is: cuda:0
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.0019431114196777344 seconds
[... the two lines above repeat ~64 times in total; each subsequent load takes roughly 0.0007-0.0015 seconds ...]
[11/04/24 11:05:08] INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:193 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: After the shard, Rank: [0], model size:
12.551277160644531 GB, model's device is: cuda:0
INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:208 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: Rank [0], Model Weight Max Occupy 2.33984375
GB, Model size: 12.551277160644531 GB
[11/04/24 11:05:08] INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO:
/data/coding/ColossalAI/colossalai/inference/kv_cac
he/kvcache_manager.py:98 __init__
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO: Allocating K cache with shape: (384, 32, 16,
16, 8), V cache with shape: (384, 32, 16, 128)
consisting of 384 blocks.
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO:
/data/coding/ColossalAI/colossalai/inference/kv_cac
he/kvcache_manager.py:115 __init__
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO: Allocated 3.00 GB of KV cache on device
cuda:0.
[]
====== Training on All Nodes =====
127.0.0.1: success
====== Stopping All Nodes =====
127.0.0.1: finish
Environment
PyTorch 2.3.1, Python 3.10
NVIDIA V100 32 GB, CUDA 12.4 (per nvidia-smi)
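For completeness, the version numbers above can be confirmed with a small snippet like the one below (a sketch; it only prints standard version attributes and the visible GPU):

import torch
import transformers
import colossalai

# Library versions and GPU used for this report.
# Note: torch.version.cuda is the CUDA version PyTorch was built with,
# which may differ from the driver version shown by nvidia-smi.
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("colossalai:", colossalai.__version__)
print("GPU:", torch.cuda.get_device_name(0))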