Is there an existing issue for this bug?

🐛 Describe the bug
Git commit: 2f583c1549 (current master branch)

Running the example code from the ColossalAI inference README returns an empty response list (pprint prints []) instead of generated text.

Code (the example from the ColossalAI inference README):
import torch
import transformers
import colossalai
from colossalai.inference import InferenceEngine, InferenceConfig
from pprint import pprint
colossalai.launch_from_torch()
model_path = "lmsys/vicuna-7b-v1.3"
model = transformers.LlamaForCausalLM.from_pretrained(model_path).cuda()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
inference_config = InferenceConfig(
    dtype=torch.float16,
    max_batch_size=4,
    max_input_len=1024,
    max_output_len=512,
    use_cuda_kernel=True,
)
engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
prompts = ['Who is the best player in the history of NBA?']
response = engine.generate(prompts)
pprint(response)
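For reference, a minimal variant of the last two calls that passes an explicit GenerationConfig is sketched below. It is only a diagnostic sketch: it assumes the installed InferenceEngine.generate accepts prompts and generation_config keyword arguments (not verified against this commit), and is meant to check whether the empty response depends on the default generation settings.

from transformers import GenerationConfig

# Diagnostic sketch (assumption: engine.generate accepts `prompts=` and
# `generation_config=` keywords in this version of ColossalAI).
generation_config = GenerationConfig(max_new_tokens=128, do_sample=False)
response = engine.generate(prompts=prompts, generation_config=generation_config)
pprint(response)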
Run command:
colossalai run --nproc_per_node 1 speed.py
Output:
/data/miniconda/envs/torch/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/data/coding/ColossalAI/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
[11/04/24 11:04:32] INFO colossalai - colossalai - INFO:
/data/coding/ColossalAI/colossalai/initialize.py:75
launch
INFO colossalai - colossalai - INFO: Distributed
environment is initialized, world size: 1
/data/miniconda/envs/torch/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00, 8.83s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
/data/miniconda/envs/torch/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[extension] Time taken to load inference_ops_cuda op: 0.16129255294799805 seconds
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.001485586166381836 seconds
[11/04/24 11:05:06] WARNING colossalai - colossalai.inference.utils - WARNING:
/data/coding/ColossalAI/colossalai/inference/utils.
py:162 can_use_flash_attn2
WARNING colossalai - colossalai.inference.utils - WARNING:
flash_attn2 has not been installed yet, we will use
triton flash attn instead.
[11/04/24 11:05:06] INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:158 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: the device is cuda:0
INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:163 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: Before the shard, Rank: [0], model size:
12.551277160644531 GB, model's device is: cuda:0
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.0019431114196777344 seconds
[... the two lines above repeat ~64 times in total; each subsequent load takes roughly 0.0007-0.0015 seconds ...]
[11/04/24 11:05:08] INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:193 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: After the shard, Rank: [0], model size:
12.551277160644531 GB, model's device is: cuda:0
INFO colossalai - colossalai.inference.core.llm_engine -
INFO:
/data/coding/ColossalAI/colossalai/inference/core/l
lm_engine.py:208 init_model
INFO colossalai - colossalai.inference.core.llm_engine -
INFO: Rank [0], Model Weight Max Occupy 2.33984375
GB, Model size: 12.551277160644531 GB
[11/04/24 11:05:08] INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO:
/data/coding/ColossalAI/colossalai/inference/kv_cac
he/kvcache_manager.py:98 __init__
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO: Allocating K cache with shape: (384, 32, 16,
16, 8), V cache with shape: (384, 32, 16, 128)
consisting of 384 blocks.
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO:
/data/coding/ColossalAI/colossalai/inference/kv_cac
he/kvcache_manager.py:115 __init__
INFO colossalai -
colossalai.inference.kv_cache.kvcache_manager -
INFO: Allocated 3.00 GB of KV cache on device
cuda:0.
[]
====== Training on All Nodes =====
127.0.0.1: success
====== Stopping All Nodes =====
127.0.0.1: finish
Environment
PyTorch 2.3.1, Python 3.10
NVIDIA V100 32 GB, CUDA 12.4 (per nvidia-smi)
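For completeness, the version numbers above can be confirmed with a small snippet like the one below (a sketch; it only prints standard version attributes and the visible GPU):

import torch
import transformers
import colossalai

# Library versions and GPU used for this report.
# Note: torch.version.cuda is the CUDA version PyTorch was built with,
# which may differ from the driver version shown by nvidia-smi.
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("colossalai:", colossalai.__version__)
print("GPU:", torch.cuda.get_device_name(0))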