intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

vLLM hangs while running on A770 #11694

Open biyuehuang opened 3 months ago

biyuehuang commented 3 months ago

A770, Ubuntu system.

for n in $(seq 16 16); do
    echo "Model= $MODEL RATE= 0.7 N= $n..."
    python3 benchmark_vllm_throughput.py \
        --backend vllm \
        --model $MODEL \
        --num-prompts 100 \
        --trust-remote-code \
        --enforce-eager \
        --dtype float16 \
        --device xpu \
        --load-in-low-bit sym_int4 \
        --gpu-memory-utilization 0.70 \
        --input-len 1024 \
        --output-len 512 \
        --max-num-seqs $n \
        --max-model-len 2048 \
        --max-num-batched-tokens 4096 \
        --tensor-parallel-size 2
done
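Note that MODEL is not set inside the loop; judging from the Namespace printout in the log below, it is assumed to be exported beforehand, e.g.:

    # Assumed setup before running the loop; path taken from the log output below.
    export MODEL=/opt/Qwen1.5-14B-Chat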

(screenshot attached)

Below is the printed output; it hangs at the last line.

Model= /opt/Qwen1.5-14B-Chat RATE= 0.7 N= 16...

Namespace(backend='vllm', dataset=None, input_len=1024, output_len=512, model='/opt/Qwen1.5-14B-Chat', tokenizer='/opt/Qwen1.5-14B-Chat', quantization=None, tensor_parallel_size=2, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=True, max_model_len=2048, dtype='float16', gpu_memory_utilization=0.7, enforce_eager=True, kv_cache_dtype='auto', device='xpu', enable_prefix_caching=False, load_in_low_bit='sym_int4', max_num_batched_tokens=4096, max_num_seqs=16)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

/root/miniforge3/envs/vtune-vllm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?

  warn(

2024-07-31 21:20:24,618 - INFO - intel_extension_for_pytorch auto imported

WARNING 07-31 21:20:31 config.py:710] Casting torch.bfloat16 to torch.float16.

INFO 07-31 21:20:31 config.py:523] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.

2024-07-31 21:20:42,830 INFO worker.py:1788 -- Started a local Ray instance.

INFO 07-31 21:20:50 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/opt/Qwen1.5-14B-Chat', tokenizer='/opt/Qwen1.5-14B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=4096, max_num_seqs=16, max_model_len=2048)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

(RayWorkerVllm pid=80957) /root/miniforge3/envs/vtune-vllm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?

(RayWorkerVllm pid=80957)   warn(
gc-fu commented 3 months ago

Hi, this is a known issue when using the tensor-parallel feature on the host / bare metal.

You can try following our Docker guide to solve this problem.
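For reference, running the benchmark inside the serving container looks roughly like the sketch below; the image name/tag, mount path, and flags are assumptions on my side, so please follow the actual ipex-llm Docker guide for the exact commands:

    # Hedged sketch: image tag and paths are assumptions, see the official guide.
    docker run -itd \
        --net=host \
        --device=/dev/dri \
        --privileged \
        --shm-size="16g" \
        -v /opt/Qwen1.5-14B-Chat:/llm/models/Qwen1.5-14B-Chat \
        --name=ipex-llm-vllm-xpu \
        intelanalytics/ipex-llm-serving-xpu:latest
    docker exec -it ipex-llm-vllm-xpu bash
    # then run the same benchmark_vllm_throughput.py command inside the container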