[Bug]: Running vllm docker image with neuron fails

yaronr commented 6 months ago

Your current environment

root@9c92d584ab5f:/app# python3 ./collect_env.py Collecting environment information... WARNING 05-15 15:13:52 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For multi-node inference, please install Ray with pip install ray. PyTorch version: 2.1.2+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: version 3.29.3 Libc version: glibc-2.31

Python version: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] (64-bit runtime) Python platform: Linux-4.14.343-260.564.amzn2.x86_64-x86_64-with-glibc2.31 Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 25 Model: 1 Model name: AMD EPYC 7R13 Processor Stepping: 1 CPU MHz: 3553.882 BogoMIPS: 5299.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 64 KiB L1i cache: 64 KiB L2 cache: 1 MiB L3 cache: 8 MiB NUMA node0 CPU(s): 0-3 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected, RAS-Poisoning: Vulnerable Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries: [pip3] numpy==1.25.2 [pip3] nvidia-nccl-cu12==2.18.1 [pip3] sagemaker_pytorch_inference==2.0.21 [pip3] torch==2.1.2 [pip3] torch-model-archiver==0.9.0 [pip3] torch-neuronx==2.1.1.2.0.1b0 [pip3] torch-xla==2.1.1 [pip3] torchserve==0.9.0 [pip3] torchvision==0.16.2 [pip3] triton==2.1.0 [conda] mkl 2024.0.0 ha957f24_49657 conda-forge [conda] mkl-include 2024.0.0 ha957f24_49657 conda-forge [conda] numpy 1.25.2 pypi_0 pypi [conda] nvidia-nccl-cu12 2.18.1 pypi_0 pypi [conda] sagemaker-pytorch-inference 2.0.21 pypi_0 pypi [conda] torch 2.1.2 pypi_0 pypi [conda] torch-model-archiver 0.9.0 pypi_0 pypi [conda] torch-neuronx 2.1.1.2.0.1b0 pypi_0 pypi [conda] torch-xla 2.1.1 pypi_0 pypi [conda] torchserve 0.9.0 pypi_0 pypi [conda] torchvision 0.16.2 pypi_0 pypi [conda] triton 2.1.0 pypi_0 pypiROCM Version: Could not collect Neuron SDK Version: (0, 'instance-type: inf2.xlarge\ninstance-id: i-072ac184a3a22e2b5\n+--------+--------+--------+---------+\n| NEURON | NEURON | NEURON | PCI |\n| DEVICE | CORES | MEMORY | BDF |\n+--------+--------+--------+---------+\n| 0 | 2 | 32 GB | 00:1f.0 |\n+--------+--------+--------+---------+', '') vLLM Version: 0.4.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: Could not collect

🐛 Describe the bug

I built the docker image as follows:

git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f ./Dockerfile.neuron . -t vllm:0.4.2-neuron

Then ran it with

docker run -ti --device=/dev/neuron0 -e...
python3 -m vllm.entrypoints.openai.api_server
                --model=meta-llama/Meta-Llama-3-8B-Instruct 
                --device=neuron 
                --tensor-parallel-size=2 
                --gpu-memory-utilization=0.9 
                --enforce-eager
Logs:

WARNING 05-15 15:19:38 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For multi-node inference, please install Ray with pip install ray. WARNING 05-15 15:19:39 config.py:404] Possibly too large swap space. 8.00 GiB out of the 15.31 GiB total CPU memory is allocated for the swap space. INFO 05-15 15:19:39 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. WARNING 05-15 15:19:42 utils.py:443] Pin memory is not supported on Neuron. Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]Traceback (most recent call last): File "/usr/local/bin/dockerd-entrypoint.py", line 28, in subprocess.check_call(shlex.split(" ".join(sys.argv[1:]))) File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['python3', '-m', 'vllm.entrypoints.openai.api_server', '--model=meta-llama/Meta-Llama-3-8B-Instruct', '--device=neuron', '--tensor-parallel-size=2', '--gpu-memory-utilization=0.9', '--enforce-eager']' died with <Signals.SIGKILL: 9>.

yaronr commented 5 months ago

Update: vllm 5.0 post, added ray to requirements-neuron.txt, still not working:

WARNING 06-14 08:15:40 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 06-14 08:15:45 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-14 08:15:45 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 06-14 08:15:47 config.py:623] Defaulting to use ray for distributed inference
WARNING 06-14 08:15:47 config.py:437] Possibly too large swap space. 8.00 GiB out of the 15.27 GiB total CPU memory is allocated for the swap space.
INFO 06-14 08:15:47 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-14 08:15:48 utils.py:465] Pin memory is not supported on Neuron.
Loading checkpoint shards:  25%|██▌       | 1/4 [00:37<01:51, 37.24s/it]Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 28, in <module>
    subprocess.check_call(shlex.split(" ".join(sys.argv[1:])))
  File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python3', '-m', 'vllm.entrypoints.openai.api_server', '--port=8000', '--model=meta-llama/Meta-Llama-3-8B-Instruct', '--tensor-parallel-size=2', '--disable-log-requests', '--enable-prefix-caching', '--gpu-memory-utilization=0.9']' died with <Signals.SIGKILL: 9>.

ashrafMahgoub commented 3 months ago

Ray is not required for Neuron device. I see you are attaching only one core to the container when calling docker run, for tp=2, at least 2 Neuron cores should be attached. Can you please modify the docker run command to include these two devices?

-device=/dev/neuron0 -device=/dev/neuron1

karllessard commented 3 months ago

@ashrafMahgoub Reading the AWS doc, it seems that each neuron device should have two neuron cores. In this case, requesting for a single device should be enough? With EKS, I tried requesting for 2 devices on a inf2 instance that has a single Inferentia2 chip and it failed: Could not open the nd2

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

vllm-project / vllm

[Bug]: Running vllm docker image with neuron fails #4836

Your current environment

🐛 Describe the bug