vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM is unable to load Mistral on Inferentia and AWS neuron, likely memory issue. #6452

servient-ashwin opened this issue 3 months ago

servient-ashwin commented 3 months ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
WARNING 07-15 19:13:04 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1020-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R13 Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           5299.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           8 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] torch-neuronx==2.1.2.2.2.0
[pip3] torch-xla==2.1.3
[pip3] torchvision==0.16.2
[pip3] transformers==4.42.4
[pip3] transformers-neuronx==0.11.351
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version:
instance-type: inf2.xlarge
instance-id: i-01bbe72eda7217750
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Found some help from https://github.com/vllm-project/vllm/issues/6269#issuecomment-2229220837 for installation.

🐛 Describe the bug

I am trying to set up an AWS Inferentia instance following the vLLM guide here and the patchwork from the AWS Neuron docs,

using this command:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3  --download-dir /tmp/ --port 8006 --tensor-parallel-size 1 --gpu-memory-utilization 1

However, model loading seems to get stuck at precisely 33% every time I try to load the model:

(aws_neuron_venv_pytorch) root@testlangmodel:~/vllm#
WARNING 07-15 20:44:24 _custom_ops.py:11] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 07-15 20:44:30 api_server.py:177] vLLM API server version 0.5.0
INFO 07-15 20:44:30 api_server.py:178] args: Namespace(host='testlangmodel', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mistral-7B-Instruct-v0.3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/tmp/', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=1.0, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-15 20:44:33 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/tmp/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3)
WARNING 07-15 20:44:34 utils.py:456] Pin memory is not supported on Neuron.
Loading checkpoint shards:  33%|████████████▎                        | 1/3 [00:37<01:14, 37.27s/it]

and it stays there. The Python process that spins this up stalls after about 23.5 GB of memory usage, and the machine itself freezes after that. I have not tried any other models yet.
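
For a rough sense of scale, here is a back-of-envelope host-memory estimate (the parameter count and the 2x loading overhead are my own approximations, not measured values); it is at least consistent with a freeze partway through the shards on an inf2.xlarge, which AWS lists with 16 GiB of host RAM:

```python
# Back-of-envelope host-memory estimate for loading Mistral-7B in bf16.
# All numbers are rough illustrations, not measured values.

PARAMS = 7.25e9         # approximate Mistral-7B parameter count
BYTES_PER_PARAM = 2     # bfloat16

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"bf16 weights: ~{weights_gib:.1f} GiB")               # ~13.5 GiB

# Deserializing checkpoint shards can transiently hold both the
# serialized bytes and the deserialized tensors, so peak host usage
# can approach twice the weight size:
print(f"possible peak while loading: ~{2 * weights_gib:.1f} GiB")  # ~27 GiB
```

Either figure is well past 16 GiB of host RAM, which would explain the stall at shard 1/3 and the ~23.5 GB of process usage once swap gets involved.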

servient-ashwin commented 3 months ago

CPU usage peaks at 31% before the freeze, and that figure is consistent every time model loading freezes. I am unable to debug further because of this.

servient-ashwin commented 3 months ago

This has certainly started to feel like a memory issue. People are hosting smaller models on 48xlarge Inferentia instances, and that could be the issue here. Lowering the precision doesn't help either.

servient-ashwin commented 3 months ago

Changing the instance type to 48xlarge, which is a significant jump, does help in loading the model (this almost cancels out the cost advantage of Inferentia inferencing, unless you are thinking really long term). However, here is another issue, also related to memory:

INFO 07-17 19:35:07 api_server.py:178] args: Namespace(host='langmodel1', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mistral-7B-Instruct-v0.3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/tmp/', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-17 19:35:11 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/tmp/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3)
tokenizer_config.json: 100%|████████████████████████████████████| 138k/138k [00:00<00:00, 2.90MB/s]
WARNING 07-17 19:35:12 utils.py:456] Pin memory is not supported on Neuron.
Downloading shards: 100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 37.79it/s]
Loading checkpoint shards:   0%|                                             | 0/3 [00:00<?, ?it/s]
(aws_neuron_venv_pytorch) root@testlangmodel:~#
(aws_neuron_venv_pytorch) root@testlangmodel:~# python -m vllm.entrypoints.openai.api_server --mode
Loading checkpoint shards: 100%|█████████████████████████████████████| 3/3 [01:49<00:00, 36.65s/it]
Traceback (most recent call last):
  File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/torch/serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/0: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 395, in from_engine_args
    engine = cls(
  File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "

The only logical next step is to abandon vLLM and use the Hugging Face optimum-neuron library [here](https://huggingface.co/docs/optimum-neuron/guides/export_model) and here, following the guide in the second link to run inference. However, those are steep paths to getting a model working, with a real possibility that things could change, given that the AWS Neuron documentation periodically mentions features being in beta.

However, it feels too early to switch to an entirely different library just to make something fit. If anyone has inputs or past experience getting vLLM to work with Neuron, they are much appreciated.

liangfu commented 3 months ago

Each NeuronCore contains 16 GiB of HBM, and setting --tensor-parallel-size 1 limits inference to a single NeuronCore's memory.

It would be relatively easy to start from small models and gradually scale up to larger models and longer sequence lengths. Are you able to start from TinyLlama 1.1B on TP=2 (with 32 GB of HBM)?
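
A minimal offline sketch along those lines, adapted from vLLM's Neuron example (the sequence length and batch size here are illustrative choices, not settings from this thread); note that `device="neuron"` is passed explicitly, since the logs above show `device_config=cpu`:

```python
# Minimal sketch for bringing up a small model on Neuron, adapted from
# vLLM's offline Neuron example. Sizes are illustrative; scale up once
# this loads and generates successfully.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device="neuron",         # force the Neuron backend explicitly
    tensor_parallel_size=2,  # both NeuronCores on inf2.xlarge (2 x 16 GiB)
    max_num_seqs=8,
    # Per the vLLM Neuron example, max_model_len and block_size must both
    # equal the maximum sequence length when targeting the Neuron device.
    max_model_len=2048,
    block_size=2048,
)

out = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

If this loads and generates, scaling the model size and max_model_len up one step at a time should narrow down where the limit is.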

servient-ashwin commented 3 months ago

Fair point. I'll try that, @liangfu. Thanks for the suggestion. Appreciate it.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!