Open servient-ashwin opened 3 months ago
CPU usage peaks at 31% before the freeze, and the peak is consistent every time model loading hangs. I am unable to debug further because of this as well.
This has most certainly started to feel like a memory issue. People are hosting smaller models on 48xlarge Inferentia instances, and that could be the problem here; lowering the precision doesn't help either.
Switching the instance type to 48xlarge, which is a significant change, does help in loading the model (it almost wipes out the low-cost-inference advantage, unless you think really long term). However, here is another issue related to memory:
INFO 07-17 19:35:07 api_server.py:178] args: Namespace(host='langmodel1', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mistral-7B-Instruct-v0.3', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/tmp/', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-17 19:35:11 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/tmp/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3)
tokenizer_config.json: 100%|████████████████████████████████████| 138k/138k [00:00<00:00, 2.90MB/s]
WARNING 07-17 19:35:12 utils.py:456] Pin memory is not supported on Neuron.
Downloading shards: 100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 37.79it/s]
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
(aws_neuron_venv_pytorch) root@testlangmodel:~#
(aws_neuron_venv_pytorch) root@testlangmodel:~# python -m vllm.entrypoints.openai.api_server --modeLoading checkpoint shards: 100%|█████████████████████████████████████| 3/3 [01:49<00:00, 36.65s/it]
Traceback (most recent call last):
File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/torch/serialization.py", line 619, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/torch/serialization.py", line 853, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/0: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 395, in from_engine_args
engine = cls(
File "/root/aws_neuron_venv_pytorch/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "
The only logical next step is to abandon vLLM and use the Hugging Face optimum-neuron library [here](https://huggingface.co/docs/optimum-neuron/guides/export_model) and here, and follow the guide in the second link to run inference (roughly the compile-and-generate path sketched below). However, those are steep methods to get a model working, with the possibility that things could change, given that the AWS Neuron documentation periodically mentions being in beta.
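For reference, a minimal sketch of what that optimum-neuron path looks like, assuming the documented `NeuronModelForCausalLM` export API; the batch size, sequence length, core count, and cast type below are illustrative placeholders, not values from this issue:

```python
# Hedged sketch: compile Mistral-7B-Instruct-v0.3 for Inferentia with
# optimum-neuron, save the compiled artifacts, and run one generation.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# export=True triggers neuronx compilation; batch_size / sequence_length /
# num_cores / auto_cast_type are assumed values and must match how the
# compiled model will later be served.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=4096,
    num_cores=2,            # 2 NeuronCores ~= 32 GiB of device memory
    auto_cast_type="bf16",
)
model.save_pretrained("./mistral-7b-neuron")  # reload later without recompiling

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```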
However, I feel it is too early to switch to an entirely different library just to fit one model. If anyone has any input or past experience getting vLLM to work with Neuron, it would be much appreciated.
Each NeuronCore contains 16 GiB of HBM memory; setting `--tensor-parallel-size 1` limits the number of NeuronCores that are used for LLM inference.
It will be relatively easy to start from small models and gradually scale up to larger models and longer sequence lengths. Are you able to start from TinyLlama 1.1B on TP=2 (with 32 GB of HBM)?
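In case it helps, here is a minimal sketch of that suggestion, adapted from vLLM's Neuron offline-inference example; the `max_num_seqs`, `max_model_len`, and `block_size` values are just small smoke-test settings, not recommendations:

```python
# Hedged sketch: load TinyLlama 1.1B across two NeuronCores (TP=2) and
# generate from a single prompt to confirm the Neuron path works at all.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",          # force the Neuron backend instead of auto-detect
    tensor_parallel_size=2,   # two NeuronCores ~= 32 GiB of HBM
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```

If this works, the model size and sequence length can then be scaled up step by step until the memory limit is found.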
Fair point. I'll try that @liangfu . Thanks for the suggestion. Appreciate it.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
Found some help from https://github.com/vllm-project/vllm/issues/6269#issuecomment-2229220837 for installation.
🐛 Describe the bug
I am trying to set up an AWS Inferentia instance following the vLLM guide here and the patchwork from the AWS Neuron docs,
and using the command
however, the model loading gets stuck at precisely 33% every time I try to load the model and stays there. I have seen that the Python process that spins this up stalls after 23.5 GB of usage, and the machine itself freezes after that. I have not tried any other models yet.