EshamAaqib opened this issue 1 month ago
I'm reading the config:

```
INFO 07-22 07:38:20 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='meta-llama/Meta-Llama-3-8B', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir='/workspace/.cache/hub', load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B)
```

I think the error is mostly due to the lack of PagedAttention support in vLLM for the Neuron backend at the moment; thus we require `max-model-len` to be equal to `block-size`.
- The immediate short-term workaround is to stretch `block-size` to be equal to `max_model_len`; see the sketch after this list.
- We are actively investigating an approach to implement PagedAttention and bring it to vLLM's Neuron backend.
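A minimal offline-inference sketch of that workaround, assuming vLLM's `LLM` constructor forwards `block_size`, `max_model_len`, and `device` to `EngineArgs` (the model name and sizes are taken from the config above; the exact keyword handling may vary by version). The programmatic path also sidesteps the CLI's `--block-size` choices discussed below:

```python
from vllm import LLM, SamplingParams

# Workaround sketch: with no PagedAttention on the Neuron backend,
# make each KV-cache block large enough to hold a whole sequence by
# setting block_size equal to max_model_len.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    device="neuron",           # target AWS Inferentia (inf2)
    tensor_parallel_size=4,    # matches the logged config
    max_model_len=8192,
    block_size=8192,           # block-size == max-model-len
)

params = SamplingParams(temperature=0.8, max_tokens=64)
print(llm.generate(["Hello, Neuron!"], params)[0].outputs[0].text)
```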
Thanks, I assume it's this (https://docs.vllm.ai/en/latest/models/engine_args.html):

```
--block-size
Possible choices: 8, 16, 32
Token block size for contiguous chunks of tokens.
Default: 16
```

If so, can we set it to the same value as `max-model-len`, or am I missing something? When I tried it, it failed with the following:

```
api_server.py: error: argument --block-size: invalid choice: 8192 (choose from 8, 16, 32)
```
- The short-term solution is to extend the list of `block_size` options so that we can set it to the desired size (e.g. 8192); see the sketch after this list.
- The mid-term solution is to develop PagedAttention support on the Neuron backend.
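A sketch of that short-term change, assuming the flag is declared with an argparse `choices` list in `vllm/engine/arg_utils.py` (the exact location and surrounding code are assumptions inferred from the CLI error message above):

```python
# Hypothetical sketch of the change in vllm/engine/arg_utils.py.
parser.add_argument(
    "--block-size",
    type=int,
    default=EngineArgs.block_size,
    # Before: choices=[8, 16, 32]
    # After: also allow block sizes large enough to cover max-model-len
    choices=[8, 16, 32, 2048, 4096, 8192],
    help="Token block size for contiguous chunks of tokens.",
)
```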
Thanks @liangfu, extending the `block_size` seems to work. However, `4096` and `8192` fail with memory allocation errors with 4 GPUs allocated; `2048` or anything lower than that seems to work:
```
2024-Jul-30 11:23:45.105173 1:1 ERROR TDRV:dmem_alloc_internal Failed to alloc DEVICE memory: 1073741824
2024-Jul-30 11:23:45.110278 1:1 ERROR TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_66a8cd41.csv
2024-Jul-30 11:23:45.114159 1:1 ERROR TDRV:log_dev_mem Failed to allocate 1.000GB (usage: tensors) on ND 0:NC 0, current utilization:
* total: 15.813GB
* tensors: 15.813GB
* runtime: 1.062KB
* dma rings: 32.000KB
2024-Jul-30 11:23:45.121919 1:1 ERROR TDRV:tensor_allocate Failed to allocate 1073741824 bytes on DEVICE for tensor UNKNOWN.
```
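The 1 GiB allocation that fails is consistent with a single 8192-token KV-cache block for this model. A back-of-the-envelope check, assuming Meta-Llama-3-8B's published shape (32 layers, 8 KV heads, head dim 128) and bf16 (2 bytes per element):

```python
# KV-cache bytes for one block of 8192 tokens, Meta-Llama-3-8B in bf16.
# Model shape assumed from the public config: 32 layers, 8 KV heads,
# head_dim 128; 2 tensors per layer (K and V), 2 bytes per bf16 element.
layers, kv_heads, head_dim = 32, 8, 128
kv_tensors, bytes_per_elem = 2, 2
block_tokens = 8192

block_bytes = block_tokens * layers * kv_tensors * kv_heads * head_dim * bytes_per_elem
print(block_bytes)                 # 1073741824 == exactly 1 GiB
print(block_bytes / 2**30, "GiB")  # matches the failed allocation above
```

This matches the unsharded block size (how the Neuron backend shards the cache across the 4 devices is not shown in the log). With the device already at 15.813 GB of tensors, there is no headroom for such a block, which is consistent with `2048` (a 256 MiB block) fitting while `4096` and `8192` do not.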
### Your current environment

### 🐛 Describe the bug

vLLM 0.5.0 is failing on AWS `inf2` with the following error. I have tried to run the following LLMs, but all fail with the same error.

Args used to launch vLLM -

Error -

In addition to this I tried running vLLM 0.5.2, but ran into the same issue mentioned here: https://github.com/vllm-project/vllm/issues/6269#issuecomment-2221751738