vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Phi-3-small-128k-instruct on 4 T4 GPUs - Memory error: Tried to allocate 1024.00 GiB #7590

Open jgen1 opened 3 months ago

jgen1 commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Wolfi (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.39

Python version: 3.11.9 (tags/v3.11.9-0-gde54cf5-dirty:de54cf5, Aug 8 2024, 11:36:54) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.10.199-190.747.amzn2.x86_64-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
/bin/sh: lscpu: not found

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PHB   PHB   PHB   0-47          N/A            N/A
GPU1  PHB    X    PHB   PHB   0-47          N/A            N/A
GPU2  PHB   PHB    X    PHB   0-47          N/A            N/A
GPU3  PHB   PHB   PHB    X    0-47          N/A            N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I am trying to deploy a Phi-3 model with vLLM. Phi-3-small-128k-instruct is ~15 GB at 7B parameters, so it should easily fit on my 4 Tesla T4 GPUs, which have 16 GB of memory each. However, I am getting an out-of-memory error while the model loads that makes no sense: it says it tried to allocate 1024 GiB, which should not be possible since the model itself is nowhere near that big. My parameters for this deployment are --dtype float16 and --tensor-parallel-size 4. I also had to add --trust-remote-code (other issues say this is no longer needed for Phi-3 models, but even with an up-to-date transformers package the model would not load without it).

I know this looks like a standard out-of-memory error, but the model is only ~15 GB, so why would it be trying to allocate 1024 GiB? Any guidance would be appreciated, thanks.
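For scale, here is the rough arithmetic behind my expectation that it should fit (a back-of-envelope sketch only; the parameter count is approximate):

```python
# Back-of-envelope: weight memory for a ~7B-parameter model in float16,
# sharded across 4 GPUs with tensor parallelism. Numbers are approximate.
params = 7.4e9        # Phi-3-small is roughly 7.4B parameters
bytes_per_param = 2   # float16
tp_size = 4           # --tensor-parallel-size 4

total_gib = params * bytes_per_param / 2**30
print(f"weights total: ~{total_gib:.1f} GiB, per GPU: ~{total_gib / tp_size:.1f} GiB")
# ~13.8 GiB total, ~3.4 GiB per 16 GiB T4 -- nowhere near a 1024 GiB allocation.
```

The full error trace is below.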

ERROR 08-16 15:18:01 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1740 died, exit code: -15
INFO 08-16 15:18:01 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
             ^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor
    self._run_workers("load_model",
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
    model = _initialize_model(model_config, self.load_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 154, in _initialize_model
    return model_class(config=model_config.hf_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 361, in __init__
    self.model = Phi3SmallModel(config, cache_config, quant_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 310, in __init__
    self.layers = nn.ModuleList([
                                ^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 311, in <listcomp>
    Phi3SmallDecoderLayer(config, layer_idx, cache_config,
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 261, in __init__
    self.self_attn = Phi3SmallSelfAttention(config,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 213, in __init__
    self.attn = Attention(
                ^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 84, in __init__
    self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/backends/blocksparse_attn.py", line 327, in __init__
    self.bs_attn = LocalStridedBlockSparseAttn(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/interface.py", line 60, in __init__
    self.get_attn_pattern(dtype, device))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/interface.py", line 80, in get_attn_pattern
    sparse_layout, sparse_pattern, dense_attn_mask = get_sparse_attn_mask(
                                                     ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/utils.py", line 222, in get_sparse_attn_mask
    mask_dense = torch.kron(
                 ^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/utils/_device.py", line 79, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 GiB. GPU 0 has a total capacity of 14.58 GiB of which 13.60 GiB is free. Process 37249 has 996.00 MiB memory in use. Of the allocated memory 756.12 MiB is allocated by PyTorch, and 49.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
mgoin commented 3 months ago

This may be a bug in the blocksparse_attention backend that is specific to Phi-3-small. Could you try the larger Phi-3-medium-128k-instruct to confirm that the issue is specific to that model?

jgen1 commented 3 months ago

@mgoin I couldn't get Phi-3-medium-128k-instruct working on my hardware either, but I think that is simply because the model just barely does not fit once tensor-parallel-size is reduced to 2 (more on that below):

"Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 14.58 GiB of which 1.18 GiB is free."

I will double-check this, but I believe I had to reduce tensor-parallel-size to 2 for Phi-3-medium-128k-instruct because I was getting an assertion error on assert self.total_num_kv_heads % tp_size == 0. It looks like that is because the number of KV heads is not divisible by 4, according to #1581.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 14.58 GiB of which 1.18 GiB is free. Process 26078 has 13.40 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 31.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (
https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Regardless, Phi-3-medium-128k-instruct does not hit this same strange error about trying to allocate 1024 GiB.
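For reference, a quick way to sanity-check that assertion before switching tensor-parallel sizes is to read the head counts from the model's config.json (a rough sketch; the local path is an assumption, and the field names follow the standard Hugging Face config layout):

```python
import json

# Sketch: check which tensor-parallel sizes the KV-head count allows.
# Assumes the model has been downloaded locally next to this script.
with open("Phi-3-medium-128k-instruct/config.json") as f:
    cfg = json.load(f)

num_kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
for tp_size in (1, 2, 4):
    divisible = num_kv_heads % tp_size == 0
    print(f"tp_size={tp_size}: {'ok' if divisible else 'fails the kv-head assert'}")
```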

linxihui commented 3 months ago

@jgen1 Could you change this line:

https://github.com/vllm-project/vllm/blob/832163b8754efe2c6d74fbecb6a87a4119410db4/vllm/attention/ops/blocksparse_attention/interface.py#L11-L12

to

IS_COMPUTE_8_OR_ABOVE = (torch.cuda.is_available()
                         and current_platform.get_device_capability() >= (7, 5))

and give it a try? I do not have a T4 GPU to test with.

The reason you hit this issue is that Phi-3-small uses blocksparse attention instead of normal attention. On T4 or newer GPUs (compute capability 7.5 or above), the Triton kernel works; below 7.5, it falls back to SDPA plus a dense attention mask, which is huge.

The code has a bug: it checks whether the compute capability is 8.0 or above instead of 7.5 (T4) or above.
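For what it's worth, the reported number would line up with that fallback (rough arithmetic only; I have not traced the exact tensor shapes on your setup):

```python
# Rough arithmetic: size of dense attention masks at a 128k context length.
seq_len = 131072                 # 128k context window
bytes_per_elem = 2               # float16
one_mask_gib = seq_len**2 * bytes_per_elem / 2**30
print(one_mask_gib)              # 32.0 GiB for a single (seq_len x seq_len) mask
print(one_mask_gib * 32)         # 1024.0 GiB if one such mask is built per attention head (Phi-3-small has 32)
```

Either way, anything that materializes dense (seq_len x seq_len) masks at 128k context cannot fit on a 16 GiB T4, which is why the Triton block-sparse path matters here.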

Let me know whether this fixes the problem before I make a PR.

jgen1 commented 2 months ago

@linxihui I made this change and I am now getting a new error, also related to blocksparse attention:

ERROR 08-19 13:54:24 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 456 died, exit code: -15
INFO 08-19 13:54:24 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
             ^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
    self._initialize_kv_caches()
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 418, in forward
    output_hidden_states = self.model(
                           ^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 338, in forward
    hidden_states = layer(
                    ^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 282, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/model_executor/models/phi3_small.py", line 244, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata=attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/backends/blocksparse_attn.py", line 402, in forward
    output = self.bs_attn(
             ^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/interface.py", line 234, in forward
    return self.varlen_attn(q,
           ^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/interface.py", line 132, in varlen_attn
    return blocksparse_flash_attn_varlen_fwd(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nonroot/.local/lib/python3.11/site-packages/vllm/attention/ops/blocksparse_attention/blocksparse_attention_kernel.py", line 23, in blocksparse_flash_attn_varlen_fwd
    batch_size = cu_seqlens_k.size(0) - 1
                 ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'size'
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
linxihui commented 2 months ago

@jgen1 Thanks for giving it a shot. It is weird that seq_start_loc is None. Did you enable prefix caching? The blocksparse attention in vLLM currently does not support it; we are working on that, so you may need to disable prefix caching for now.
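For reference, prefix caching is opt-in (--enable-prefix-caching on the server, enable_prefix_caching=True in the Python API), so unless you pass that flag it should already be off. If you want to rule it out explicitly, here is a minimal sketch using the offline API (model path taken from your report; the other values are assumed to mirror your server flags):

```python
from vllm import LLM

# Minimal sketch mirroring the server flags from this thread,
# with prefix caching pinned off explicitly (off is also the default).
llm = LLM(
    model="Phi-3-small-128k-instruct",  # local path from the report
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=4,
    enforce_eager=True,
    enable_prefix_caching=False,
)
```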

jgen1 commented 2 months ago

@linxihui I don't believe so... this is my command to start the server:

python3 -m vllm.entrypoints.openai.api_server --model Phi-3-small-128k-instruct --chat-template "${CHAT_TEMPLATE}"  --quantization None --dtype float16 --enforce-eager --tensor-parallel-size 4 --max-model-len ${MAX_MODEL_LEN}

where MAX_MODEL_LEN=$(jq -r ".max_position_embeddings" $MODEL_NAME/config.json) and CHAT_TEMPLATE=$(jq -r ".chat_template" $MODEL_NAME/tokenizer_config.json).
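One thing I can double-check on my side is that the value fed to --max-model-len resolves as expected, since jq -r prints the literal string "null" when a key is missing (a quick sketch; the local path is assumed):

```python
import json

# Quick check that the jq extraction above will see the expected value.
with open("Phi-3-small-128k-instruct/config.json") as f:
    cfg = json.load(f)
print(cfg.get("max_position_embeddings"))  # expect 131072 for the 128k variant
```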