vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[New Model]: mistralai/Codestral-22B-v0.1 #5318

Closed · eduardozamudio closed this 3 months ago

eduardozamudio commented 3 months ago

The model to consider.

Hi. Could you add support for mistralai/Codestral-22B-v0.1?

Thanks!

The closest model vllm already supports.

https://huggingface.co/meta-llama/CodeLlama-7b-hf https://huggingface.co/mistralai/Mistral-7B-v0.3

What's your difficulty of supporting the model you want?

Can't load Codestral-22B-v0.1 using the OpenAI-compatible API server.

ORG="mistralai"
MODEL="Codestral-22B-v0.1"
API_KEY=XXXXXXXXXXXXXXXXXXXXXX

python -m vllm.entrypoints.openai.api_server \
       --tokenizer $ORG/$MODEL \
       --model $ORG/$MODEL \
       --served-model-name $MODEL \
       --tensor-parallel-size 4 \
       --gpu-memory-utilization 0.9 \
       --max-model-len 4096 \
       --enforce-eager \
       --api-key $API_KEY 
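
(For reference, once the server comes up it should behave like any OpenAI-compatible endpoint; a minimal request sketch, assuming the default port 8000:)

curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $API_KEY" \
     -d "{\"model\": \"$MODEL\", \"prompt\": \"def fibonacci(n):\", \"max_tokens\": 64}"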
mgoin commented 3 months ago

@eduardozamudio can you please share the version of vLLM used and the error of why the model won't load? It is a MistralForCausalLM, so I would expect it to run as Mistral models do.
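
If it helps, one quick way to check the installed version (assuming a standard pip install):

python -c "import vllm; print(vllm.__version__)"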

martinezmatias commented 3 months ago

Hi @eduardozamudio (Hola Eduardo!)

mistralai/Codestral-22B-v0.1 worked for me using vllm 0.4.3.

Regards Matias

getorca commented 3 months ago

> Hi @eduardozamudio (Hola Eduardo!)
>
> mistralai/Codestral-22B-v0.1 worked for me using vllm 0.4.3.
>
> Regards Matias

Does it work for "fill in the middle" (https://huggingface.co/mistralai/Codestral-22B-v0.1#fill-in-the-middle-fim)? I imagine there would be some work required to support the prefix and suffix params in both the REST API and the core APIs...

I haven't dug into the https://github.com/mistralai/mistral-inference code yet, but I think it just uses special tokens to mark the prefix, suffix and middle, so it can probably be implemented outside of vllm by formatting the prompt yourself and passing it in as normal input...
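
Rough, untested sketch of what I mean against the completions route on a local server (default port 8000). The control tokens and their order ([SUFFIX] before [PREFIX], with the model generating the middle after the prefix) are just my reading of the mistral_common tokenizer, so treat them as an assumption to verify:

curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "Codestral-22B-v0.1",
           "prompt": "[SUFFIX]    return a + b\n[PREFIX]def add(a, b):\n",
           "max_tokens": 64,
           "temperature": 0
         }'

(That also assumes the HF tokenizer maps the [PREFIX]/[SUFFIX] strings in the prompt to the corresponding special tokens.)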

eduardozamudio commented 3 months ago

> @eduardozamudio can you please share the version of vLLM used and the error of why the model won't load? It is a MistralForCausalLM, so I would expect it to run as Mistral models do.

I've updated to v0.4.3 and I'm still getting the error.

eduardozamudio commented 3 months ago

> The model to consider.
>
> Hi. Could you add support for mistralai/Codestral-22B-v0.1?
>
> Thanks!
>
> The closest model vllm already supports.
>
> https://huggingface.co/meta-llama/CodeLlama-7b-hf https://huggingface.co/mistralai/Mistral-7B-v0.3
>
> What's your difficulty of supporting the model you want?
>
> Can't load Codestral-22B-v0.1 using the OpenAI-compatible API server.
>
> ORG="mistralai"
> MODEL="Codestral-22B-v0.1"
> API_KEY=XXXXXXXXXXXXXXXXXXXXXX
>
> python -m vllm.entrypoints.openai.api_server \
>        --tokenizer $ORG/$MODEL \
>        --model $ORG/$MODEL \
>        --served-model-name $MODEL \
>        --tensor-parallel-size 4 \
>        --gpu-memory-utilization 0.9 \
>        --max-model-len 4096 \
>        --enforce-eager \
>        --api-key $API_KEY

Here is the output. Could it be a dependency problem?

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/engine/async_llm_engine.py", line 395, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/engine/async_llm_engine.py", line 470, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/engine/llm_engine.py", line 235, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/engine/llm_engine.py", line 312, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]:     raise e
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/worker/worker.py", line 154, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/worker/model_runner.py", line 833, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/worker/model_runner.py", line 738, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/model_executor/models/llama.py", line 371, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/model_executor/models/llama.py", line 288, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/model_executor/models/llama.py", line 223, in forward
[rank0]:     hidden_states = self.input_layernorm(hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/model_executor/custom_op.py", line 13, in forward
[rank0]:     return self._forward_method(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/model_executor/layers/layernorm.py", line 62, in forward_cuda
[rank0]:     ops.rms_norm(
[rank0]:   File "/home/jovyan/ezamudio/vllm/vllm/_custom_ops.py", line 132, in rms_norm
[rank0]:     torch.ops._C.rms_norm(out, input, weight, epsilon)
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/_ops.py", line 921, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
mgoin commented 3 months ago

@eduardozamudio it seems like you have built vLLM from source, so it is possible the compiled C extensions in your environment were not built correctly (the missing torch.ops._C.rms_norm op points in that direction). Could you try installing the pre-built package from PyPI to confirm you don't see this issue on the actual release?
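
For example, something like this in a fresh environment (pinning the 0.4.3 release mentioned above):

pip uninstall -y vllm
pip install vllm==0.4.3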

eduardozamudio commented 3 months ago

> @eduardozamudio it seems like you have built vLLM from source, so it is possible the compiled C extensions in your environment were not built correctly (the missing torch.ops._C.rms_norm op points in that direction). Could you try installing the pre-built package from PyPI to confirm you don't see this issue on the actual release?

Excellent!

I can confirm the issue is gone when using the pre-built package.

Thanks @mgoin!