vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Issue Serving mistralai/Mixtral-8x7B-v0.1 #3041

Closed JoshuaFurman closed 8 months ago

JoshuaFurman commented 8 months ago

I've been serving mistralai/Mixtral-8x7B-Instruct-v0.1 on 2x A100 80GB just fine with the following command: python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2

However, I just tried switching to mistralai/Mixtral-8x7B-v0.1 and have been running into errors.

I did the following:

After doing this, I'm now getting 404 errors from the programs I was using to interface with Mixtral-Instruct. I did manage to issue a curl request, but it seemed to run indefinitely... I eventually got one completion, but it was a nonsense response...

Am I using a bad chat_template for Mixtral, or is there something else I'm doing wrong?

Thank you!
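For reference, this is roughly the sanity check I've been running against the server (a quick sketch with the requests library, assuming the server is on localhost:8000 and serving the base model name):

import requests

BASE = "http://localhost:8000/v1"  # assumed local vLLM server

# Confirm the server is up and see exactly which model name it is serving.
print(requests.get(f"{BASE}/models").json())

# Minimal chat request; a 404 here usually means the path or model name is off.
resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-v0.1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.status_code, resp.json())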

simon-mo commented 8 months ago

Hmm, the model does come with a default chat template in its tokenizer_config.json, and I can't repro this on our official Docker container:

https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/blob/5c79a376139be989ef1838f360bf4f1f256d7aec/tokenizer_config.json#L42
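
If you want to see exactly what that template produces, something like this should do it (a rough sketch using transformers' apply_chat_template, assuming you can pull the Instruct tokenizer):

from transformers import AutoTokenizer

# The Instruct repo ships the chat template linked above in tokenizer_config.json.
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [{"role": "user", "content": "Who are you and who created you?"}]

# Render the prompt the same way the chat endpoint does before generation.
prompt = tok.apply_chat_template(messages, tokenize=False)
print(prompt)  # -> "<s>[INST] Who are you and who created you? [/INST]"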

root@7d6af8be4460:/workspace# curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
        -d '{
                "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
                "messages": [{"role": "user", "content": "Who are you and who created you?"}],
                "n": 3
        }'
{"id":"cmpl-46886726980042ceb4a3706268310c66","object":"chat.completion","created":131589,"model":"mistralai/Mixtral-8x7B-Instruct-v0.1","choices":[{"index":0,"message":{"role":"assistant","content":" I am a large language model trained by Mistral AI, a leading AI company based in Paris. I am designed to generate human-like text based on the prompts I receive. I don't have personal experiences or emotions, and I don't have the ability to create or have consciousness. I'm just a piece of software running on a computer."},"finish_reason":"stop"},{"index":2,"message":{"role":"assistant","content":" I am a large language model trained by Mistral AI, a leading AI company based in Paris. I am designed to generate human-like text based on the prompts I am given. I do not have the ability to access personal data about individuals, perform calculations, or use the internet."},"finish_reason":"stop"},{"index":1,"message":{"role":"assistant","content":" I am a large language model trained by Mistral AI, a leading AI company based in Paris. I am designed to generate human-like text based on the input I receive. I do not have personal experiences or emotions, but I can simulate a conversational experience with you. I was created by a team of talented engineers and researchers who worked on my training and development."},"finish_reason":"stop"}],"usage":{"prompt_tokens":17,"total_tokens":229,"completion_tokens":212}}
root@7d6af8be4460:/workspace# python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
INFO 02-26 21:12:17 api_server.py:229] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 720/720 [00:00<00:00, 5.02MB/s]
INFO 02-26 21:12:17 config.py:413] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-02-26 21:12:19,252 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-02-26 21:12:20,401 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-26 21:12:21 llm_engine.py:79] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.46k/1.46k [00:00<00:00, 12.6MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 493k/493k [00:00<00:00, 19.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80M/1.80M [00:00<00:00, 17.1MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72.0/72.0 [00:00<00:00, 731kB/s]
INFO 02-26 21:12:40 weight_utils.py:163] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=7023) INFO 02-26 21:12:40 weight_utils.py:163] Using model weights format ['*.safetensors']
model-00001-of-00019.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.89G/4.89G [01:36<00:00, 50.8MB/s]
INFO 02-26 21:18:24 llm_engine.py:337] # GPU blocks: 21933, # CPU blocks: 4096
INFO 02-26 21:18:26 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-26 21:18:26 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=7023) INFO 02-26 21:18:26 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=7023) INFO 02-26 21:18:26 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-26 21:18:31 model_runner.py:748] Graph capturing finished in 4 secs.
(RayWorkerVllm pid=7023) INFO 02-26 21:18:31 model_runner.py:748] Graph capturing finished in 4 secs.
INFO 02-26 21:18:31 serving_chat.py:265] Using default chat template:
INFO 02-26 21:18:31 serving_chat.py:265] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [16]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 02-26 21:18:41 metrics.py:161] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-26 21:23:50 async_llm_engine.py:433] Received request cmpl-46886726980042ceb4a3706268310c66: prompt: '<s>[INST] Who are you and who created you? [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=3, best_of=3, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32751, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 733, 16289, 28793, 6526, 460, 368, 304, 693, 3859, 368, 28804, 733, 28748, 16289, 28793], lora_request: None.
INFO 02-26 21:23:51 metrics.py:161] Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-26 21:23:52 async_llm_engine.py:110] Finished request cmpl-46886726980042ceb4a3706268310c66.
INFO:     127.0.0.1:39430 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Can you double check that?

JoshuaFurman commented 8 months ago

Hey @simon-mo, thanks for responding. I should have been a little more explicit: I have NO issues with mistralai/Mixtral-8x7B-Instruct-v0.1. My troubles are coming from mistralai/Mixtral-8x7B-v0.1, which does not look like it has a chat_template in its config.json (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).

I'm specifically trying to get mistralai/Mixtral-8x7B-v0.1 up and running.

simon-mo commented 8 months ago

Oh sorry, I misread. The base model isn't instruction-tuned, so it really can't do chat regardless of which template you use. That's also why the model runs indefinitely: it doesn't know when to output the end-of-sequence token.
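
If you do want to use the base model, hit the plain completions endpoint and set max_tokens (and/or stop strings) yourself. A rough sketch with the openai Python client, assuming the server is running on localhost:8000:

from openai import OpenAI

# Point the client at vLLM's OpenAI-compatible server; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The base model is a plain text completer: cap max_tokens and/or give stop strings
# yourself, since it will not reliably emit an end-of-sequence token on its own.
completion = client.completions.create(
    model="mistralai/Mixtral-8x7B-v0.1",
    prompt="The capital of France is",
    max_tokens=32,
    stop=["\n"],
    temperature=0.7,
)
print(completion.choices[0].text)

Chat-style behavior on the base model will still be hit or miss; the Instruct variant is the right choice for chat.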

JoshuaFurman commented 8 months ago

Ahh I see. I guess the issue is coming from my understanding, not the model, lol. I appreciate your help and patience! I'll continue with the Instruct model!