vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: not support for mistralai/Mistral-7B-Instruct-v0.3 #5061

Closed yananchen1989 closed 3 months ago

yananchen1989 commented 3 months ago

Your current environment

vllm version: 0.4.2

CUDA_VISIBLE_DEVICES=6 python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --dtype auto --api-key "yyy" --port 1703

Error message:

INFO 05-26 20:11:31 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3)
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
INFO 05-26 20:11:32 utils.py:660] Found nccl from library /home/chenyanan/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-26 20:11:35 selector.py:27] Using FlashAttention-2 backend.
rank0:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 05-26 20:11:38 weight_utils.py:199] Using model weights format '*.safetensors'
Traceback (most recent call last):
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
rank0:     return _run_code(code, main_globals, None,
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/runpy.py", line 87, in _run_code
rank0:     exec(code, run_globals)
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
rank0:     engine = AsyncLLMEngine.from_engine_args(
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
rank0:     self.model_executor = executor_class(
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 41, in __init__
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/worker/worker.py", line 118, in load_model
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 164, in load_model
rank0:     self.model = get_model(
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
rank0:     return loader.load_model(model_config=model_config,
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 224, in load_model
rank0:   File "/home/chenyanan/anaconda3/envs/tp/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 415, in load_weights
rank0:     param = params_dict[name]

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

jasonacox commented 3 months ago

Did you pull the latest, or is that the 0.4.2 tag? The vLLM 0.4.2 build doesn't work with Mistral-7B-Instruct-v0.3. See PR https://github.com/vllm-project/vllm/pull/5005 (commit 91977095).
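
A quick way to check which build you are running (a minimal sketch; it assumes vllm is importable in your active environment):

# Print the installed vLLM version; anything at or below 0.4.2 predates the Mistral v0.3 fix
python -c "import vllm; print(vllm.__version__)"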

I pulled the latest (commit 8e192ff9) and built vLLM. It works fine with Mistral-7B-Instruct-v0.3. Here are the steps I used to build and run a Docker image:

# Download source and pin to hash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 8e192ff9

# Build Docker container
DOCKER_BUILDKIT=1 docker build . -f Dockerfile --target vllm-openai --tag vllm-src

# Run vLLM
docker run -d --gpus all \
    -v $PWD/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=xyz" \
    -p 8008:8000 \
    --restart unless-stopped \
    --name vllm \
    vllm-src \
    --host 0.0.0.0 \
    --model=mistralai/Mistral-7B-Instruct-v0.3 \
    --gpu-memory-utilization 0.95
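
Once the container is up, a quick smoke test against the OpenAI-compatible API (a sketch; host port 8008 comes from the -p mapping above, and no --api-key is set in this run, so no auth header is needed):

# List the served models, then send a small chat completion request
curl http://localhost:8008/v1/models
curl http://localhost:8008/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3",
         "messages": [{"role": "user", "content": "Say hello."}],
         "max_tokens": 32}'
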
robertgshaw2-neuralmagic commented 3 months ago

Yes, this is fixed on current main and will be part of the upcoming release this week.
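
Until that release ships, one way to pick up the fix is to install directly from main; afterwards a plain upgrade should be enough (a sketch; building from source needs a CUDA toolchain and can take a while):

# Option 1: build and install current main from source
pip install git+https://github.com/vllm-project/vllm.git

# Option 2: once the new release is on PyPI, upgrade the package
pip install --upgrade vllm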

yananchen1989 commented 3 months ago

Tested, it works. Thanks.

UmutAlihan commented 3 months ago

In an era where speed is the ultimate currency, I love how the free open-source community moves fast enough to outpace multi-trillion-dollar organizations. Cheers to all the contributors!