vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

v0.3.0 openai.api_server fails for Mixtral-8x7B: FileNotFoundError #2780

Open olaf-beh opened 5 months ago

olaf-beh commented 5 months ago

v0.3.0 openai.api_server fails for Mixtral-8x7B: FileNotFoundError

Description

Command line to reproduce

CUDA_VISIBLE_DEVICES=0,1 /home/ob/venvs/vllm-venv-v0.3.0/bin/python -m vllm.entrypoints.openai.api_server --host localhost --port 8206 --model '/home/ob/models/huggingface/Mixtral-8x7B-Instruct-v0.1/2024-01-17' --served-model-name "Mixtral-8x7B-Instruct-v0.1" --tensor-parallel-size 2 --gpu-memory-utilization 0.9

Error Message

(vllm-venv-v0.3.0)ob@pascal:vllm-inst$ CUDA_VISIBLE_DEVICES=0,1 /home/ob/venvs/vllm-venv-v0.3.0/bin/python -m vllm.entrypoints.openai.api_server --host localhost --port 8206 --model '/home/ob/models/huggingface/Mixtral-8x7B-Instruct-v0.1/2024-01-17' --served-model-name "Mixtral-8x7B-Instruct-v0.1" --tensor-parallel-size 2 --gpu-memory-utilization 0.9
INFO 02-06 04:40:20 api_server.py:209] args: Namespace(host='localhost', port=8206, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Mixtral-8x7B-Instruct-v0.1', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/home/ob/models/huggingface/Mixtral-8x7B-Instruct-v0.1/2024-01-17', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-02-06 04:40:23,753 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-06 04:40:34 llm_engine.py:72] Initializing an LLM engine with config: model='/home/ob/models/huggingface/Mixtral-8x7B-Instruct-v0.1/2024-01-17', tokenizer='/home/ob/models/huggingface/Mixtral-8x7B-Instruct-v0.1/2024-01-17', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 308, in _init_cache
    num_blocks = self._run_workers(
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/worker/worker.py", line 116, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 599, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 534, in execute_model
    hidden_states = model_executable(
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 347, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 319, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 283, in forward
    hidden_states = self.block_sparse_moe(hidden_states)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/models/mixtral.py", line 137, in forward
    final_hidden_states = fused_moe(hidden_states,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe.py", line 270, in fused_moe
    invoke_fused_moe_kernel(hidden_states, w1, intermediate_cache1,
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/vllm/model_executor/layers/fused_moe.py", line 187, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "<string>", line 63, in fused_moe_kernel
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/triton/compiler/compiler.py", line 425, in compile
    so_path = make_stub(name, signature, constants)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/triton/common/build.py", line 61, in _build
    cuda_lib_dirs = libcuda_dirs()
  File "/home/ob/venvs/vllm-venv-v0.3.0/lib/python3.9/site-packages/triton/common/build.py", line 21, in libcuda_dirs
    libs = subprocess.check_output(["ldconfig", "-p"]).decode()
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1823, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
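
For context on the root cause: the last frames of the traceback show that Triton's libcuda_dirs() in triton/common/build.py locates libcuda by running ldconfig -p through subprocess; when ldconfig is not resolvable on the PATH of the launching process, Popen raises exactly this FileNotFoundError. A quick way to check whether an environment is affected (a sketch, assuming a typical Linux layout where ldconfig lives in /sbin or /usr/sbin) is:

# Is ldconfig resolvable from the shell that launches vLLM?
command -v ldconfig || echo "ldconfig is NOT on PATH"
# Find where the binary actually lives on this machine (locations are assumptions, adjust as needed)
ls -l /sbin/ldconfig /usr/sbin/ldconfig 2>/dev/null

If the first command prints nothing but the second finds the binary, the directory containing it just needs to be added to PATH, as described in the comments below.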
lawlietsoul commented 5 months ago

~Same here on A100. Would you mind helping us fix this? That would be very helpful, thank you!~

Finally figured it out. As @iarbel84 pointed out, add ldconfig to the PATH and then run vLLM as usual (you may need to reinstall vllm etc.). Assuming ldconfig is located at /sbin, add it to the PATH with: export PATH=$PATH:/sbin
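
To make that change survive new shells, a minimal sketch, assuming a bash login shell and that ldconfig really is in /sbin on your machine:

echo 'export PATH=$PATH:/sbin' >> ~/.bashrc
source ~/.bashrc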

iarbel84 commented 4 months ago

Same issue here on 8xL4 (GCP G2 instance). Solved it by adding ldconfig to the PATH, but I also had to force a reinstall of vllm, triton, and ray.
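
Putting the two workarounds together, a possible recovery sequence (a sketch under the assumptions above: ldconfig in /sbin, and pip managing the same virtualenv that runs the server; versions are whatever your environment already pins):

export PATH=$PATH:/sbin
pip install --force-reinstall vllm triton ray
# then relaunch the server with the same api_server command as in the original report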

manishiitg commented 4 months ago

same issue