vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Error while loading Mixtral Instruct #2174

Closed rakesgi2022 closed 7 months ago

rakesgi2022 commented 11 months ago

Hello,

I get this error when I try to load the mistralai/Mixtral-8x7B-Instruct-v0.1 model in the latest container with 2 A100s... Is it related to the Hugging Face token? I've run out of ideas!

```
(RayWorkerVllm pid=1334) model-00009-of-00019.safetensors: 100%|██████████| 4.98G/4.98G [04:28<00:00, 18.6MB/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/workspace/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ChunkedEncodingError): ray::RayWorkerVllm.execute_method() (pid=1334, actor_id=0eaf3e9a36f1ca115ba4106101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f477f445840>)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 833, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
urllib3.exceptions.IncompleteRead: IncompleteRead(2720359157 bytes read, 2172450427 more expected)
```

The above exception was the direct cause of the following exception:

```
ray::RayWorkerVllm.execute_method() (pid=1334, actor_id=0eaf3e9a36f1ca115ba4106101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f477f445840>)
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 934, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 905, in read
    data = self._raw_read(amt)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 811, in _raw_read
    with self._error_catcher():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 729, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(2720359157 bytes read, 2172450427 more expected)', IncompleteRead(2720359157 bytes read, 2172450427 more expected))
```

During handling of the above exception, another exception occurred:

```
ray::RayWorkerVllm.execute_method() (pid=1334, actor_id=0eaf3e9a36f1ca115ba4106101000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f477f445840>)
  File "/workspace/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/workspace/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/workspace/vllm/model_executor/models/mixtral.py", line 407, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
  File "/workspace/vllm/model_executor/weight_utils.py", line 198, in hf_model_weights_iterator
    hf_folder, hf_weights_files, use_safetensors = prepare_hf_model_weights(
  File "/workspace/vllm/model_executor/weight_utils.py", line 155, in prepare_hf_model_weights
    hf_folder = snapshot_download(model_name_or_path,
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/_snapshot_download.py", line 238, in snapshot_download
    thread_map(
  File "/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1170, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/_snapshot_download.py", line 213, in _inner_hf_hub_download
    return hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1461, in hf_hub_download
    http_get(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 541, in http_get
    for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(2720359157 bytes read, 2172450427 more expected)', IncompleteRead(2720359157 bytes read, 2172450427 more expected))
```

Thanks for your help!

rakesgi2022 commented 11 months ago

My docker-compose:

```yaml
version: '3'
services:
  vllm-openai:
    privileged: true
    image: vllm/vllm-openai:latest
    environment:
      - HUGGING_FACE_HUB_TOKEN=
    ports:
      - "8000:8000"
    ipc: host
    volumes:
      - /var/lib/vllm/cache/huggingface:/root/.cache/huggingface
    command: ["--chat-template", "vllm.entrypoints.openai.api_server", "--model", "mistralai/Mixtral-8x7B-Instruct-v0.1", "--dtype", "half", "--gpu-memory-utilization", "1", "--load-format", "safetensors", "--tensor-parallel-size", "2", "--worker-use-ray"]
    # environment:
    #   - NVIDIA_VISIBLE_DEVICES=all
    #   - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    runtime: nvidia
    deploy:
      resources:
        # limits:
        #   memory: 15g
        # reservations:
        #   memory: 2g
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
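(Editor's note: since the traceback shows the crash happening during the Hugging Face download inside a Ray worker, one way to take the flaky download out of the engine startup path is to pre-fetch the weights into the mounted cache before starting the container. A minimal sketch, assuming `huggingface_hub` is installed on the host and using the host side of the volume mount from the compose file above:)

```python
# Pre-fetch Mixtral into the cache the container mounts at /root/.cache/huggingface.
# The cache_dir path is an assumption derived from the compose volume mapping.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    cache_dir="/var/lib/vllm/cache/huggingface/hub",  # host side of the volume mount
    allow_patterns=["*.safetensors", "*.json", "*.model"],  # weights, configs, tokenizer
    token=os.environ.get("HUGGING_FACE_HUB_TOKEN"),
)
```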
bermeitinger-b commented 11 months ago

I guess the download failed. You could try removing the downloaded files and retrying.
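(Editor's note: a minimal sketch of that cleanup, assuming the standard Hugging Face hub cache layout under the volume mounted in the compose file above:)

```python
# Remove the partially downloaded snapshot so the next start re-downloads it.
# The models--org--name directory name follows the standard HF hub cache layout;
# the host path is an assumption based on the compose volume mapping.
import shutil
from pathlib import Path

cache = Path("/var/lib/vllm/cache/huggingface/hub")
repo_dir = cache / "models--mistralai--Mixtral-8x7B-Instruct-v0.1"
if repo_dir.exists():
    shutil.rmtree(repo_dir)
    print(f"removed {repo_dir}")
```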

rakesgi2022 commented 11 months ago

> I guess the download failed. You could try removing the downloaded files and retrying.

Probably. After restarting, it works, but the cause isn't very clear; it looks like network congestion.
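(Editor's note: if the failure really is transient network trouble, retrying the download with backoff, outside the engine, is one way to confirm. An illustrative sketch, assuming `huggingface_hub` is available:)

```python
# Retry the snapshot download a few times with exponential backoff to ride out
# transient connection drops like the IncompleteRead above. Illustrative only;
# already-completed files are found in the cache and not re-fetched.
import time
from huggingface_hub import snapshot_download

for attempt in range(5):
    try:
        snapshot_download("mistralai/Mixtral-8x7B-Instruct-v0.1")
        break
    except Exception as exc:  # e.g. requests.exceptions.ChunkedEncodingError
        wait = 2 ** attempt
        print(f"attempt {attempt + 1} failed ({exc!r}); retrying in {wait}s")
        time.sleep(wait)
```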