lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Fail to start vllm_worker for codellama/CodeLlama-7b-Instruct-hf on two T4 gpus #2490

Open bugzyz opened 1 year ago

bugzyz commented 1 year ago

Hi there, I'm trying to run vllm_worker for codellama/CodeLlama-7b-Instruct-hf on 2 T4 GPUs, but I encountered a ray.exceptions.RayActorError failure. Could you please provide any suggestions on this? Thanks!

version

(base) [root@5f44cb3f0202 mlserver]# python -V
Python 3.8.16
(base) [root@5f44cb3f0202 mlserver]# pip list | grep vllm
vllm                         0.1.7
(base) [root@5f44cb3f0202 mlserver]# pip list | grep fschat
fschat                       0.2.29
(base) [root@5f44cb3f0202 mlserver]# nvidia-smi
Thu Sep 28 09:45:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8    10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

command

python3 -m fastchat.serve.vllm_worker --model-path codellama_model_and_tokenizer --model-names CodeLlama-7b-Instruct-hf --dtype float --num-gpus 2

error log

(base) [root@5f44cb3f0202 mlserver]# python3 -m fastchat.serve.vllm_worker --model-path codellama_model_and_tokenizer --model-names CodeLlama-7b-Instruct-hf --dtype float --num-gpus 2
2023-09-28 09:22:00 | INFO | root | Failed to detect number of TPUs: [Errno 2] No such file or directory: '/dev/vfio'
2023-09-28 09:22:00,785 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 46964736 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-09-28 09:22:01,935 INFO worker.py:1642 -- Started a local Ray instance.
INFO 09-28 09:22:03 llm_engine.py:72] Initializing an LLM engine with config: model='codellama_model_and_tokenizer', tokenizer='codellama_model_and_tokenizer', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float32, download_dir=None, load_format=auto, tensor_parallel_size=2, seed=0)
INFO 09-28 09:22:03 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2023-09-28 09:22:07 | ERROR | stderr | (RayWorker pid=3345) [W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:49345 (errno: 97 - Address family not supported by protocol).
2023-09-28 09:22:07 | ERROR | stderr | (RayWorker pid=3345) [W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:172.17.0.3]:49345 (errno: 97 - Address family not supported by protocol).
2023-09-28 09:22:07 | ERROR | stderr | (RayWorker pid=3345) [W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:172.17.0.3]:49345 (errno: 97 - Address family not supported by protocol).
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) *** SIGBUS received at time=1695892954 on cpu 11 ***
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) PC: @     0x7f6f9d8ca18a  (unknown)  __memset_avx2_unaligned_erms
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345)     @     0x7f6f9d864df0       1536  (unknown)
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345)     @ 0x6a302d6c63636e2f  (unknown)  (unknown)
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) [2023-09-28 09:22:34,865 E 3345 3539] logging.cc:361: *** SIGBUS received at time=1695892954 on cpu 11 ***
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) [2023-09-28 09:22:34,865 E 3345 3539] logging.cc:361: PC: @     0x7f6f9d8ca18a  (unknown)  __memset_avx2_unaligned_erms
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) [2023-09-28 09:22:34,868 E 3345 3539] logging.cc:361:     @     0x7f6f9d864df0       1536  (unknown)
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) [2023-09-28 09:22:34,872 E 3345 3539] logging.cc:361:     @ 0x6a302d6c63636e2f  (unknown)  (unknown)
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345) Fatal Python error: Bus error
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3345)
2023-09-28 09:22:34 | ERROR | stderr | (RayWorker pid=3346) [W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [::ffff:172.17.0.3]:49345 (errno: 97 - Address family not supported by protocol). [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
2023-09-28 09:22:35,188 WARNING worker.py:2058 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff16efa8446e4ab2f0dbe145f501000000 Worker ID: bcc3e923133d9ab28d3594e98900db33c927521295a59d60c1777904 Node ID: b64ebef98b0a0e78beb8320c9368b98da41d13f65cd43583ba33e19a Worker IP address: 172.17.0.3 Worker port: 36907 Worker PID: 3346 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-09-28 09:22:35 | ERROR | stderr | Traceback (most recent call last):
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
2023-09-28 09:22:35 | ERROR | stderr |     return _run_code(code, main_globals, None,
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
2023-09-28 09:22:35 | ERROR | stderr |     exec(code, run_globals)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/fastchat/serve/vllm_worker.py", line 215, in <module>
2023-09-28 09:22:35 | ERROR | stderr |     engine = AsyncLLMEngine.from_engine_args(engine_args)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 442, in from_engine_args
2023-09-28 09:22:35 | ERROR | stderr |     engine = cls(engine_args.worker_use_ray,
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 250, in __init__
2023-09-28 09:22:35 | ERROR | stderr |     self.engine = self._init_engine(*args, **kwargs)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 279, in _init_engine
2023-09-28 09:22:35 | ERROR | stderr |     return engine_class(*args, **kwargs)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 105, in __init__
2023-09-28 09:22:35 | ERROR | stderr |     self._init_cache()
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 185, in _init_cache
2023-09-28 09:22:35 | ERROR | stderr |     num_blocks = self._run_workers(
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 682, in _run_workers
2023-09-28 09:22:35 | ERROR | stderr |     all_outputs = ray.get(all_outputs)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
2023-09-28 09:22:35 | ERROR | stderr |     return fn(*args, **kwargs)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2023-09-28 09:22:35 | ERROR | stderr |     return func(*args, **kwargs)
2023-09-28 09:22:35 | ERROR | stderr |   File "/opt/conda/lib/python3.8/site-packages/ray/_private/worker.py", line 2549, in get
2023-09-28 09:22:35 | ERROR | stderr |     raise value
2023-09-28 09:22:35 | ERROR | stderr | ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2023-09-28 09:22:35 | ERROR | stderr |  class_name: RayWorker
2023-09-28 09:22:35 | ERROR | stderr |  actor_id: 16efa8446e4ab2f0dbe145f501000000
2023-09-28 09:22:35 | ERROR | stderr |  pid: 3346
2023-09-28 09:22:35 | ERROR | stderr |  namespace: d2d11456-7c59-4958-9848-78a47efd36d8
2023-09-28 09:22:35 | ERROR | stderr |  ip: 172.17.0.3
2023-09-28 09:22:35 | ERROR | stderr | The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) *** SIGBUS received at time=1695892954 on cpu 5 ***
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) PC: @     0x7f8e0b4a818a  (unknown)  __memset_avx2_unaligned_erms
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346)     @     0x7f8e0b442df0       1536  (unknown)
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346)     @ 0x726e2d6c63636e2f  (unknown)  (unknown)
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) [2023-09-28 09:22:34,865 E 3346 3540] logging.cc:361: *** SIGBUS received at time=1695892954 on cpu 5 ***
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) [2023-09-28 09:22:34,865 E 3346 3540] logging.cc:361: PC: @     0x7f8e0b4a818a  (unknown)  __memset_avx2_unaligned_erms
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) [2023-09-28 09:22:34,868 E 3346 3540] logging.cc:361:     @     0x7f8e0b442df0       1536  (unknown)
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) [2023-09-28 09:22:34,872 E 3346 3540] logging.cc:361:     @ 0x726e2d6c63636e2f  (unknown)  (unknown)
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346) Fatal Python error: Bus error
2023-09-28 09:22:35 | ERROR | stderr | (RayWorker pid=3346)
2023-09-28 09:22:37 | INFO | stdout |
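Note: the Ray warning near the top of the log reports that /dev/shm has only about 47 MB available, and both RayWorker processes later die with SIGBUS, which is consistent with the shared-memory segment filling up. If this is running inside a Docker container (the hostname in the prompt looks like a container ID), one thing to try, following the log's own suggestion, is relaunching the container with a larger /dev/shm. A minimal sketch, with the image name and any other flags as placeholders:

# hypothetical relaunch; replace <image> with the actual image and re-add your usual mounts/flags
docker run --gpus all --shm-size=10.24gb -it <image>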
harshitpatni commented 1 year ago

Have you tried with 1 GPU? It worked for me when serving Llama-2-7b-chat-hf, but I ran into the same issue when trying to serve the 70B model on multiple GPUs.

bugzyz commented 1 year ago

> Have you tried with 1 GPU? It worked for me when serving Llama-2-7b-chat-hf, but I ran into the same issue when trying to serve the 70B model on multiple GPUs.

Yes, 1 GPU works fine, but 2 or 4 GPUs fail.
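For reference, a sketch of the two invocations being compared, assuming the same model path and flags as the original command (carried over unchanged):

# reported to work on a single GPU (exact dtype/settings may differ)
python3 -m fastchat.serve.vllm_worker --model-path codellama_model_and_tokenizer --model-names CodeLlama-7b-Instruct-hf --dtype float --num-gpus 1

# fails with the RayActorError above
python3 -m fastchat.serve.vllm_worker --model-path codellama_model_and_tokenizer --model-names CodeLlama-7b-Instruct-hf --dtype float --num-gpus 2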

harshitpatni commented 1 year ago

Right, I got a similar error using vLLM with multiple GPUs; I haven't had a chance to dig deeper.

surak commented 1 year ago

I realized that vLLM takes more memory than the normal worker. It's much faster, but I need more GPUs - and they need to come in powers of two :-)

A LLaMA-30B can run on 2 GPUs of 24 GB with 8-bit quantization, and on 3 GPUs without the 8-bit option, but it needs 4 GPUs for vLLM!
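For context, a rough sketch of the two launch styles being compared here, assuming FastChat's standard model_worker flags and with the model path as a placeholder:

# standard worker: LLaMA-30B in 8-bit across 2 x 24 GB GPUs
python3 -m fastchat.serve.model_worker --model-path <llama-30b-path> --num-gpus 2 --load-8bit

# vLLM worker: the same model reportedly needs 4 GPUs here
python3 -m fastchat.serve.vllm_worker --model-path <llama-30b-path> --num-gpus 4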

infwinston commented 1 year ago

Tagging the vLLM people for help: @WoosukKwon @zhuohan123