Closed: zhaotyer closed this issue 4 months ago
VLLM_TRACE_FUNCTION should not be used unless you are debugging hang/crash.
I turned it on because the service kept hanging when it started.
Then what is the last function Python executes? This should give you a hint on why it hangs.
(RayWorkerWrapper pid=4292) INFO 06-11 05:00:27 utils.py:608] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2 [repeated 2x across cluster]
INFO 06-11 05:00:59 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=4222) INFO 06-11 05:01:01 pynccl_utils.py:43] vLLM is using nccl==2.15.5
(RayWorkerWrapper pid=4086) INFO 06-11 05:00:55 selector.py:28] Using FlashAttention backend. [repeated 2x across cluster]
INFO 06-11 05:01:01 pynccl_utils.py:43] vLLM is using nccl==2.15.5
It blocks after printing some NCCL logs
You can try the latest version. I don't remember exactly when VLLM_TRACE_FUNCTION is enabled. When it is enabled, you should notice a logging message showing the trace file (which can be quite large).
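For reference, a minimal sketch of turning the flag on (the model path and tensor_parallel_size below are placeholders, not values from this thread; depending on how the Ray workers are launched, you may need to export the variable in the shell instead so the worker processes inherit it):

import os

# Must be set before vLLM (and its Ray workers) start. It records every Python
# function call, so it should only be used for debugging hangs/crashes.
os.environ["VLLM_TRACE_FUNCTION"] = "1"

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(model="/models/my-model",  # placeholder path
                              tensor_parallel_size=4)
engine = AsyncLLMEngine.from_engine_args(engine_args)
# Each worker then logs "Trace frame log is saved to ..." pointing at its own trace file.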
With vllm 0.4.1, VLLM_TRACE_FUNCTION is enabled:
(RayWorkerWrapper pid=4095) WARNING 06-11 06:20:42 logger.py:125] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(RayWorkerWrapper pid=4095) INFO 06-11 06:20:42 logger.py:129] Trace frame log is saved to /tmp/vllm/vllm-instance-8c06bc620fd54e22a4644b512a767c97/VLLM_TRACE_FUNCTION_for_process_4095_thread_140422150293312_at_2024-06-11_06:20:42.214157.log
tail -f VLLM_TRACE_FUNCTION_for_process_4095_thread_140422150293312_at_2024-06-11_06:20:42.214157.log
2024-06-11 06:33:24.763706 Call to __getitem__ in /usr/local/lib/python3.8/dist-packages/torch/storage.py:308 from safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:271
2024-06-11 06:33:24.763831 Return from __getitem__ in /usr/local/lib/python3.8/dist-packages/torch/storage.py:311 to safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:271
2024-06-11 06:33:24.764090 Return from safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:272 to load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:343
2024-06-11 06:33:24.764189 Call to __getattribute__ in /usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py:260 from load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:346
2024-06-11 06:33:24.764243 Return from __getattribute__ in /usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py:263 to load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:346
2024-06-11 06:33:24.764362 Call to weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:596 from load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:366
2024-06-11 06:33:24.764422 Call to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:222 from weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597
2024-06-11 06:33:24.764470 Call to get_tensor_model_parallel_group in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:196 from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764518 Return from get_tensor_model_parallel_group in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:200 to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764546 Call to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1512 from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764589 Call to _rank_not_in_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:747 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1529
2024-06-11 06:33:24.764617 Return from _rank_not_in_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:751 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1529
2024-06-11 06:33:24.764664 Call to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:974 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1532
2024-06-11 06:33:24.764706 Call to is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:948 from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:976
2024-06-11 06:33:24.764732 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950
2024-06-11 06:33:24.764792 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.764835 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.764858 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950
2024-06-11 06:33:24.764923 Return from is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950 to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:976
2024-06-11 06:33:24.764972 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981
2024-06-11 06:33:24.765019 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765043 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765082 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981
2024-06-11 06:33:24.765104 Return from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1532
2024-06-11 06:33:24.765147 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1533
2024-06-11 06:33:24.765188 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765212 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765250 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1533
2024-06-11 06:33:24.765319 Call to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:762 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536
2024-06-11 06:33:24.765346 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:777
2024-06-11 06:33:24.765391 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765442 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765466 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:777
2024-06-11 06:33:24.765509 Call to pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:779
2024-06-11 06:33:24.765560 Return from pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:779
2024-06-11 06:33:24.765586 Call to pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:781
2024-06-11 06:33:24.765629 Return from pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:781
2024-06-11 06:33:24.765689 Return from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:785 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536
2024-06-11 06:33:24.765733 Return from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536 to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.765758 Return from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224 to weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597
weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597
which model do you serve? what's the size? is it downloaded or not? it seems your code is still loading the model.
The model weights are actually loaded onto four cards
what's your model size? it is possible that only parts of the model are loaded, and you need to wait for it to finish loading weights.
The model I use is Qwen1.5-72B-chat. It has been downloaded locally.
72B can indeed take a long time to load. It is also possible that your disk read is slow.
I load this model with Hugging Face transformers and it just takes 4-5 minutes, so it's probably not the hard drive.
how do you load it using transformers?
the code is:

def huggingface_init(self, base_model_config: dict = {}, lora_model_config: dict = {}):
    import torch
    from transformers import AutoTokenizer, AutoModel, TextIteratorStreamer, AutoModelForCausalLM
    from vllm.config import _get_and_verify_max_len
    self._tokenizer = AutoTokenizer.from_pretrained(self.base_model_path, trust_remote_code=True)
    self._model = AutoModelForCausalLM.from_pretrained(self.base_model_path, **base_model_config)
    # When LoRA extends the tokenizer's vocab, self._tokenizer needs to be loaded from the
    # LoRA path and the base model's token_embeddings resized
    # self._tokenizer = AutoTokenizer.from_pretrained(self.lora_model_path, trust_remote_code=True)
    # self._model.resize_token_embeddings(len(self._tokenizer))
    self._model_max_length = _get_and_verify_max_len(self._model.config, None)
    # Iterate over and load the PEFT models
    for index, peft_path in enumerate(self.peft_folders):
        peft_name = os.path.basename(peft_path)
        logger.info(f"load peft model,name:{peft_name}")
        if peft_name in self.sub_models:
            error_info = f"peft:{peft_name} has been loaded, loaded model is:{self.sub_models}"
            logger.error(error_info)
            raise Exception(error_info)
        self._model.load_adapter(peft_path, adapter_name=peft_name)
        self.sub_models[peft_name] = {"path":peft_path, "index":next(self.counter)}
        self._model.set_adapter(peft_name)
        if index == 0:
            self._is_lora_flag = True
        if index == len(self.peft_folders)-1:
            self.default_adapter = peft_name
    logger.info("huggingface load model finished")

def vllm_init(self, base_model_config: dict = {}):
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from transformers import PreTrainedTokenizerBase
    self.lora_model_path = os.path.join(self.model_path, "local_model", "peft_model")
    self.vllm_model_path = self.base_model_path
    # Iterate over the PEFT models
    for index, peft_path in enumerate(self.peft_folders):
        self._is_lora_flag = True
        peft_name = os.path.basename(peft_path)
        self.sub_models[peft_name] = {"path":peft_path, "index":next(self.counter)}
        if index == len(self.peft_folders)-1:
            self.default_adapter = peft_name
    parser = argparse.ArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    args.model = self.vllm_model_path
    # Adjust according to model and GPU memory size
    args.gpu_memory_utilization = env_manager.gpu_memory_utilization
    cuda_env = env_manager.cuda_visible_devices
    if cuda_env is None:
        from torch.cuda import device_count
        args.tensor_parallel_size = device_count()
    else:
        args.tensor_parallel_size = len(cuda_env.split(",")) if cuda_env else 1
    args.trust_remote_code = base_model_config.get("trust_remote_code", False)
    # args.dtype = 'auto'
    args.enforce_eager = env_manager.enforce_eager
    args.max_log_len = 50
    args.enable_lora = self._is_lora_flag
    engine_args = AsyncEngineArgs.from_cli_args(args)
    self._model = AsyncLLMEngine.from_engine_args(engine_args)
    if isinstance(self._model.engine.tokenizer, PreTrainedTokenizerBase):
        self._tokenizer = self._model.engine.tokenizer
    else:
        self._tokenizer = self._model.engine.tokenizer.tokenizer
    engine_model_config = self._model.engine.get_model_config()
    self._model_max_length = engine_model_config.max_model_len
    # Counter to keep track of ongoing request counts
    self.ongoing_request_count = 0
    self._loop = asyncio.get_event_loop()
    self._loop_thread = Thread(
        target=self.engine_loop, args=(self._loop,)
    )
    self._shutdown_event = asyncio.Event()
    self._lock = asyncio.Lock()
    self._request_id_dict = {}
    self._loop_thread.start()
    logger.info("vllm load model finished")
does transformers load your model from disk to cpu or to gpu?
GPU. The configuration is exactly the same as vLLM.
if i remember correctly, transformers does not support tensor parallel. how can you hold the model with 72B parameters 😕
Set device_map='auto' when using AutoModelForCausalLM.from_pretrained(self.base_model_path, **base_model_config). It is able to automatically load the model onto four cards via pipeline parallelism (pp).
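For context, this is roughly what such a call looks like; a sketch only, where the model path and dtype are assumptions rather than values taken from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/Qwen1.5-72B-Chat"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# device_map="auto" lets accelerate shard the 72B model across all visible GPUs
# (pipeline-parallel style placement) instead of trying to fit it on one card.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)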
I believe the current vLLM implementation will load the model tensor_parallel times, so it is expected to take a much longer time. If you want to shorten this, you can take a look at the documentation: https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html
It should have nothing to do with this. Now the weight of each shard is only 4G.
Now the weight of each shard is only 4G.
what is this 4G?
You have a 72B model with a 144GB disk file. You use tensor_parallel_size=4, which means 4 processes will try to load the model together. In total you need to load 436GB of data from disk to memory.
well, you have 38 files, each file has about 4 GB, in total summing up to about 144GB.
You use tensor_parallel_size=4, which means 4 processes will try to load the model together. In total you need to load 436GB of data from disk to memory.
I would say slow loading in your case is expected. The documentation https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html can help to shard the weights according to tensor parallelism, so that later you only need to load the corresponding part of the weights.
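Based on the linked example, the idea is to save the weights once in their tensor-parallel layout and then start later instances with load_format="sharded_state". A hedged sketch follows; the save_sharded_state call mirrors the example script in the linked documentation, the paths are hypothetical, and the exact API may differ between vLLM versions:

from vllm import LLM

# One-time conversion: load the model with the target tensor_parallel_size and
# dump the already-sharded weights to disk (the example script also copies the
# tokenizer/config files into the output directory).
llm = LLM(model="/models/Qwen1.5-72B-Chat",  # hypothetical path
          tensor_parallel_size=4,
          enforce_eager=True)
llm.llm_engine.model_executor.save_sharded_state(path="/models/Qwen1.5-72B-Chat-tp4")

# Later startups read only each rank's own shard instead of the full checkpoint
# tensor_parallel_size times:
llm = LLM(model="/models/Qwen1.5-72B-Chat-tp4",
          tensor_parallel_size=4,
          load_format="sharded_state")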
In fact, vLLM currently loads the weights in slices, and it will not load the 436G you mentioned all at the same time. It occupies at most 16G (4*4) at a time, which has nothing to do with the loading blocks.
well, it will not load 436G at the same time, but in the end it has to load 436GB from disk in total... if your disk read speed is 200MB/s, then you need 2000s to just read from disk.
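As a rough back-of-the-envelope check of that estimate (the 436GB and 200MB/s figures are the ones quoted above; real throughput will vary):

# Rough load-time estimate: total bytes read from disk / sustained read bandwidth.
total_to_read_gb = 436       # full checkpoint read once per tensor-parallel rank, as stated above
read_bandwidth_mb_s = 200    # assumed sustained disk/NFS read speed

seconds = total_to_read_gb * 1024 / read_bandwidth_mb_s
print(f"~{seconds:.0f} s just to read the weights from disk")  # ~2200 s at 200 MB/s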
The juicenfs bandwidth we are using now is indeed only 100M/s, but I don't know why the GPU memory usage reaches 35316MiB on all 4 cards within 1 minute.
torch.empty is used first to allocate the empty weights; then the weights are loaded into CPU memory and copied from the CPU into the previously allocated empty weights.
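A minimal illustration of that allocate-then-copy pattern (this is not vLLM's actual loader, just a sketch of the behavior described above, which would also explain why GPU memory fills up long before loading finishes):

import torch

# 1. Allocate an uninitialized parameter directly on the GPU; the memory is
#    reserved immediately, which is why nvidia-smi shows high usage early on.
param = torch.empty(8192, 8192, dtype=torch.bfloat16, device="cuda")

# 2. Read the corresponding tensor from the checkpoint into CPU memory
#    (this step is bounded by disk/NFS read bandwidth).
loaded = torch.zeros(8192, 8192, dtype=torch.bfloat16)  # stand-in for a tensor read from safetensors

# 3. Copy the CPU tensor into the pre-allocated GPU buffer.
param.copy_(loaded, non_blocking=True)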
Your current environment
🐛 Describe the bug
I tested with both vllm 0.3.1 and 0.4.1, and the service startup blocked in NCCL. I hope you can find out the reason; I don't know much about NCCL.