modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0

load qwen110B model using get_vllm_engine throws error #1081

Open · phoenixbai opened 2 weeks ago

phoenixbai commented 2 weeks ago

Describe the bug
I try to load the Qwen1.5-110B-Chat model for batch inference using the code below, but it throws an error:

import os
import codecs
import json
from datasets import load_dataset

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
batch_size = 20
from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

model_type = ModelType.qwen1half_110b_chat
llm_engine = get_vllm_engine(model_type,
                             model_id_or_path='/mnt/nlp_nas_milvus/bmz/llm/models/Qwen1.5-110B-Chat',
                             tensor_parallel_size=8)
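
For context, the rest of the batch-inference flow presumably uses the helpers imported above; a minimal sketch in the style of the ms-swift vLLM docs (the queries here are placeholders, not taken from this report):

template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)

# placeholder queries for illustration only
request_list = [{'query': 'Hello!'}, {'query': 'Who are you?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")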

Your hardware and system info

>>> import transformers, swift, vllm, torch
>>> transformers.__version__
'4.40.1'
>>> swift.__version__
'2.0.5.post1'
>>> vllm.__version__
'0.4.3'
>>> torch.__version__
'2.3.0+cu121'

8 A100 GPU cards:

$nvidia-smi
Thu Jun  6 10:14:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

Additional context

[error.log](https://github.com/user-attachments/files/15598816/error.log)

The runtime error log is excerpted below; the full detailed log is attached above as error.log:

(RayWorkerWrapper pid=72466) INFO 06-05 15:40:23 pynccl.py:65] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
INFO 06-05 15:40:31 custom_all_reduce_utils.py:169] generating GPU P2P access cache for in /home/admin/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
ERROR 06-05 15:40:31 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 06-05 15:40:31 worker_base.py:148] Traceback (most recent call last):
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-05 15:40:31 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-05 15:40:31 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 114, in init_device
ERROR 06-05 15:40:31 worker_base.py:148]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
ERROR 06-05 15:40:31 worker_base.py:148]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
ERROR 06-05 15:40:31 worker_base.py:148]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 200, in initialize_model_parallel
ERROR 06-05 15:40:31 worker_base.py:148]     _TP_CA_COMMUNICATOR = CustomAllreduce(
ERROR 06-05 15:40:31 worker_base.py:148]                           ^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 166, in __init__
ERROR 06-05 15:40:31 worker_base.py:148]     if not _can_p2p(rank, world_size):
ERROR 06-05 15:40:31 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 73, in _can_p2p
ERROR 06-05 15:40:31 worker_base.py:148]     if not gpu_p2p_access_check(rank, i):
ERROR 06-05 15:40:31 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 173, in gpu_p2p_access_check
ERROR 06-05 15:40:31 worker_base.py:148]     cache[f"{_i}->{_j}"] = can_actually_p2p(_i, _j)
ERROR 06-05 15:40:31 worker_base.py:148]                            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 123, in can_actually_p2p
ERROR 06-05 15:40:31 worker_base.py:148]     pi.start()
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/process.py", line 121, in start
ERROR 06-05 15:40:31 worker_base.py:148]     self._popen = self._Popen(self)
ERROR 06-05 15:40:31 worker_base.py:148]                   ^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
ERROR 06-05 15:40:31 worker_base.py:148]     return Popen(process_obj)
ERROR 06-05 15:40:31 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
ERROR 06-05 15:40:31 worker_base.py:148]     super().__init__(process_obj)
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
ERROR 06-05 15:40:31 worker_base.py:148]     self._launch(process_obj)
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
ERROR 06-05 15:40:31 worker_base.py:148]     prep_data = spawn.get_preparation_data(process_obj._name)
ERROR 06-05 15:40:31 worker_base.py:148]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
ERROR 06-05 15:40:31 worker_base.py:148]     _check_not_importing_main()
ERROR 06-05 15:40:31 worker_base.py:148]   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
ERROR 06-05 15:40:31 worker_base.py:148]     raise RuntimeError('''
ERROR 06-05 15:40:31 worker_base.py:148] RuntimeError:
ERROR 06-05 15:40:31 worker_base.py:148]         An attempt has been made to start a new process before the
ERROR 06-05 15:40:31 worker_base.py:148]         current process has finished its bootstrapping phase.
ERROR 06-05 15:40:31 worker_base.py:148]
ERROR 06-05 15:40:31 worker_base.py:148]         This probably means that you are not using fork to start your
ERROR 06-05 15:40:31 worker_base.py:148]         child processes and you have forgotten to use the proper idiom
ERROR 06-05 15:40:31 worker_base.py:148]         in the main module:
ERROR 06-05 15:40:31 worker_base.py:148]
ERROR 06-05 15:40:31 worker_base.py:148]             if __name__ == '__main__':
ERROR 06-05 15:40:31 worker_base.py:148]                 freeze_support()
ERROR 06-05 15:40:31 worker_base.py:148]                 ...
ERROR 06-05 15:40:31 worker_base.py:148]
ERROR 06-05 15:40:31 worker_base.py:148]         The "freeze_support()" line can be omitted if the program
ERROR 06-05 15:40:31 worker_base.py:148]         is not going to be frozen to produce an executable.
ERROR 06-05 15:40:31 worker_base.py:148]
ERROR 06-05 15:40:31 worker_base.py:148]         To fix this issue, refer to the "Safe importing of main module"
ERROR 06-05 15:40:31 worker_base.py:148]         section in https://docs.python.org/3/library/multiprocessing.html
ERROR 06-05 15:40:31 worker_base.py:148]
hi-033124186002:67553:67553 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
hi-033124186002:67553:67553 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
hi-033124186002:67553:67553 [0] NCCL INFO P2P Chunksize set to 524288
hi-033124186002:67553:67553 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/IPC/read
hi-033124186002:67553:67553 [0] NCCL INFO Connected all rings
hi-033124186002:67553:67553 [0] NCCL INFO Connected all trees
hi-033124186002:67553:67553 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hi-033124186002:67553:67553 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
hi-033124186002:67553:67553 [0] NCCL INFO comm 0xc0ee800 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0xe5f92fc2f88ef3bf - Init COMPLETE
[rank0]: Traceback (most recent call last):
[rank0]:   File "<string>", line 1, in <module>
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
[rank0]:     exitcode = _main(fd, parent_sentinel)
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
[rank0]:     prepare(preparation_data)
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
[rank0]:     _fixup_main_from_path(data['init_main_from_path'])
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
[rank0]:     main_content = runpy.run_path(main_path,
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "<frozen runpy>", line 291, in run_path
[rank0]:   File "<frozen runpy>", line 98, in _run_module_code
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/cainiao_swift_finetune/infer/infer_in_batch.py", line 14, in <module>
[rank0]:     llm_engine = get_vllm_engine(model_type, model_id_or_path="/mnt/nlp_nas_milvus/bmz/llm/models/Qwen1.5-110B-Chat", tensor_parallel_size=8)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/swift/llm/utils/vllm_utils.py", line 91, in get_vllm_engine
[rank0]:     llm_engine = llm_engine_cls.from_engine_args(engine_args)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 359, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]:     raise e
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 114, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 200, in initialize_model_parallel
[rank0]:     _TP_CA_COMMUNICATOR = CustomAllreduce(
[rank0]:                           ^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 166, in __init__
[rank0]:     if not _can_p2p(rank, world_size):
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 73, in _can_p2p
[rank0]:     if not gpu_p2p_access_check(rank, i):
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 173, in gpu_p2p_access_check
[rank0]:     cache[f"{_i}->{_j}"] = can_actually_p2p(_i, _j)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 123, in can_actually_p2p
[rank0]:     pi.start()
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/process.py", line 121, in start
[rank0]:     self._popen = self._Popen(self)
[rank0]:                   ^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
[rank0]:     return Popen(process_obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]:     super().__init__(process_obj)
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]:     self._launch(process_obj)
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
[rank0]:     prep_data = spawn.get_preparation_data(process_obj._name)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
[rank0]:     _check_not_importing_main()
[rank0]:   File "/mnt/nlp_nas_milvus/bmz/llm/anaconda3/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
[rank0]:     raise RuntimeError('''
[rank0]: RuntimeError:
[rank0]:         An attempt has been made to start a new process before the
[rank0]:         current process has finished its bootstrapping phase.

[rank0]:         This probably means that you are not using fork to start your
[rank0]:         child processes and you have forgotten to use the proper idiom
[rank0]:         in the main module:

[rank0]:             if __name__ == '__main__':
[rank0]:                 freeze_support()
[rank0]:                 ...

[rank0]:         The "freeze_support()" line can be omitted if the program
[rank0]:         is not going to be frozen to produce an executable.

[rank0]:         To fix this issue, refer to the "Safe importing of main module"
[rank0]:         section in https://docs.python.org/3/library/multiprocessing.html
[rank0]:
hi-033124186002:67554:72752 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
hi-033124186002:67554:72752 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_11:1/RoCE [RO]; OOB eth0:33.124.186.2<0>
hi-033124186002:67554:72752 [0] NCCL INFO Using non-device net plugin version 0
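
The RuntimeError above is Python's standard guard against re-importing the main module under the multiprocessing "spawn" start method: vLLM's GPU P2P check (can_actually_p2p) spawns helper processes, and each one re-imports infer_in_batch.py, which re-runs the top-level get_vllm_engine call. A minimal restructuring of the reproduction script following the idiom the error message asks for, as a sketch rather than a confirmed fix for this issue:

import os

def main():
    # Keep the engine creation out of module import time so that processes
    # spawned by vLLM's P2P check can re-import this file safely.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
    from swift.llm import ModelType, get_vllm_engine

    model_type = ModelType.qwen1half_110b_chat
    llm_engine = get_vllm_engine(model_type, tensor_parallel_size=8)
    # ... build the template and run inference_vllm here ...

if __name__ == '__main__':
    main()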
Jintao-Huang commented 2 weeks ago

Run `df -h` and check whether the shared memory is a bit small.
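
For example, checking the shared-memory filesystem directly (assuming the default /dev/shm mount, which NCCL and the vLLM workers use for inter-process communication):

$ df -h /dev/shm    # a small tmpfs here can break multi-GPU initialization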