thunlp / OpenKE

An Open-Source Package for Knowledge Embedding (KE)

Error during distributed training with train_rotate_FB15K237_dist.py from OpenKE 2.0 #410

Open pipiyapi opened 3 months ago

pipiyapi commented 3 months ago

Hello, I get the following error when using train_rotate_FB15K237_dist.py from OpenKE 2.0. Is there a way to fix it? I would really appreciate any help.

    Input Files Path : ./benchmarks/data-390/
    The toolkit is importing datasets.
    The total of relations is 28.
    The total of entities is 700324.
    Input Files Path : ./benchmarks/data-390/
    The toolkit is importing datasets.
    The total of relations is 28.
    The total of entities is 700324.
    The total of train triples is 2849846.
    The total of train triples is 2849846.
    Input Files Path : ./benchmarks/data-390/
    Input Files Path : ./benchmarks/data-390/
    The total of test triples is 258713.
    The total of valid triples is 1293564.
    The total of test triples is 258713.
    The total of valid triples is 1293564.
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2646564) of binary: /home/jupyter-xingcheng/.conda/envs/openke/bin/python3.8
    Traceback (most recent call last):
      File "/home/jupyter-xingcheng/.conda/envs/openke/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
        run(args)
      File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

    train_rotate_data_390_dist.py FAILED

    Failures:
    [1]:
      time      : 2024-06-17_13:53:46
      host      : dell
      rank      : 1 (local_rank: 1)
      exitcode  : -11 (pid: 2646565)
      error_file: <N/A>
      traceback : Signal 11 (SIGSEGV) received by PID 2646565

    Root Cause (first observed failure):
    [0]:
      time      : 2024-06-17_13:53:46
      host      : dell
      rank      : 0 (local_rank: 0)
      exitcode  : -11 (pid: 2646564)
      error_file: <N/A>
      traceback : Signal 11 (SIGSEGV) received by PID 2646564

The command I ran was:

    WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 1234 train_rotate_data_390_dist.py
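The follow-up comment in this thread attributes the segmentation fault to faulty input data. A pre-flight check along the following lines may catch such problems before launching. It is only a sketch: it assumes the classic OpenKE benchmark layout (each `*2id.txt` file starts with a count line, and each triple line holds head ID, tail ID, relation ID), which may differ in OpenKE 2.0. The function names `read_count`, `check_triples`, and `check_dataset` are hypothetical helpers, not part of OpenKE.

```python
from pathlib import Path


def read_count(path):
    """Return the integer stored on the first line of a *2id.txt file."""
    with open(path, encoding="utf-8") as f:
        return int(f.readline().strip())


def check_triples(path, n_entities, n_relations):
    """Return a list of human-readable problems found in one triple file.

    Assumed format: first line = number of triples, then one
    "head tail relation" triple of integer IDs per line.
    """
    problems = []
    found = 0
    with open(path, encoding="utf-8") as f:
        declared = int(f.readline().strip())
        for lineno, line in enumerate(f, start=2):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            found += 1
            try:
                head, tail, rel = (int(x) for x in fields)
            except ValueError:
                # Raised for non-integer fields and for a wrong field count.
                problems.append(f"{path}:{lineno}: expected three integers")
                continue
            if not (0 <= head < n_entities and 0 <= tail < n_entities):
                problems.append(f"{path}:{lineno}: entity ID out of range")
            if not 0 <= rel < n_relations:
                problems.append(f"{path}:{lineno}: relation ID out of range")
    if found != declared:
        problems.append(f"{path}: header declares {declared} triples, found {found}")
    return problems


def check_dataset(data_dir):
    """Check train/valid/test files against the entity and relation counts."""
    data_dir = Path(data_dir)
    n_ent = read_count(data_dir / "entity2id.txt")    # assumed file name
    n_rel = read_count(data_dir / "relation2id.txt")  # assumed file name
    problems = []
    for name in ("train2id.txt", "valid2id.txt", "test2id.txt"):
        problems += check_triples(data_dir / name, n_ent, n_rel)
    return problems
```

For the dataset in this issue, the call would be `check_dataset("./benchmarks/data-390/")`. An empty list means none of these particular checks failed, which does not rule out other data problems.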

pipiyapi commented 3 months ago

The problem above is solved; it was caused by errors in my data. However, distributed training has now run into a new problem: only one GPU seems to be doing the work, although the other GPU also shows full usage.

    (openke) jupyter-xingcheng@dell:~/OpenKE2.0$ python -m torch.distributed.launch --nproc_per_node 2 train_rotate_data_390_dist.py
    /home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects --local_rank argument to be set, please
    change it to read from os.environ['LOCAL_RANK'] instead. See
    https://pytorch.org/docs/stable/distributed.html#launch-utility for
    further instructions

      warnings.warn(
    WARNING:torch.distributed.run:
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


    Input Files Path : ./benchmarks/data-390/
    The toolkit is importing datasets.
    The total of relations is 28.
    The total of entities is 700324.
    Input Files Path : ./benchmarks/data-390/
    The toolkit is importing datasets.
    The total of relations is 28.
    The total of entities is 700324.
    The total of train triples is 2849846.
    The total of train triples is 2849846.
    Input Files Path : ./benchmarks/data-390/
    Input Files Path : ./benchmarks/data-390/
    The total of test triples is 258712.
    The total of valid triples is 1293564.
    The total of test triples is 258712.
    The total of valid triples is 1293564.
    Finish initializing...
    0%| | 0/6000 [00:00<?, ?it/s]Finish initializing...
    Epoch 0 | loss: 1141.047029: 0%| | 1/6000 [03:04<307:32:46, 184.56s/it
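The FutureWarning in the log above asks scripts to read the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. The sketch below shows one way a training script could accept both launchers. `resolve_local_rank` is a hypothetical helper, and whether train_rotate_data_390_dist.py already does something equivalent is not known from this thread.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return this process's local rank.

    Prefers the LOCAL_RANK environment variable (set by torchrun) and falls
    back to a --local_rank argument (passed by the legacy launcher), then 0.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args leaves the script's other arguments untouched.
    args, _unknown = parser.parse_known_args(argv)
    return int(env.get("LOCAL_RANK", args.local_rank))


# Typical use inside a training script (requires PyTorch, so it is left
# as a comment here):
#   local_rank = resolve_local_rank()
#   torch.cuda.set_device(local_rank)
```

Binding each process to its own device this way is a common pattern for keeping every rank on a distinct GPU; it is offered as something to check, not as a confirmed cause of the behavior reported here.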

Here is the nvidia-smi output:

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 3090        Off |   00000000:3B:00.0 Off |                  N/A |
    | 70%   86C    P2            297W /  350W |   22428MiB /  24576MiB |     89%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 3090        Off |   00000000:D8:00.0 Off |                  N/A |
    | 88%   88C    P2            278W /  350W |   22428MiB /  24576MiB |     89%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A   3311491      C   ...cheng/.conda/envs/openke/bin/python      22422MiB |
    |    1   N/A  N/A   3311492      C   ...cheng/.conda/envs/openke/bin/python      22422MiB |
    +-----------------------------------------------------------------------------------------+
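To check over time whether every GPU is actually busy, rather than relying on a single snapshot, the per-GPU numbers can be polled in machine-readable form. The sketch below parses the CSV produced by `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The helper names and the 10% idle threshold are illustrative assumptions.

```python
import subprocess

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '0, 89, 22428, 24576' into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (int(x.strip()) for x in line.split(","))
        rows.append(
            {"index": index, "util_pct": util,
             "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def idle_gpus(rows, min_util_pct=10):
    """Return indices of GPUs whose utilization is below the threshold."""
    return [r["index"] for r in rows if r["util_pct"] < min_util_pct]


def query_gpus():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout)
```

Fed the values from the snapshot above (`"0, 89, 22428, 24576"` and `"1, 89, 22428, 24576"`), `idle_gpus` returns an empty list: in that snapshot both GPUs report 89% utilization and one python process each. Calling `query_gpus()` every few seconds during an epoch would show whether that holds throughout training.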