juripapay / hydronet_parallel

Parallel version of Hydronet

Error with large dataset #4

Open · juripapay opened this issue 1 year ago

juripapay commented 1 year ago

Running hydronet parallel with the large dataset fails when training on 2 GPUs:

(hydroFinal) pearl061@mn2:~/swDev/hydronet_parallel$ torchrun --standalone --nnodes=1 --nproc_per_node=2 train_direct_ddp.py --savedir './test_train_ddp2' --args 'train_args_min.json'

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Start running SchNet on rank 0.
Start running SchNet on rank 1.
[2427548]: world_size = 2, rank = 0, backend=nccl
num_gpus: 2
Running the DDP model
[2427549]: world_size = 2, rank = 1, backend=nccl
num_gpus: 2
Running the DDP model
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'librxe-rdmav34.so': librxe-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libmlx4-rdmav34.so': libmlx4-rdmav34.so: cannot open shared object file: No such file or directory
Using device cuda
Using device cuda

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2427549 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2427548) of binary: /mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/python
Traceback (most recent call last):
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_direct_ddp.py FAILED

Failures:

--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-04-11_13:08:53
  host       : mn2.pearl.scd.stfc.ac.uk
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 2427548)
  error_file :
  traceback  : Signal 9 (SIGKILL) received by PID 2427548
========================================================
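
For context, a minimal sketch of the torchrun/DDP pattern that the log above reflects: rank and world size come from the launcher's environment variables, the process group uses the nccl backend, the model is wrapped in DistributedDataParallel, and a DistributedSampler gives each rank its own shard of the data. The model and dataset below are placeholders, not the actual train_direct_ddp.py or SchNet code.

```python
# Minimal DDP training sketch, assuming the usual torchrun workflow.
# Model and dataset are placeholders, not the hydronet_parallel code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    print(f"[{os.getpid()}]: world_size = {world_size}, rank = {rank}, backend=nccl")

    # Placeholder data; the real script loads the hydronet dataset and
    # builds the model from the JSON args file.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)  # each rank draws only its shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = torch.nn.Linear(8, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that the failing rank was killed with SIGKILL (exitcode -9) rather than raising a Python exception; with a large dataset this often points at the host running out of memory and the kernel OOM killer terminating the process. That is only a guess from the log, but watching host memory use while both ranks load the data may help narrow it down.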