Reading in a large input file called min (~5 GB) times out after approximately 30 minutes. I think the problem is with the processes on the host node: we need to rethink the initial data distribution across the GPUs. Loading the data in a single process takes about 11 min 38 s, which is still too long. The log of the failed run is below, followed by a couple of diagnostic notes.
Start loading data: 2023-04-08 20:34:33.776725
Start loading data: 2023-04-08 20:34:33.786735
Start loading data: 2023-04-08 20:34:33.795761
Start loading data: 2023-04-08 20:34:33.798504
Start loading data: 2023-04-08 20:34:33.805028
Start loading data: 2023-04-08 20:34:33.814498
Start loading data: 2023-04-08 20:34:33.815777
Start loading data: 2023-04-08 20:34:33.824660
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285453 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285455 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285457 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285458 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 285450) of binary: /mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/python
Traceback (most recent call last):
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_direct_ddp.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-08_21:08:11
  host      : mn1.pearl.scd.stfc.ac.uk
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 285450)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 285450
=======================================================
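Two observations from the log, offered as a hedged diagnosis rather than a confirmed root cause: the ~30-minute failure window matches the default collective timeout in torch.distributed (30 minutes), and exitcode -9 (SIGKILL) on rank 0 often points at the kernel OOM killer when every rank tries to hold the full ~5 GB input in memory at once. If the timeout is the limiting factor, it can be raised when the process group is created; the two-hour value below is a placeholder, not a recommendation.

```python
# Sketch only: extend the process-group timeout beyond the 30-minute default.
# Assumes a torchrun-style env:// rendezvous and the NCCL backend.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # placeholder value; the default is 30 minutes
)
```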
We have refactored the code in #29 to address this issue. Loading is now significantly faster: it takes about 2.5 minutes on PEARL for the Water Minima dataset.
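Purely as an illustration of the general idea, not the actual change in #29: letting each rank memory-map the input and pull out only its own shard avoids every process reading the full ~5 GB file up front. The path `data/min.npy` and the single-NumPy-array layout below are assumptions made for the sketch, not the real on-disk format.

```python
# Illustrative sketch only -- not the code from #29. Assumes the dataset is one
# NumPy array on shared storage; the path "data/min.npy" is hypothetical.
import numpy as np
import torch.distributed as dist

def load_local_shard(path="data/min.npy"):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # mmap_mode="r" maps the file instead of reading all ~5 GB into RAM;
    # only the rows this rank indexes are actually paged in.
    data = np.load(path, mmap_mode="r")
    indices = np.array_split(np.arange(len(data)), world_size)[rank]
    return data[indices]  # copies just this rank's shard into memory
```

A `torch.utils.data.distributed.DistributedSampler` gives a similar per-rank split at batch level if the full dataset object has to exist on every rank anyway.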