Reading in a large input file called min (~5 GB) times out after approximately 30 minutes. I think the problem is with the processes on the host node: we need to rethink the initial data distribution across the GPUs. Loading the data in a single process takes about 11 min 38 s, which is still too long. The log of the failed run is below, followed by a couple of diagnostic notes.
Start loading data: 2023-04-08 20:34:33.776725
Start loading data: 2023-04-08 20:34:33.786735
Start loading data: 2023-04-08 20:34:33.795761
Start loading data: 2023-04-08 20:34:33.798504
Start loading data: 2023-04-08 20:34:33.805028
Start loading data: 2023-04-08 20:34:33.814498
Start loading data: 2023-04-08 20:34:33.815777
Start loading data: 2023-04-08 20:34:33.824660
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285453 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285455 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285457 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285458 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 285450) of binary: /mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/python
Traceback (most recent call last):
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_direct_ddp.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-08_21:08:11
  host      : mn1.pearl.scd.stfc.ac.uk
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 285450)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 285450
=======================================================
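Two observations from the log, offered as a hedged diagnosis rather than a confirmed root cause: the ~30-minute failure window matches the default collective timeout in torch.distributed (30 minutes), and exitcode -9 (SIGKILL) on rank 0 often points at the kernel OOM killer when every rank tries to hold the full ~5 GB input in memory at once. If the timeout is the limiting factor, it can be raised when the process group is created; the two-hour value below is a placeholder, not a recommendation.

```python
# Sketch only: extend the process-group timeout beyond the 30-minute default.
# Assumes a torchrun-style env:// rendezvous and the NCCL backend.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # placeholder value; the default is 30 minutes
)
```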
We have refactored the code in #29 to address this issue. Loading is now significantly faster: it takes about 2.5 minutes on PEARL for the Water Minima dataset.
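Purely as an illustration of the general idea, not the actual change in #29: letting each rank memory-map the input and pull out only its own shard avoids every process reading the full ~5 GB file up front. The path `data/min.npy` and the single-NumPy-array layout below are assumptions made for the sketch, not the real on-disk format.

```python
# Illustrative sketch only -- not the code from #29. Assumes the dataset is one
# NumPy array on shared storage; the path "data/min.npy" is hypothetical.
import numpy as np
import torch.distributed as dist

def load_local_shard(path="data/min.npy"):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # mmap_mode="r" maps the file instead of reading all ~5 GB into RAM;
    # only the rows this rank indexes are actually paged in.
    data = np.load(path, mmap_mode="r")
    indices = np.array_split(np.arange(len(data)), world_size)[rank]
    return data[indices]  # copies just this rank's shard into memory
```

A `torch.utils.data.distributed.DistributedSampler` gives a similar per-rank split at batch level if the full dataset object has to exist on every rank anyway.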