stfc-sciml / sciml-bench

SciML Benchmarking Suite for AI for Science
MIT License

Min data #28

Closed. juripapay closed this issue 1 year ago.

juripapay commented 1 year ago

Reading in a large input dataset called min (~5 GB) times out after approximately 30 minutes. I think the problem is with the processes on the host node. We need to think about how the data is initially distributed across the GPUs. Even loading the data for a single process takes about 11 minutes 38 seconds, which is still too long.

```
Start loading data: 2023-04-08 20:34:33.776725
Start loading data: 2023-04-08 20:34:33.786735
Start loading data: 2023-04-08 20:34:33.795761
Start loading data: 2023-04-08 20:34:33.798504
Start loading data: 2023-04-08 20:34:33.805028
Start loading data: 2023-04-08 20:34:33.814498
Start loading data: 2023-04-08 20:34:33.815777
Start loading data: 2023-04-08 20:34:33.824660
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285453 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285455 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285457 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285458 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 285450) of binary: /mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/python
Traceback (most recent call last):
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/beegfs/home/pearl061/.conda/envs/hydroFinal/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_direct_ddp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-08_21:08:11
  host      : mn1.pearl.scd.stfc.ac.uk
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 285450)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 285450
============================================================
```
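Exit code -9 means rank 0 was killed with SIGKILL by something outside the script, commonly the out-of-memory killer or the batch system's resource limits, which fits a situation where every DDP rank tries to hold the full ~5 GB input in host memory at once. As a minimal sketch of one way to spread the initial load (this is not the sciml-bench code; the `MemmapMinDataset` class, the `make_loader` helper, and the flat float32 on-disk layout are all assumptions for illustration), each rank can memory-map the file and let `DistributedSampler` hand it a disjoint slice of indices:

```python
import numpy as np
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler


class MemmapMinDataset(Dataset):
    """Reads samples lazily from a memory-mapped file instead of loading it all up front."""

    def __init__(self, path, n_features):
        # np.memmap only pages in the rows that are actually indexed,
        # so constructing the dataset is cheap even for a ~5 GB file.
        flat = np.memmap(path, dtype=np.float32, mode="r")
        self.data = flat.reshape(-1, n_features)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Copy a single row out of the mapping and hand it to PyTorch.
        return torch.from_numpy(np.array(self.data[idx]))


def make_loader(path, n_features, batch_size=32):
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK, so init needs no arguments here.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    dataset = MemmapMinDataset(path, n_features)
    # DistributedSampler gives each rank a disjoint 1/world_size slice of indices,
    # so no single process ever needs the whole dataset in host RAM.
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      num_workers=4, pin_memory=True)
```

With a pattern like this, each process only pages in the batches it actually iterates, so start-up time and host-memory use no longer grow with the number of ranks per node.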
samueljackson92 commented 1 year ago

We have refactored the code in #29 to address this issue. Loading is now significantly faster: about 2.5 minutes on PEARL for the Water Minima dataset.
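The actual changes live in #29; purely as a hedged illustration (not taken from that PR), a small helper like the hypothetical `timed_load` below makes it easy to report per-rank load times, so figures like the 2.5 minutes quoted above can be compared across runs and machines:

```python
import time

import torch.distributed as dist


def timed_load(load_fn, *args, **kwargs):
    """Call load_fn and print the per-rank wall-clock loading time in minutes."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    start = time.perf_counter()
    data = load_fn(*args, **kwargs)
    minutes = (time.perf_counter() - start) / 60.0
    print(f"[rank {rank}] data loaded in {minutes:.1f} min")
    return data
```

A call such as `loader = timed_load(make_loader, "/path/to/min.dat", n_features=128)` (path and feature count hypothetical) prints one line per rank with its wall-clock loading time.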