Ran the container with the --ipc=host flag to increase the shared memory, but now get this error:
I0309 23:04:55.845499 140252844889856 logging_writer.py:48] [1428] global_step=1428, preemption_count=0, score=498.206863, test/loss=0.303527, test/num_examples=3581, test/ssim=0.721832, total_duration=854.198534, train/loss=0.290614, train/ssim=0.718649, validation/loss=0.301838, validation/num_examples=3554, validation/ssim=0.704187
I0309 23:04:55.993542 140327985198912 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/timing/fastmri_pytorch/trial_1/checkpoint_1428.
I0309 23:05:17.257784 140252693919488 logging_writer.py:48] [1500] global_step=1500, grad_norm=0.460632, loss=2.454625
I0309 23:05:17.281060 140327985198912 pytorch_submission_base.py:86] 1500) loss = 2.455, grad_norm = 0.461
I0309 23:06:16.078876 140327985198912 spec.py:298] Evaluating on the training split.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node '86591afb6a36_9_0' has failed to send a keep-alive heartbeat to the rendezvous '85fd650a-5131-47a6-9070-c1ac9c922376' due to an error of type RendezvousTimeoutError.
[E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805169 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805170 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805167 milliseconds before timing out.
Fatal Python error: Floating point exception
Fatal Python error: Floating point exception
Fatal Python error: Floating point exception
Fatal Python error: Floating point exception
Fatal Python error: Floating point exception
Full log is in bucket mlcommons-runs/timing/fastmri_pytorch_03-09-2023-22-49-59.log
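For context on the watchdog messages: ranks 1-7 appear to be waiting in a BROADCAST while rank 0 evaluates on the training split, and they abort once the default 30-minute NCCL timeout elapses. Purely as an untested sketch of my own (not a confirmed fix and not part of the original report), the collective timeout could be raised where the process group is initialized, assuming the harness calls torch.distributed.init_process_group directly:

import datetime
import torch.distributed as dist

# Untested sketch (my assumption, not from the report): raise the NCCL collective
# timeout above the default 30 minutes (the 1800000 ms seen in the watchdog
# messages), so the other ranks do not abort while rank 0 runs a long evaluation.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)

That would only hide the hang, though; whatever makes the evaluation step stall or crash would still need to be tracked down.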
If trying to reproduce the error with the docker command below, please change the experiment directory flag from timing to debugging or something similar, since the results will be transferred to GCP buckets under the experiment name. Also, to avoid memory errors, set the --ipc=host flag to increase the shared memory available to the container.
So to reproduce the above error you can run:
docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --ipc=host --gpus all us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image -b true -d fastmri -f pytorch -s reference_algorithms/target_setting_algorithms/pytorch_nesterov.py -w fastmri -t reference_algorithms/target_setting_algorithms/fastmri/tuning_search_space.json -e debugging -m 36189 -b true
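(Side note, and an assumption on my part rather than something verified on this machine: if sharing the host IPC namespace via --ipc=host is undesirable, Docker's --shm-size flag, e.g. --shm-size=16g, is another way to enlarge the container's shared memory for the data-loader workers.)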
Update: below is the original ResourceExhausted error. I passed in a flag to increase the shared memory and am now getting a floating point exception for some reason. See comment 1 for the flag.
Original: FastMRI PyTorch workload broken with a ResourceExhausted error in the data iterator.
Description
Traceback
Steps to Reproduce
On kasimbeg-6 run: