mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

FastMRI Pytorch Floating point exception #344

Closed. priyakasimbeg closed this issue 1 year ago

priyakasimbeg commented 1 year ago

Update: below is the original ResourceExhausted error. I passed in a flag to increase the shared memory and am now getting a floating point exception for some reason. See the first comment below for the flag.

Original: the FastMRI PyTorch workload is broken with a ResourceExhausted error in the data iterator.

Description

Traceback

I0309 22:05:40.139174 140064564143936 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/timing/fastmri_pytorch/trial_1/checkpoint_1517.
I0309 22:07:00.167249 140064564143936 spec.py:298] Evaluating on the training split.
Traceback (most recent call last):
  File "submission_runner.py", line 599, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 572, in main
    score = score_submission_on_workload(workload,
  File "submission_runner.py", line 507, in score_submission_on_workload
    timing, metrics = train_once(workload, global_batch_size,
  File "submission_runner.py", line 351, in train_once
    latest_eval_result = workload.eval_model(global_eval_batch_size,
  File "/algorithmic-efficiency/algorithmic_efficiency/spec.py", line 299, in eval_model
    train_metrics = self._eval_model_on_split(
  File "/algorithmic-efficiency/algorithmic_efficiency/workloads/fastmri/fastmri_pytorch/workload.py", line 238, in _eval_model_on_split
    batch = next(self._eval_iters[split])
  File "/algorithmic-efficiency/algorithmic_efficiency/workloads/fastmri/fastmri_pytorch/workload.py", line 51, in _build_input_queue
    batch = next(np_iter)  # pylint: disable=stop-iteration-return
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 766, in __next__
    return self._next_internal()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 749, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3017, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Failed to allocate memory for the batch of component 0 [Op:IteratorGetNext]
2023-03-09 22:07:54.431683: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-03-09 22:07:55.019653: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
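Side note: the trailing tf.data warnings about the cache look like cleanup noise once the iterator failed, but for reference, here is a minimal sketch (toy dataset, not the actual FastMRI input pipeline) of the `cache()`/`take()`/`repeat()` ordering the warning recommends:

```python
import tensorflow as tf

# Toy stand-in dataset; the real FastMRI pipeline is more involved (assumption).
ds = tf.data.Dataset.range(100)

# Pattern the warning complains about: caching before take() means the iterator
# never reads the cached dataset to the end, so the partial cache is discarded.
bad = ds.cache().take(10).repeat()

# Ordering recommended by the warning: truncate first, then cache, then repeat,
# so every pass reads the (small) cached dataset in full.
good = ds.take(10).cache().repeat()
```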

Steps to Reproduce

On kasimbeg-6 run:

docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --gpus all us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image -b true -d fastmri -f pytorch -s reference_algorithms/target_setting_algorithms/pytorch_nesterov.py -w fastmri -t reference_algorithms/target_setting_algorithms/fastmri/tuning_search_space.json -e timing -m 36189 -b true
priyakasimbeg commented 1 year ago

Ran the container with the --ipc=host flag to increase the shared memory, but now get this error:

I0309 23:04:55.845499 140252844889856 logging_writer.py:48] [1428] global_step=1428, preemption_count=0, score=498.206863, test/loss=0.303527, test/num_examples=3581, test/ssim=0.721832, total_duration=854.198534, train/loss=0.290614, train/ssim=0.718649, validation/loss=0.301838, validation/num_examples=3554, validation/ssim=0.704187
I0309 23:04:55.993542 140327985198912 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/timing/fastmri_pytorch/trial_1/checkpoint_1428.
I0309 23:05:17.257784 140252693919488 logging_writer.py:48] [1500] global_step=1500, grad_norm=0.460632, loss=2.454625
I0309 23:05:17.281060 140327985198912 pytorch_submission_base.py:86] 1500) loss = 2.455, grad_norm = 0.461
I0309 23:06:16.078876 140327985198912 spec.py:298] Evaluating on the training split.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node '86591afb6a36_9_0' has failed to send a keep-alive heartbeat to the rendezvous '85fd650a-5131-47a6-9070-c1ac9c922376' due to an error of type RendezvousTimeoutError.
[E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805169 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805170 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805166 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18209, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805167 milliseconds before timing out.
Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Floating point exceptionFloating point exceptionFloating point exceptionFloating point exception

Fatal Python error: Floating point exception

Full log is in the bucket: mlcommons-runs/timing/fastmri_pytorch_03-09-2023-22-49-59.log
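The watchdog timeouts on ranks 1-7 are likely secondary symptoms: once a rank dies with the floating point exception during eval, the pending BROADCAST on the remaining ranks can never complete and aborts after the default 30-minute NCCL timeout (the 1800000 ms in the log). If it helps while debugging, that timeout can be raised where the process group is created; a minimal sketch, not necessarily how the benchmark code initializes torch.distributed:

```python
import datetime
import torch.distributed as dist

# Hypothetical debugging tweak, not the benchmark's actual setup: raise the NCCL
# collective timeout (default 30 min, i.e. the 1800000 ms seen in the log) so the
# healthy ranks do not abort while the crashing rank is being inspected.
# Assumes the usual torchrun/env:// launch, so MASTER_ADDR etc. are already set.
dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(hours=2),
)
```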

priyakasimbeg commented 1 year ago

If you are trying to reproduce the error with the docker command above, please change the experiment directory flag from timing to debugging or something similar, since the results will be transferred to GCP buckets under the experiment name. Also, to avoid memory errors, set the --ipc=host flag to increase the shared memory on the container.

So to reproduce the above error you can run:

docker run -t -d -v /home/kasimbeg/data/:/data/ -v /home/kasimbeg/experiment_runs/:/experiment_runs -v /home/kasimbeg/experiment_runs/logs:/logs --ipc=host --gpus all us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/base_image -b true -d fastmri -f pytorch  -s reference_algorithms/target_setting_algorithms/pytorch_nesterov.py -w fastmri -t reference_algorithms/target_setting_algorithms/fastmri/tuning_search_space.json  -e debugging -m 36189 -b true