ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][AIR output] Weird output at the end of the training #36866

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

What happened + What you expected to happen

I ran this torch fine-tuning resnet example: https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html

The full output is attached further below.

The additional output at the end (quoted immediately below) is identical to output that appeared near the beginning of the run. This feels like a bug to me. The entire training takes a relatively long time, so the duplication shouldn't simply be due to the training finishing quickly.

(RayTrainWorker pid=4566, ip=10.0.0.188) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
  0%|          | 0.00/97.8M [00:00<?, ?B/s] [repeated 3x across cluster]
 71%|███████▏  | 69.9M/97.8M [00:00<00:00, 256MB/s] [repeated 9x across cluster]
100%|██████████| 97.8M/97.8M [00:00<00:00, 256MB/s] [repeated 2x across cluster]
(RayTrainWorker pid=4533, ip=10.0.0.186) Moving model to device: cuda:0 [repeated 3x across cluster]
(RayTrainWorker pid=4566, ip=10.0.0.188) Wrapping provided model in DistributedDataParallel. [repeated 3x across cluster]
(base)  ray@g-c2c56abb267620001:~/default$ /home/ray/anaconda3/bin/python /home/ray/default/test-15.py
2023-06-27 09:26:47,550 INFO worker.py:1427 -- Connecting to existing Ray cluster at address: 10.0.0.28:6379...
2023-06-27 09:26:47,558 INFO worker.py:1608 -- Connected to Ray cluster. View the dashboard at https://session-kv6qd59td8w6e5rd7lz81ki4np.i.anyscaleuserdata-staging.com 
2023-06-27 09:26:48,031 INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_c8edb18e2d92888e0a3d0c2c1de789e1.zip' (168.93MiB) to Ray cluster...
2023-06-27 09:26:48,608 INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_c8edb18e2d92888e0a3d0c2c1de789e1.zip'.
2023-06-27 09:26:49,628 INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2023-06-27 09:26:49,630 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://docs.ray.io/en/master/ray-air/experimental-features.html

View detailed results here: /tmp/ray_results/finetune-resnet
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray_results/finetune-resnet`

(autoscaler +54s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +54s) Adding 4 node(s) of type worker-node-type-1.
2023-06-27 09:27:50,289 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 5.0 CPUs and 4.0 GPUs, but the cluster only has 32.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +1m34s) Resized to 96 CPUs, 4 GPUs.
Training started with configuration:
╭──────────────────────────────────────╮
│ Training config                      │
├──────────────────────────────────────┤
│ train_loop_config/batch_size      32 │
│ train_loop_config/input_size     224 │
│ train_loop_config/lr           0.001 │
│ train_loop_config/momentum       0.9 │
│ train_loop_config/num_epochs      10 │
╰──────────────────────────────────────╯

(TorchTrainer pid=4541, ip=10.0.0.187) Starting distributed worker processes: ['4601 (10.0.0.187)', '4388 (10.0.0.184)', '4533 (10.0.0.186)', '4566 (10.0.0.188)']
(RayTrainWorker pid=4601, ip=10.0.0.187) Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=4388, ip=10.0.0.184) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 20%|█▉        | 19.5M/97.8M [00:00<00:00, 204MB/s]
100%|██████████| 97.8M/97.8M [00:00<00:00, 331MB/s]
(RayTrainWorker pid=4388, ip=10.0.0.184) Moving model to device: cuda:0
(RayTrainWorker pid=4388, ip=10.0.0.184) Wrapping provided model in DistributedDataParallel.
Training finished iteration 1 at 2023-06-27 09:28:45. Total running time: 1min 55s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       17.558 │
│ time_total_s                           17.558 │
│ training_iteration                          1 │
│ acc                                         1 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.49166 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 1 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000000

2023-06-27 09:28:53,085 WARNING util.py:315 -- The `process_trial_save` operation took 7.106 s, which may be a performance bottleneck.
2023-06-27 09:28:53,085 WARNING trial_runner.py:928 -- Consider turning off forced head-worker trial checkpoint syncs by setting sync_on_checkpoint=False. Note that this may result in faulty trial restoration if a failure occurs while the checkpoint is being synced from the worker to the head node.
Training finished iteration 2 at 2023-06-27 09:28:54. Total running time: 2min 4s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      8.63442 │
│ time_total_s                          26.1924 │
│ training_iteration                          2 │
│ acc                                         1 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.32971 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 2 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000001

2023-06-27 09:28:57,798 WARNING util.py:315 -- The `process_trial_save` operation took 3.188 s, which may be a performance bottleneck.
Training finished iteration 3 at 2023-06-27 09:28:59. Total running time: 2min 9s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       4.5636 │
│ time_total_s                           30.756 │
│ training_iteration                          3 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.24576 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 3 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000002

2023-06-27 09:29:02,186 WARNING util.py:315 -- The `process_trial_save` operation took 3.013 s, which may be a performance bottleneck.
Training finished iteration 4 at 2023-06-27 09:29:03. Total running time: 2min 13s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       4.4174 │
│ time_total_s                          35.1734 │
│ training_iteration                          4 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.19502 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 4 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000003

(autoscaler +2m20s) Adding 4 node(s) of type worker-node-type-1.
2023-06-27 09:29:06,546 WARNING util.py:315 -- The `process_trial_save` operation took 2.955 s, which may be a performance bottleneck.
Training finished iteration 5 at 2023-06-27 09:29:07. Total running time: 2min 17s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.37164 │
│ time_total_s                           39.545 │
│ training_iteration                          5 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.17574 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 5 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000004

2023-06-27 09:29:11,158 WARNING util.py:315 -- The `process_trial_save` operation took 3.195 s, which may be a performance bottleneck.
Training finished iteration 6 at 2023-06-27 09:29:12. Total running time: 2min 22s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.60696 │
│ time_total_s                           44.152 │
│ training_iteration                          6 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.15929 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 6 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000005

2023-06-27 09:29:15,862 WARNING util.py:315 -- The `process_trial_save` operation took 3.291 s, which may be a performance bottleneck.
Training finished iteration 7 at 2023-06-27 09:29:17. Total running time: 2min 27s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.76724 │
│ time_total_s                          48.9192 │
│ training_iteration                          7 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.14892 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 7 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000006

2023-06-27 09:29:20,472 WARNING util.py:315 -- The `process_trial_save` operation took 3.135 s, which may be a performance bottleneck.
Training finished iteration 8 at 2023-06-27 09:29:21. Total running time: 2min 31s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.56761 │
│ time_total_s                          53.4868 │
│ training_iteration                          8 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.15599 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 8 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000007

2023-06-27 09:29:25,048 WARNING util.py:315 -- The `process_trial_save` operation took 3.141 s, which may be a performance bottleneck.
Training finished iteration 9 at 2023-06-27 09:29:26. Total running time: 2min 36s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.59671 │
│ time_total_s                          58.0835 │
│ training_iteration                          9 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.14384 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 9 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000008

2023-06-27 09:29:29,678 WARNING util.py:315 -- The `process_trial_save` operation took 3.177 s, which may be a performance bottleneck.
Training finished iteration 10 at 2023-06-27 09:29:31. Total running time: 2min 41s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.61909 │
│ time_total_s                          62.7026 │
│ training_iteration                         10 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.13437 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 10 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000009

2023-06-27 09:29:34,231 WARNING util.py:315 -- The `process_trial_save` operation took 3.110 s, which may be a performance bottleneck.
Training completed after 10 iterations at 2023-06-27 09:29:36. Total running time: 2min 46s

(RayTrainWorker pid=4566, ip=10.0.0.188) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
  0%|          | 0.00/97.8M [00:00<?, ?B/s] [repeated 3x across cluster]
 71%|███████▏  | 69.9M/97.8M [00:00<00:00, 256MB/s] [repeated 9x across cluster]
100%|██████████| 97.8M/97.8M [00:00<00:00, 256MB/s] [repeated 2x across cluster]
(RayTrainWorker pid=4533, ip=10.0.0.186) Moving model to device: cuda:0 [repeated 3x across cluster]
(RayTrainWorker pid=4566, ip=10.0.0.188) Wrapping provided model in DistributedDataParallel. [repeated 3x across cluster]
(base)  ray@g-c2c56abb267620001:~/default$ 

Versions / Dependencies

nightly

Reproduction script

https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html
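For orientation, here is a minimal sketch of the shape of that example (not a verbatim copy): the config values come from the "Training config" table in the log above, the worker count and GPU usage come from the scheduler messages, and import paths may differ slightly between the nightly used here and other Ray versions.

```python
# Sketch only: the real example builds a torchvision ResNet-50, wraps it with
# ray.train.torch.prepare_model() / prepare_data_loader(), and reports loss/acc
# per epoch. The loop body is stubbed out here to keep the sketch self-contained.
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for _epoch in range(config["num_epochs"]):
        # The real loop fine-tunes the model and computes these metrics.
        train.report({"loss": 0.0, "acc": 0.0})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={
        "batch_size": 32,
        "input_size": 224,
        "lr": 0.001,
        "momentum": 0.9,
        "num_epochs": 10,
    },
    # The log shows 4 distributed training workers, each on a GPU node.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```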

Issue Severity

Low: It annoys or frustrates me.

scottsun94 commented 1 year ago

@rkooo567 this seems more like a core issue?

rkooo567 commented 1 year ago

This actually does seem like a bug, given that the block is printed about a minute after the same messages were originally printed. It is not clear whether it is a Core or a Tune issue (I remember Tune has a special way of printing logs, which could be related?). @scottsun94 can you provide a repro script and bump the priority?

rkooo567 commented 1 year ago

workspace to repro https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_7ll3v91wn7s1fuy4mucdtdki8m/ses_kv6qd59td8w6e5rd7lz81ki4np

krfricke commented 1 year ago

Tune just uses regular stdout/stderr logging. The only custom thing we do is sometimes duplicating logs to a file, but this is not the case here and shouldn't affect it. I also believe this is a core issue.
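Since the trailing block carries the "[repeated Nx across cluster]" suffix added by Ray's log deduplication, one way to narrow this down (suggested by the dedup notice in the log itself, not a confirmed fix) is to rerun the script with deduplication disabled and check whether the block is still emitted twice. A minimal sketch, assuming the variable is picked up by the driver process:

```python
# Diagnostic sketch, not a fix: disable Ray's log deduplication so worker lines
# are printed as they arrive, making it easier to see whether the trailing block
# is a late flush of the dedup buffer or genuinely re-emitted output.
import os

# Must be set before Ray starts; exporting RAY_DEDUP_LOGS=0 in the shell before
# launching the driver script has the same effect.
os.environ["RAY_DEDUP_LOGS"] = "0"

import ray  # noqa: E402 -- imported after setting the env var on purpose

ray.init()
# ...then run the same TorchTrainer workload as in the reproduction sketch above.
```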