ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][AIR output] Weird output at the end of the training #36866

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

What happened + What you expected to happen

I ran this torch fine-tuning resnet example: https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html

The full output is attached further below.

The additional output at the end (quoted immediately below) is identical to output that appeared near the beginning of the run. This feels like a bug to me. The entire training takes a relatively long time, so the duplication shouldn't simply be due to the training finishing quickly.

(RayTrainWorker pid=4566, ip=10.0.0.188) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
  0%|          | 0.00/97.8M [00:00<?, ?B/s] [repeated 3x across cluster]
 71%|███████▏  | 69.9M/97.8M [00:00<00:00, 256MB/s] [repeated 9x across cluster]
100%|██████████| 97.8M/97.8M [00:00<00:00, 256MB/s] [repeated 2x across cluster]
(RayTrainWorker pid=4533, ip=10.0.0.186) Moving model to device: cuda:0 [repeated 3x across cluster]
(RayTrainWorker pid=4566, ip=10.0.0.188) Wrapping provided model in DistributedDataParallel. [repeated 3x across cluster]
(base)  ray@g-c2c56abb267620001:~/default$ /home/ray/anaconda3/bin/python /home/ray/default/test-15.py
2023-06-27 09:26:47,550 INFO worker.py:1427 -- Connecting to existing Ray cluster at address: 10.0.0.28:6379...
2023-06-27 09:26:47,558 INFO worker.py:1608 -- Connected to Ray cluster. View the dashboard at https://session-kv6qd59td8w6e5rd7lz81ki4np.i.anyscaleuserdata-staging.com 
2023-06-27 09:26:48,031 INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_c8edb18e2d92888e0a3d0c2c1de789e1.zip' (168.93MiB) to Ray cluster...
2023-06-27 09:26:48,608 INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_c8edb18e2d92888e0a3d0c2c1de789e1.zip'.
2023-06-27 09:26:49,628 INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2023-06-27 09:26:49,630 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://docs.ray.io/en/master/ray-air/experimental-features.html

View detailed results here: /tmp/ray_results/finetune-resnet
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray_results/finetune-resnet`

(autoscaler +54s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +54s) Adding 4 node(s) of type worker-node-type-1.
2023-06-27 09:27:50,289 WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. Training has not started in the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 5.0 CPUs and 4.0 GPUs, but the cluster only has 32.0 CPUs and 0 GPUs available. Stop the training and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
(autoscaler +1m34s) Resized to 96 CPUs, 4 GPUs.
Training started with configuration:
╭──────────────────────────────────────╮
│ Training config                      │
├──────────────────────────────────────┤
│ train_loop_config/batch_size      32 │
│ train_loop_config/input_size     224 │
│ train_loop_config/lr           0.001 │
│ train_loop_config/momentum       0.9 │
│ train_loop_config/num_epochs      10 │
╰──────────────────────────────────────╯

(TorchTrainer pid=4541, ip=10.0.0.187) Starting distributed worker processes: ['4601 (10.0.0.187)', '4388 (10.0.0.184)', '4533 (10.0.0.186)', '4566 (10.0.0.188)']
(RayTrainWorker pid=4601, ip=10.0.0.187) Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=4388, ip=10.0.0.184) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 20%|█▉        | 19.5M/97.8M [00:00<00:00, 204MB/s]
100%|██████████| 97.8M/97.8M [00:00<00:00, 331MB/s]
(RayTrainWorker pid=4388, ip=10.0.0.184) Moving model to device: cuda:0
(RayTrainWorker pid=4388, ip=10.0.0.184) Wrapping provided model in DistributedDataParallel.
Training finished iteration 1 at 2023-06-27 09:28:45. Total running time: 1min 55s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       17.558 │
│ time_total_s                           17.558 │
│ training_iteration                          1 │
│ acc                                         1 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.49166 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 1 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000000

2023-06-27 09:28:53,085 WARNING util.py:315 -- The `process_trial_save` operation took 7.106 s, which may be a performance bottleneck.
2023-06-27 09:28:53,085 WARNING trial_runner.py:928 -- Consider turning off forced head-worker trial checkpoint syncs by setting sync_on_checkpoint=False. Note that this may result in faulty trial restoration if a failure occurs while the checkpoint is being synced from the worker to the head node.
Training finished iteration 2 at 2023-06-27 09:28:54. Total running time: 2min 4s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      8.63442 │
│ time_total_s                          26.1924 │
│ training_iteration                          2 │
│ acc                                         1 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.32971 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 2 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000001

2023-06-27 09:28:57,798 WARNING util.py:315 -- The `process_trial_save` operation took 3.188 s, which may be a performance bottleneck.
Training finished iteration 3 at 2023-06-27 09:28:59. Total running time: 2min 9s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       4.5636 │
│ time_total_s                           30.756 │
│ training_iteration                          3 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.24576 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 3 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000002

2023-06-27 09:29:02,186 WARNING util.py:315 -- The `process_trial_save` operation took 3.013 s, which may be a performance bottleneck.
Training finished iteration 4 at 2023-06-27 09:29:03. Total running time: 2min 13s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                       4.4174 │
│ time_total_s                          35.1734 │
│ training_iteration                          4 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.19502 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 4 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000003

(autoscaler +2m20s) Adding 4 node(s) of type worker-node-type-1.
2023-06-27 09:29:06,546 WARNING util.py:315 -- The `process_trial_save` operation took 2.955 s, which may be a performance bottleneck.
Training finished iteration 5 at 2023-06-27 09:29:07. Total running time: 2min 17s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.37164 │
│ time_total_s                           39.545 │
│ training_iteration                          5 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.17574 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 5 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000004

2023-06-27 09:29:11,158 WARNING util.py:315 -- The `process_trial_save` operation took 3.195 s, which may be a performance bottleneck.
Training finished iteration 6 at 2023-06-27 09:29:12. Total running time: 2min 22s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.60696 │
│ time_total_s                           44.152 │
│ training_iteration                          6 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.15929 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 6 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000005

2023-06-27 09:29:15,862 WARNING util.py:315 -- The `process_trial_save` operation took 3.291 s, which may be a performance bottleneck.
Training finished iteration 7 at 2023-06-27 09:29:17. Total running time: 2min 27s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.76724 │
│ time_total_s                          48.9192 │
│ training_iteration                          7 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.14892 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 7 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000006

2023-06-27 09:29:20,472 WARNING util.py:315 -- The `process_trial_save` operation took 3.135 s, which may be a performance bottleneck.
Training finished iteration 8 at 2023-06-27 09:29:21. Total running time: 2min 31s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.56761 │
│ time_total_s                          53.4868 │
│ training_iteration                          8 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.15599 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 8 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000007

2023-06-27 09:29:25,048 WARNING util.py:315 -- The `process_trial_save` operation took 3.141 s, which may be a performance bottleneck.
Training finished iteration 9 at 2023-06-27 09:29:26. Total running time: 2min 36s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.59671 │
│ time_total_s                          58.0835 │
│ training_iteration                          9 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.14384 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 9 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000008

2023-06-27 09:29:29,678 WARNING util.py:315 -- The `process_trial_save` operation took 3.177 s, which may be a performance bottleneck.
Training finished iteration 10 at 2023-06-27 09:29:31. Total running time: 2min 41s
╭───────────────────────────────────────────────╮
│ Training result                               │
├───────────────────────────────────────────────┤
│ time_this_iter_s                      4.61909 │
│ time_total_s                          62.7026 │
│ training_iteration                         10 │
│ acc                                   0.97368 │
│ config/train_loop_config/batch_size        32 │
│ config/train_loop_config/input_size       224 │
│ config/train_loop_config/lr             0.001 │
│ config/train_loop_config/momentum         0.9 │
│ config/train_loop_config/num_epochs        10 │
│ loss                                  0.13437 │
╰───────────────────────────────────────────────╯

Training saved a checkpoint for iteration 10 at: /tmp/ray_results/finetune-resnet/TorchTrainer_6ad9b_00000_0_2023-06-27_09-26-50/checkpoint_000009

2023-06-27 09:29:34,231 WARNING util.py:315 -- The `process_trial_save` operation took 3.110 s, which may be a performance bottleneck.
Training completed after 10 iterations at 2023-06-27 09:29:36. Total running time: 2min 46s

(RayTrainWorker pid=4566, ip=10.0.0.188) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/ray/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
  0%|          | 0.00/97.8M [00:00<?, ?B/s] [repeated 3x across cluster]
 71%|███████▏  | 69.9M/97.8M [00:00<00:00, 256MB/s] [repeated 9x across cluster]
100%|██████████| 97.8M/97.8M [00:00<00:00, 256MB/s] [repeated 2x across cluster]
(RayTrainWorker pid=4533, ip=10.0.0.186) Moving model to device: cuda:0 [repeated 3x across cluster]
(RayTrainWorker pid=4566, ip=10.0.0.188) Wrapping provided model in DistributedDataParallel. [repeated 3x across cluster]
(base)  ray@g-c2c56abb267620001:~/default$ 

Versions / Dependencies

nightly

Reproduction script

https://docs.ray.io/en/latest/train/examples/pytorch/pytorch_resnet_finetune.html
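For orientation, here is a minimal sketch of the shape of that example (not a verbatim copy): the config values come from the "Training config" table in the log above, the worker count and GPU usage come from the scheduler messages, and import paths may differ slightly between the nightly used here and other Ray versions.

```python
# Sketch only: the real example builds a torchvision ResNet-50, wraps it with
# ray.train.torch.prepare_model() / prepare_data_loader(), and reports loss/acc
# per epoch. The loop body is stubbed out here to keep the sketch self-contained.
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for _epoch in range(config["num_epochs"]):
        # The real loop fine-tunes the model and computes these metrics.
        train.report({"loss": 0.0, "acc": 0.0})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={
        "batch_size": 32,
        "input_size": 224,
        "lr": 0.001,
        "momentum": 0.9,
        "num_epochs": 10,
    },
    # The log shows 4 distributed training workers, each on a GPU node.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```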

Issue Severity

Low: It annoys or frustrates me.

scottsun94 commented 1 year ago

@rkooo567 this seems more like a core issue?

rkooo567 commented 1 year ago

This actually does seem like a bug, given that the block is printed about a minute after the same messages were originally printed. It is not clear whether it is a Core or a Tune issue (I remember Tune has a special way of printing logs, which could be related?). @scottsun94 can you provide a repro script and bump the priority?

rkooo567 commented 1 year ago

workspace to repro https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_7ll3v91wn7s1fuy4mucdtdki8m/ses_kv6qd59td8w6e5rd7lz81ki4np

krfricke commented 1 year ago

Tune just uses regular stdout/stderr logging. The only custom thing we do is sometimes duplicating logs to a file, but this is not the case here and shouldn't affect it. I also believe this is a core issue.
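Since the trailing block carries the "[repeated Nx across cluster]" suffix added by Ray's log deduplication, one way to narrow this down (suggested by the dedup notice in the log itself, not a confirmed fix) is to rerun the script with deduplication disabled and check whether the block is still emitted twice. A minimal sketch, assuming the variable is picked up by the driver process:

```python
# Diagnostic sketch, not a fix: disable Ray's log deduplication so worker lines
# are printed as they arrive, making it easier to see whether the trailing block
# is a late flush of the dedup buffer or genuinely re-emitted output.
import os

# Must be set before Ray starts; exporting RAY_DEDUP_LOGS=0 in the shell before
# launching the driver script has the same effect.
os.environ["RAY_DEDUP_LOGS"] = "0"

import ray  # noqa: E402 -- imported after setting the env var on purpose

ray.init()
# ...then run the same TorchTrainer workload as in the reproduction sketch above.
```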