amholler opened this issue 2 years ago (status: Open)
Thanks Anne. There are two issues here: the checkpoints being lost, and the get_alive_node_ip logic.
On top of that, Tune currently does not officially support multiple Tune sessions running on the same cluster. But we could probably enhance this capability if we collect enough use cases that require it.
Let me know if I captured the issues correctly.
Also cc @richardliaw
Hi @xwjiang2010, in response to your last comment:
This ticket is specifically on the issue with checkpoints being lost. Thanks for your questions on that issue; I will answer your questions in a follow-on comment.
I have filed a separate ticket on the get_alive_node_ip issue, which is https://github.com/ray-project/ray/issues/21458 It would be great if you could get rid of that logic entirely, because it seems too racy.
Wrt Tune not officially supporting multiple Tune sessions on the same cluster, it does not seem to me that either of the above problems is actually related to that use case; it seems like they could occur with a single Ray Tune job running on a cluster.
Yes, great. Sorry I missed the other ticket. Let's keep the discussion separate on the two tickets.
And yes, these issues are orthogonal to the single vs. multiple Tune jobs question.
Wrt the size of checkpoints: I've seen the problem with the ames_housing dataset but not with higgs or forest_cover; below are examples of their uploaded sizes. ames_housing looks somewhat bigger, but not dramatically so.
I'm not setting checkpoint_period or sync_period, so my runs are getting the defaults. What does seem true is that the epochs complete much more quickly for ames_housing than for forest_cover or higgs, so presumably much more epoch-end stats reporting is happening. Perhaps the automatic adaptive tuning of the checkpoint period is actually too aggressive for this case. If you think I should experiment with setting sync_period and/or checkpoint_period explicitly (see the sketch after the listings below), let me know. Thanks!
(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/ames_housing/hours1/trainable_func_f6zONnP/trial_79924664/checkpoint_001459/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 11:55:59         83 checkpoint
2022-01-07 11:55:59      35350 model_hyperparameters.json
2022-01-07 11:55:59     584070 model_weights.data-00000-of-00001
2022-01-07 11:55:59      13165 model_weights.index
2022-01-07 11:55:59     985083 training_progress.json
2022-01-07 11:55:59      51110 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/forest_cover/hours1/trainable_func_fVfbX2v/trial_86415c74/checkpoint_000401/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 11:56:31         83 checkpoint
2022-01-07 11:56:31      12198 model_hyperparameters.json
2022-01-07 11:56:31     215517 model_weights.data-00000-of-00001
2022-01-07 11:56:31       8389 model_weights.index
2022-01-07 11:56:31     175827 training_progress.json
2022-01-07 11:56:31       4170 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/higgs/hours1/trainable_func_fI6YKzO/trial_30043e10/checkpoint_000016/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 12:05:20         83 checkpoint
2022-01-07 12:05:20      14674 model_hyperparameters.json
2022-01-07 12:05:20     226790 model_weights.data-00000-of-00001
2022-01-07 12:05:20       8273 model_weights.index
2022-01-07 12:05:20       8295 training_progress.json
2022-01-07 12:05:20       6428 training_set_metadata.json
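For concreteness, here's roughly what setting these explicitly might look like (a sketch only, assuming the sync_period field on tune.SyncConfig and the TUNE_GLOBAL_CHECKPOINT_S environment variable discussed in this thread; the exact SyncConfig fields may differ across Ray versions):

```python
import os

from ray import tune

# Experiment-state checkpoint period in seconds; when unset, Tune picks it adaptively.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "60"

sync_config = tune.SyncConfig(
    upload_dir="s3://predibase-runs/nodeless/ames_housing/hours1/",
    sync_period=600,  # seconds between periodic syncs; the default is 300
)

# ...then pass sync_config=sync_config to tune.run as usual.
```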
@amholler Yes, reducing the checkpoint frequency can really help here. Are you using class Trainables? If so, you can simply set checkpoint_freq through tune.run. Let me know how it goes!
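In case a concrete example helps, a minimal sketch of the class-API version (the training logic is a placeholder; the relevant knob is checkpoint_freq on tune.run):

```python
import os

from ray import tune


class MyTrainable(tune.Trainable):
    """Minimal class Trainable; the training logic is a stand-in."""

    def setup(self, config):
        self.loss = 1.0

    def step(self):
        self.loss *= 0.99  # stand-in for one epoch of training
        return {"loss": self.loss}

    def save_checkpoint(self, tmp_checkpoint_dir):
        # Tune calls this only every `checkpoint_freq` iterations.
        with open(os.path.join(tmp_checkpoint_dir, "loss.txt"), "w") as f:
            f.write(str(self.loss))
        return tmp_checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        with open(os.path.join(checkpoint_dir, "loss.txt")) as f:
            self.loss = float(f.read())


analysis = tune.run(
    MyTrainable,
    stop={"training_iteration": 100},
    checkpoint_freq=10,  # checkpoint every 10 iterations instead of every one
)
```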
Hi @xwjiang2010, AFAIK we aren't using class Trainables; we are using the function API. I tried setting TUNE_GLOBAL_CHECKPOINT_S, but that didn't work in this case. Explanation below.
I have put more tracing into the ames_housing runs, and I see more about what's going on. I will give an example tracing sequence below after giving some general observations about it.
In the steady state, the checkpoint syncing to the cloud is being called from _maybe_save_to_cloud, which does not respect the checkpoint period, i.e., it calls sync_up rather than sync_up_if_needed. It is invoked very often for this workload (presumably due to its quick epochs), like every few seconds.
And these sync invocations often fail for this workload, because the deletion of old/suboptimal checkpoints from the cloud (retention is 1) is also running very frequently and blocks the checkpoint sync, since only one sync client cmd can be running at a time. I don't know how/whether the checkpoint period is expected to apply to the delete operations, but their frequent execution can choke the syncs.
(ImplicitFunc pid=None) 2022-01-08 07:26:47,781 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:49,039 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:49,876 WARNING sync_client.py:275 -- Running delete: aws s3 rm s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610/checkpoint_000001 --recursive --only-show-errors File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
(ImplicitFunc pid=None) 2022-01-08 07:26:50,396 WARNING sync_client.py:323 -- Last sync client cmd still in progress, skipping. File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in
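To make the skip behavior above concrete, here is a toy sketch of skip-vs-wait (purely illustrative; this is not Ray's actual sync client code):

```python
import subprocess


class ToySyncClient:
    """Toy illustration of skip-vs-wait behavior; NOT Ray's actual sync client."""

    def __init__(self):
        self._proc = None

    def _busy(self):
        return self._proc is not None and self._proc.poll() is None

    def sync_up(self, cmd, wait_if_busy=False):
        if self._busy():
            if not wait_if_busy:
                # The behavior seen in the trace above: the new sync is dropped,
                # so whatever it would have uploaded may never reach S3.
                print("Last sync client cmd still in progress, skipping.")
                return False
            self._proc.wait()  # alternative: block until the previous cmd finishes
        self._proc = subprocess.Popen(cmd)
        return True
```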
Hi @amholler
Yes, for your use case I don't think any throttling is happening on the uploads to cloud from the syncing side.
Syncing is happening at the same frequency as checkpointing.
Could you reduce the frequency of checkpointing then? (It's not going to be checkpoint_freq, as that is only used for class Trainables.) Most likely, check how often you enter the with tune.checkpoint_dir(...) block.
Let me know how it goes. I will also update our docs, as they seem unclear.
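A rough sketch of what that looks like with the function API (the training loop is a stand-in; the point is entering the with tune.checkpoint_dir(...) block only every N iterations):

```python
import os

from ray import tune


def trainable_func(config):
    loss = 1.0
    for step in range(1000):
        loss *= 0.99  # stand-in for one quick epoch of training

        # Checkpoint only every 25 iterations instead of every epoch,
        # so the cloud syncer is invoked far less often.
        if step % 25 == 0:
            with tune.checkpoint_dir(step=step) as ckpt_dir:
                with open(os.path.join(ckpt_dir, "loss.txt"), "w") as f:
                    f.write(str(loss))

        tune.report(loss=loss)
```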
To capture some of the discussion that happened offline:
There are several paths going forward:
Whichever we choose among options 2-4, or a combination of them, we should probably first understand what the current bottleneck is (i.e., option 1).
My patch in sync_client to wait rather than skip when a client cmd is still in progress did not remove all instances where checkpoints were lost. I haven't tried to investigate this further.
Turning off checkpoint deletion, i.e., setting keep_checkpoints_num to None rather than 1 (which is hard-coded in Ludwig), did make the lost-checkpoint and sync-cmd-skipping problems go away in my test.
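For reference, the Tune-side setting in question looks roughly like this (a sketch; in our runs the value actually comes from Ludwig's hard-coded config rather than a direct tune.run call, and trainable_func stands in for the real training function):

```python
from ray import tune

# keep_checkpoints_num=None keeps every checkpoint and never issues the frequent
# `aws s3 rm` deletions seen in the trace above; Ludwig hard-codes the value to 1,
# which makes deletion run after nearly every new checkpoint.
analysis = tune.run(
    trainable_func,
    keep_checkpoints_num=None,
    stop={"training_iteration": 100},
)
```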
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Ray checkpoint sync can sometimes fail to upload checkpoints to S3; log spew about the sync client is also observed
Problem Summary
Checkpoints do not always seem to get uploaded reliably to S3 by Ray 1.9.1, so the best model produced by the hyperparameter search cannot always be pulled at the end of the search for evaluation.
In the case where I recently saw this (Ludwig tf-legacy AutoML for the ames_housing dataset), I also saw a spew of the following warning being output: WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping. The warning seems correlated with the problem, in that I didn't see such warnings in the other AutoML jobs running simultaneously, which all completed without missing their checkpoints.
From interaction on the problem with Kai Fricke on the Tune slack channel:
The Ray version is 1.9.1. The sync configuration is set up in Ludwig using this code: self.sync_config = tune.SyncConfig(sync_to_driver=False, upload_dir=output_directory), with the output directory set to 's3://predibase-runs/nodeless/ames_housing/hours1/'.
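For context, a sketch of how this sync config gets wired into the search (illustrative only: in practice Ludwig constructs the tune.run call, the trainable, scheduler, and budget shown here are stand-ins, and the SyncConfig fields mirror the snippet above and may differ across Ray versions):

```python
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler


def trainable_func(config):
    # Stand-in for the Ludwig training function.
    tune.report(validation_loss=0.0)


output_directory = "s3://predibase-runs/nodeless/ames_housing/hours1/"
sync_config = tune.SyncConfig(sync_to_driver=False, upload_dir=output_directory)

analysis = tune.run(
    trainable_func,
    sync_config=sync_config,
    scheduler=AsyncHyperBandScheduler(metric="validation_loss", mode="min"),
    time_budget_s=3600,  # the "hours1" runs use a 1-hour budget
    num_samples=10,
)
```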
Some of the checkpoints from this run did get uploaded successfully (I can see them in S3), including checkpoints from other trials on the same worker node, so I'm not sure what is going on.
I did not manually set the checkpointing interval. One thing I can mention is that the hyperparameter search is running async_hyperband with a time budget, so trials can get stopped if they are considered unpromising. Is it at all possible that a trial would get stopped by async_hyperband while it was trying to sync and the sync could hang or something?
Observations from Repro of problem
After I observed the problem, I updated Ludwig AutoML to make it more resilient to not finding the checkpoints for some trials at the end of the run, and reran the test case. The issue with some checkpoints not getting uploaded for some trials (2 out of 8) reproduced but Ludwig was able to analyze the checkpoints that were uploaded & produce a usable report.
Note, regarding the spew problem: the attempts to run the sync client cmd are very close together in time, which is surprising since I thought sync_period defaults to 300 seconds. I did not set TUNE_GLOBAL_CHECKPOINT_S, so presumably it is using the auto setting; I wonder if that is what is causing the spew of attempted sync client cmds? The spew seems to be at a higher rate than the epoch result reporting. Just to give a feel for this, here's the first page of spew along with trial results reporting:
(base) ray@example-cluster-ray-head-fh8jv:~/experiments/automl/validation$ grep -e WARNING -e "Result for" ames_housing/run.1hr |more
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:39,426 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,037 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,644 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,192 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,731 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,273 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,836 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,389 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,957 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:44,540 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:45,057 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,281 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,935 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:48,599 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,171 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,749 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,306 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,820 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,049 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,747 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,373 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,963 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,026 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,613 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:58,177 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:00,622 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:01,291 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:03,430 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,117 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,709 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,303 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,920 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:07,936 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:09,444 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,096 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,636 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Versions / Dependencies
Ray 1.9.1
Reproduction script
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/ames_housing/run_auto_train_1hr_nodeless.py run with ludwig tf-legacy
Anything else
No response
Are you willing to submit a PR?