ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] [Bug] Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed #21469

Open amholler opened 2 years ago

amholler commented 2 years ago

Search before asking

Ray Component

Ray Tune

What happened + What you expected to happen

Ray checkpoint sync can sometimes fail to upload checkpoints to s3, plus log spew about sync client observed


Problem Summary

Checkpoints do not always get uploaded reliably to s3 by Ray 1.9.1, so the best model produced by the hyperparameter search cannot always be pulled at the end of the search for evaluation.

In the case where I recently saw this (Ludwig tf-legacy AutoML for the ames_housing dataset), I also saw a spew of the following warning being output: WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping. The warning seems correlated with the problem, in that I didn't see such warnings in the other AutoML jobs running simultaneously, all of which completed without missing their checkpoints.


From interaction on problem w/Kai Fricke on tune slack channel:

For syncing it would be good to know which Ray version you are using and what your sync configuration looks like at the moment. Please feel free to add me to the discussion so I can help look into this.

The Ray version is 1.9.1. The sync configuration is set up in Ludwig using this code: self.sync_config = tune.SyncConfig(sync_to_driver=False, upload_dir=output_directory), with output_directory set to 's3://predibase-runs/nodeless/ames_housing/hours1/'.
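For readability, here is a minimal sketch of roughly that setup as it might be wired into tune.run; the trainable below is a placeholder rather than the actual Ludwig code:

```python
from ray import tune


def trainable(config):  # placeholder for the Ludwig training function
    tune.report(loss=0.0)


output_directory = "s3://predibase-runs/nodeless/ames_housing/hours1/"

sync_config = tune.SyncConfig(
    sync_to_driver=False,         # don't sync trial results back to the driver node
    upload_dir=output_directory,  # mirror trial directories under this S3 prefix
)

tune.run(trainable, sync_config=sync_config)
```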

Some of the checkpoints from this run did get uploaded successfully (I can see them in s3), including checkpoints from other trials on the same worker node, so I'm not sure what is going on.

I did not manually set the checkpointing interval. One thing I can mention is that the hyperparameter search is running async_hyperband with a time budget, so trials can get stopped if they are considered unpromising. Is it at all possible that a trial would get stopped by async_hyperband while it was trying to sync and the sync could hang or something?
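For context, a rough sketch (Ray 1.9-era API, placeholder names, not the actual Ludwig AutoML code) of the kind of setup described, async_hyperband with a time budget:

```python
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler


def trainable(config):  # placeholder training function
    for epoch in range(1000):
        tune.report(loss=1.0 / (epoch + 1))


# Unpromising trials can be stopped early by the scheduler, and the whole
# search ends once the time budget is exhausted.
tune.run(
    trainable,
    scheduler=AsyncHyperBandScheduler(metric="loss", mode="min"),
    time_budget_s=3600,  # 1-hour budget, matching the hours1 runs discussed here
    num_samples=8,
)
```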


Observations from Repro of problem

After I observed the problem, I updated Ludwig AutoML to make it more resilient to not finding the checkpoints for some trials at the end of the run, and reran the test case. The issue of checkpoints not getting uploaded for some trials (2 out of 8) reproduced, but Ludwig was able to analyze the checkpoints that were uploaded and produce a usable report.

Note on the spew problem: the attempts to run the sync client cmd are very close together in time, which is surprising, since I thought sync_period defaults to 300 seconds. I did not set TUNE_GLOBAL_CHECKPOINT_S, so presumably it is using the auto setting; is that what is causing the spew of attempted sync client cmds, I wonder? The spew rate is higher than the rate at which epochs report results. Just to give a feel for this, here's the first page of spew along with the trial results reporting:

(base) ray@example-cluster-ray-head-fh8jv:~/experiments/automl/validation$ grep -e WARNING -e "Result for" ames_housing/run.1hr |more
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:39,426 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,037 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:40,644 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,192 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:41,731 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,273 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:42,836 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
Result for trial_79924664:
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,389 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:43,957 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:44,540 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:45,057 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,281 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:47,935 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:48,599 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,171 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:49,749 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,306 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:50,820 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,049 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:53,747 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,373 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:54,963 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,026 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:57,613 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:56:58,177 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:00,622 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:01,291 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:03,430 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,117 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:04,709 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,303 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:05,920 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:07,936 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:09,444 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,096 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=28222) 2022-01-07 10:57:10,636 WARNING sync_client.py:320 -- Last sync client cmd still in progress, skipping.

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/ames_housing/run_auto_train_1hr_nodeless.py, run with Ludwig tf-legacy

Anything else

No response

Are you willing to submit a PR?

xwjiang2010 commented 2 years ago

Thanks Anne. There are two issues:

  1. Checkpoints lost. I have a few questions/comments on this.
    • What is the checkpoint frequency and how large are the checkpoints in your case? This sometimes happens when checkpointing too often with checkpoints that are too large.
    • We are thinking of queueing checkpoint syncs instead of just bailing out when one is still in progress. This is something we'd like to enhance in Q1. We also probably want to shift to using the boto client instead of command-based syncing. cc @krfricke
  2. get_alive_node_ip information is stale. On this, I am looking at whether we could get rid of this logic entirely.

On top of that, Tune currently does not officially support multiple Tune sessions running on the same cluster. But we could probably enhance this capability if we collect enough use cases requiring it.

Let me know if I captured the issues correctly.

Also cc @richardliaw

amholler commented 2 years ago

Hi, @xwjiang2010, In response to your last comment:

This ticket is specifically on the issue with checkpoints being lost. Thanks for your questions on that issue; I will answer your questions in a follow-on comment.

I have filed a separate ticket on the get_alive_node_ip issue, which is https://github.com/ray-project/ray/issues/21458. It would be great if you could get rid of that logic entirely, because it seems too racy.

Wrt Tune not officially supporting multiple Tune sessions on the same cluster, it does not seem to me that either of the above problems is actually related to that use case; it seems like they could occur with a single Ray Tune job running on a cluster.

xwjiang2010 commented 2 years ago

Yes, great. Sorry I missed the other ticket. Let's keep the discussion separate on the two tickets.

And yes, these issues are orthogonal to the single vs. multiple Tune job question.

amholler commented 2 years ago

Wrt the size of checkpoints, I've seen the problem with the ames_housing dataset and I haven't seen it with higgs or forest_cover; below are examples of their uploaded sizes. Maybe ames_housing looks somewhat bigger, but the difference doesn't look dramatic to me.

I'm not setting the checkpoint_period or sync_period, so my runs are getting the defaults. What does seem to be true is that epochs complete much more quickly for ames_housing than for forest_cover or higgs, so presumably much more epoch-end stats reporting is happening. Perhaps the automatic adaptive tuning of the checkpoint period is too aggressive for this case. If you think I should experiment with setting sync_period and/or checkpoint_period explicitly, let me know (see the illustrative sketch after the listings below). Thanks!

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/ames_housing/hours1/trainable_func_f6zONnP/trial_79924664/checkpoint_001459/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 11:55:59         83 checkpoint
2022-01-07 11:55:59      35350 model_hyperparameters.json
2022-01-07 11:55:59     584070 model_weights.data-00000-of-00001
2022-01-07 11:55:59      13165 model_weights.index
2022-01-07 11:55:59     985083 training_progress.json
2022-01-07 11:55:59      51110 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/forest_cover/hours1/trainable_func_fVfbX2v/trial_86415c74/checkpoint_000401/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 11:56:31         83 checkpoint
2022-01-07 11:56:31      12198 model_hyperparameters.json
2022-01-07 11:56:31     215517 model_weights.data-00000-of-00001
2022-01-07 11:56:31       8389 model_weights.index
2022-01-07 11:56:31     175827 training_progress.json
2022-01-07 11:56:31       4170 training_set_metadata.json

(base) MacBook-Air-2:forest_cover anne$ aws s3 ls s3://predibase-runs/nodeless/higgs/hours1/trainable_func_fI6YKzO/trial_30043e10/checkpoint_000016/model/
                           PRE logs/
                           PRE training_checkpoints/
2022-01-07 12:05:20         83 checkpoint
2022-01-07 12:05:20      14674 model_hyperparameters.json
2022-01-07 12:05:20     226790 model_weights.data-00000-of-00001
2022-01-07 12:05:20       8273 model_weights.index
2022-01-07 12:05:20       8295 training_progress.json
2022-01-07 12:05:20       6428 training_set_metadata.json
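For the experiment mentioned above, a hedged sketch of what setting these knobs explicitly might look like; sync_period as a SyncConfig field and the 600-second value are assumptions for illustration, not settings confirmed in this thread:

```python
import os

from ray import tune

# Per the discussion above, TUNE_GLOBAL_CHECKPOINT_S controls how often Tune
# persists experiment state; it must be set before tune.run starts.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "300"

sync_config = tune.SyncConfig(
    upload_dir="s3://predibase-runs/nodeless/ames_housing/hours1/",
    sync_to_driver=False,
    sync_period=600,  # assumed knob; the default is reported as 300 seconds
)
# sync_config would then be passed to tune.run(..., sync_config=sync_config).
```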

xwjiang2010 commented 2 years ago

@amholler Yes, reducing the checkpoint frequency can really help here. Are you using class Trainables? If so, you can simply set checkpoint_freq through tune.run. Let me know how it goes!
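For illustration, roughly how checkpoint_freq is passed through tune.run for a class Trainable (a sketch with a placeholder Trainable, not the Ludwig code):

```python
from ray import tune


class MyTrainable(tune.Trainable):  # placeholder class Trainable
    def step(self):
        return {"loss": 1.0}

    def save_checkpoint(self, checkpoint_dir):
        return checkpoint_dir  # a real Trainable would write model state here

    def load_checkpoint(self, checkpoint_path):
        pass


# Checkpoint every 10th training iteration instead of every iteration.
tune.run(MyTrainable, checkpoint_freq=10, checkpoint_at_end=True)
```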

amholler commented 2 years ago

Hi @xwjiang2010, AFAIK we aren't using class Trainables; we are using the function API. I tried setting TUNE_GLOBAL_CHECKPOINT_S, but that didn't work in this case. Explanation below.

I have put more tracing into the ames_housing runs, and I see more about what's going on. I will give an example tracing sequence below after giving some general observations about it.

In the steady state, the checkpoint syncing to the cloud is being called from _maybe_save_to_cloud, which does not respect the checkpoint period, i.e., it calls sync_up rather than sync_up_if_needed. It is invoked very often for this workload (presumably due to its quick epochs), like every few seconds.

And these sync invocations often fail for this workload, because the deletion of old/suboptimal (retention 1) checkpoints from the cloud is also running very frequently and hence blocks the checkpoint sync, since only one sync client cmd can be running at a time. I don't know how/whether the checkpoint period is expected to apply to the delete operations, but their frequent execution can choke the syncs.
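To illustrate the throttling difference described above, here is a simplified sketch (not the actual Ray source) of how a sync_up_if_needed path respects a sync period while a direct sync_up call does not:

```python
import time


class SimplifiedSyncClient:
    """Illustration only; the real logic lives in Ray Tune's sync client/syncer."""

    def __init__(self, sync_period: float = 300.0):
        self.sync_period = sync_period
        self.last_sync_time = float("-inf")

    def sync_up_if_needed(self) -> bool:
        # Throttled path: skip unless sync_period seconds have elapsed.
        if time.time() - self.last_sync_time < self.sync_period:
            return False
        return self.sync_up()

    def sync_up(self) -> bool:
        # Unthrottled path: _maybe_save_to_cloud calls this on every checkpoint,
        # so quick epochs translate into a sync command every few seconds.
        self.last_sync_time = time.time()
        print("launching: aws s3 sync <local trial dir> <remote trial dir>")
        return True
```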

(ImplicitFunc pid=None) 2022-01-08 07:26:47,781 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:49,039 WARNING sync_client.py:346 -- Running sync: aws s3 sync /home/ray/ray_results/trainable_func_fk4_l_i/trial_5677d610/ s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610 --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:49,876 WARNING sync_client.py:275 -- Running delete: aws s3 rm s3://predibase-elotl/baseline/ames_housing/hours1/trainable_func_fk4_l_i/trial_5677d610/checkpoint_000001 --recursive --only-show-errors
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 504, in delete_checkpoint
(ImplicitFunc pid=None)     self.storage_client.delete(self._storage_path(checkpoint_dir))
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 269, in delete
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

(ImplicitFunc pid=None) 2022-01-08 07:26:50,396 WARNING sync_client.py:323 -- Last sync client cmd still in progress, skipping.
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 218, in <module>
(ImplicitFunc pid=None)     ray.worker.global_worker.main_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 431, in main_loop
(ImplicitFunc pid=None)     self.core_worker.run_task_loop()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
(ImplicitFunc pid=None)     return method(__ray_actor, *args, **kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 452, in save
(ImplicitFunc pid=None)     self._maybe_save_to_cloud()
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(ImplicitFunc pid=None)     return method(self, *_args, **_kwargs)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 416, in _maybe_save_to_cloud
(ImplicitFunc pid=None)     self.remote_checkpoint_dir)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 263, in sync_up
(ImplicitFunc pid=None)     return self._execute(self.sync_up_template, source, target, exclude)
(ImplicitFunc pid=None)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 321, in _execute
(ImplicitFunc pid=None)     stack_trace_str = "".join(traceback.extract_stack().format())

xwjiang2010 commented 2 years ago

Hi @amholler, yes, for your use case I don't think there is any throttling happening for uploading to the cloud from the syncing perspective. Syncing happens at the same frequency as checkpointing. Could you reduce the frequency of checkpointing then? (It's not going to be checkpoint_freq, as that is only used for class Trainables.) Most likely, check how often you call tune.checkpoint_dir...
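For the function API, throttling the checkpoint frequency would look roughly like the sketch below (placeholder training code; the every-10-epochs interval is arbitrary):

```python
import os

from ray import tune


def train_fn(config, checkpoint_dir=None):
    for epoch in range(100):
        loss = 1.0 / (epoch + 1)  # placeholder training step
        # Checkpoint every 10th epoch instead of every epoch to reduce the
        # number of cloud sync commands Tune has to launch.
        if epoch % 10 == 0:
            with tune.checkpoint_dir(step=epoch) as cp_dir:
                with open(os.path.join(cp_dir, "model.txt"), "w") as f:
                    f.write(str(loss))
        tune.report(loss=loss)
```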

Let me know how it goes. I will also update our docs, as they seem unclear.

xwjiang2010 commented 2 years ago

To capture some of the discussion that happened offline:

There are several paths going forward:

  1. Try to optimize sync up/delete and see if we can keep up with per-epoch checkpointing/deleting (Xiaowei to do some benchmarking). Let's also see whether deleting can be done in a batched way, and whether that improves anything.
  2. Tune to throttle syncing.
  3. User to be responsible for setting a reasonable checkpointing frequency (in which case, Tune should still probably provide some guidelines).
  4. User can choose reliable per-epoch checkpointing at the cost of performance.

Whichever we choose among 2-4, or a combination of them, we should probably first understand what the current bottleneck is (i.e., item 1).

amholler commented 2 years ago