ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] Checkpoint pathing issue when restoring using PBT Scheduler #16714

Closed The-obsrvr closed 3 years ago

The-obsrvr commented 3 years ago

What is the problem?

I am not sure why, but Ray Tune is misreading the checkpoint_dir path and unnecessarily adding "/./" to it. I am relatively new to Ray Tune, so if I have missed something, please let me know. The models are saved correctly in their folders; the issue only appears when Ray Tune tries to restore the checkpoint.

    Traceback (most recent call last):
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 734, in _process_trial
        results = self.trial_executor.fetch_result(trial)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 711, in fetch_result
        result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
        return func(*args, **kwargs)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/worker.py", line 1501, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=667115, ip=10.153.51.169)
      File "python/ray/_raylet.pyx", line 535, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 485, in ray._raylet.execute_task.function_executor
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/trainable.py", line 175, in train_buffered
        result = self.train()
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/trainable.py", line 234, in train
        result = self.step()
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
        self._report_thread_runner_error(block=True)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
        raise TuneError(
    ray.tune.error.TuneError: Trial raised an exception. Traceback:
    ray::ImplicitFunc.train_buffered() (pid=667115, ip=10.153.51.169)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
        self._entrypoint()
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
        return self._trainable_func(self.config, self._status_reporter,
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 580, in _trainable_func
        output = fn()
      File "src/hpo_using_ray.py", line 91, in train_bert
        with open(model_name_or_path) as checkpt:
    IsADirectoryError: [Errno 21] Is a directory: '...22-45/checkpoint_tmp75ff46/./best_model-947'

The correct path should be: '...22-45/checkpoint_tmp75ff46/best_model-947' (This path exists)

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 2.0.0.dev0
Python: 3.8
PyTorch: 1.9
OS: Linux

Apologies if I have missed any relevant information.

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

mvindiola1 commented 3 years ago

Hi @The-obsrvr,

I think this error indicates that you have provided a path to a directory rather than a file. To restore the model, Tune expects the path to a checkpoint file, not the directory containing it. Have you verified that the path you are providing is actually a file?
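
For example, a restore along these lines reads a checkpoint file rather than a directory (a sketch only; the "checkpoint" file name is a placeholder, not something taken from your code):

    import os
    import torch

    def restore(checkpoint_dir):
        # checkpoint_dir is the directory Tune hands back; the artifact to load
        # is a file inside it ("checkpoint" is a placeholder name).
        path = os.path.join(checkpoint_dir, "checkpoint")
        if not os.path.isfile(path):
            raise FileNotFoundError("Expected a checkpoint file, got: " + path)
        model_state, optimizer_state = torch.load(path)
        return model_state, optimizer_state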

The-obsrvr commented 3 years ago

I printed the "checkpoint_dir" path at the start of my trainable function to see what it looks like. As far as I know, this path is generated by Ray Tune and not defined by me, and the printed path already contains the issue, so I feel Ray Tune itself is mishandling the folder path for some reason. What is interesting is that the problem seems to occur only for folders named "checkpoint_tmp{}", i.e. the temporary checkpoints. My other checkpoints work fine.

mvindiola1 commented 3 years ago

Do you have a minimal reproduction script with the issue you can share?

The-obsrvr commented 3 years ago

My code is quite big since we are doing multi-task learning with multiple datasets and architectures. If it helps, my code for saving and restoring checkpoints is the same as shown in this example: https://colab.research.google.com/drive/1tQgAKgcKQzheoh503OzhS4N9NtfFgmjF?usp=sharing

The only difference is that I moved the checkpoint-saving code into my early-stopping call. Please let me know if you require any further information. I also feel my issue is somewhat similar to what was discussed in https://github.com/ray-project/ray/issues/8772#issuecomment-644445940, but I didn't understand the solution there.
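
If I understand that example correctly, the overall save/restore pattern looks roughly like this (a simplified sketch, not my actual code; the toy model, the "checkpoint" file name, and the reported metric are placeholders):

    import os
    import torch
    import torch.nn as nn
    from ray import tune

    def train_fn(config, checkpoint_dir=None):
        # Placeholder model and optimizer, just to make the sketch concrete.
        model = nn.Linear(4, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

        if checkpoint_dir:
            # Restore from a file *inside* the directory Tune passes in.
            model_state, optimizer_state = torch.load(
                os.path.join(checkpoint_dir, "checkpoint"))
            model.load_state_dict(model_state)
            optimizer.load_state_dict(optimizer_state)

        for step in range(10):
            loss = model(torch.randn(8, 4)).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            with tune.checkpoint_dir(step=step) as ckpt_dir:
                # Save to a file inside the checkpoint directory,
                # not to the directory itself.
                torch.save((model.state_dict(), optimizer.state_dict()),
                           os.path.join(ckpt_dir, "checkpoint"))
            tune.report(loss=loss.item())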

The-obsrvr commented 3 years ago

Also, not sure if it helps, but when I change the path to point to the checkpoint file rather than its directory, I get the following error:

    Traceback: ray::ImplicitFunc.train_buffered() (pid=726755, ip=10.153.51.169)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
        self._entrypoint()
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
        return self._trainable_func(self.config, self._status_reporter,
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/ray/tune/function_runner.py", line 580, in _trainable_func
        output = fn()
      File "src/hpo_using_ray.py", line 92, in train_bert
        model_state, optimizer_state = torch.load(checkpt)
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/torch/serialization.py", line 595, in load
        if _is_zipfile(opened_file):
      File "/mnt/data2/Sid/arg_quality/pytorch/argQ_env2/lib/python3.8/site-packages/torch/serialization.py", line 57, in _is_zipfile
        byte = f.read(1)
      File "/usr/lib/python3.8/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

The position it reports corresponds to the "/./" that it sees in the path. I still don't know where this dot is coming from.
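
For reference, torch.load expects either a file path or a handle opened in binary mode, so a minimal sketch of the load would be (the path below is hypothetical):

    import torch

    checkpoint_path = "/tmp/checkpoint_tmp75ff46/best_model-947/checkpoint"  # hypothetical path

    # Either pass the path directly ...
    model_state, optimizer_state = torch.load(checkpoint_path)

    # ... or open the file in binary mode before handing it to torch.load.
    with open(checkpoint_path, "rb") as f:
        model_state, optimizer_state = torch.load(f)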

richardliaw commented 3 years ago

/./ generally is resolved correctly by the operating system (and is actually just ignored). Would you show me your tune.checkpoint_dir call?
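
For example, both of these normalize to the same location (hypothetical paths):

    import os

    # A "." path component is a no-op; the OS treats both strings as the same location.
    with_dot = "/tmp/checkpoint_tmp75ff46/./best_model-947"
    without_dot = "/tmp/checkpoint_tmp75ff46/best_model-947"
    print(os.path.normpath(with_dot) == os.path.normpath(without_dot))  # True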

The-obsrvr commented 3 years ago
                    with tune.checkpoint_dir(step=self.global_step) as checkpoint_dir:
                        output_dir = os.path.join(
                            checkpoint_dir,
                            "best_model-{}".format(self.global_step))
                        os.makedirs(output_dir, exist_ok=True)

                early_stopping(avg_val_loss, self.model, optimizer, output_dir)

In my early-stopping object, I save the model to the output_dir path using torch.save().
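
Roughly along these lines inside the early-stopping object (a simplified sketch; the "checkpoint" file name is a placeholder for how the file is actually named):

    import os
    import torch

    def save_checkpoint(model, optimizer, output_dir):
        # Simplified: write the model and optimizer states to a file inside output_dir.
        path = os.path.join(output_dir, "checkpoint")  # placeholder file name
        torch.save((model.state_dict(), optimizer.state_dict()), path)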

mvindiola1 commented 3 years ago

Hi @The-obsrvr,

I just tested the Google Colab notebook you shared, after updating the pip install for ray to 1.4.0 and removing the wandb decorator, and it ran without issue for me. Without more code to look at, I do not think I can help. Perhaps there is a change you could make to the Colab notebook that would reproduce the error you are seeing?

The-obsrvr commented 3 years ago

It worked. You were right that the "/./" was not really an issue; the problem actually stemmed from my pathing. Thank you so much for the help.

mvindiola1 commented 3 years ago

Glad it is working. Does that mean this issue can be closed?

The-obsrvr commented 3 years ago

Yes, sorry for the late reply.