huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

OSError Directory not empty error in Trainer.py on checkpoint replacement #17265

Closed · randywreed closed this 2 years ago

randywreed commented 2 years ago

System Info

- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.10.1 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: 4
- Using distributed or parallel set-up in script?: deepspeed

Who can help?

@sgugger

Reproduction

Create a txt file of sentences.
Run run_clm.py with the following parameters:
deepspeed --num_gpus=4 run_clm.py --deepspeed ds_config_gptj6b.json --model_name_or_path EleutherAI/gpt-j-6B --train_file Jesus_sayings.txt --do_train --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir ~/gpt-j/finetuned --num_train_epochs 5 --eval_steps 1 --gradient_accumulation_steps 32 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 2 --save_steps 1 --save_strategy steps --tokenizer_name gpt2

Error traceback:

[INFO|modeling_utils.py:1546] 2022-05-15 18:25:49,903 >> Model weights saved in /home/ubuntu/gpt-j/finetuned/checkpoint-3/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-05-15 18:25:49,911 >> tokenizer config file saved in /home/ubuntu/gpt-j/finetuned/checkpoint-3/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-05-15 18:25:49,917 >> Special tokens file saved in /home/ubuntu/gpt-j/finetuned/checkpoint-3/special_tokens_map.json
[2022-05-15 18:26:00,522] [INFO] [engine.py:3177:save_16bit_model] Saving model weights to /home/ubuntu/gpt-j/finetuned/checkpoint-3/pytorch_model.bin
[2022-05-15 18:26:26,263] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /home/ubuntu/gpt-j/finetuned/checkpoint-3/global_step3/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-05-15 18:27:44,462] [INFO] [engine.py:3063:_save_zero_checkpoint] zero checkpoint saved /home/ubuntu/gpt-j/finetuned/checkpoint-3/global_step3/zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO|trainer.py:2424] 2022-05-15 18:27:46,523 >> Deleting older checkpoint [/home/ubuntu/gpt-j/finetuned/checkpoint-1] due to args.save_total_limit
Traceback (most recent call last):
  File "run_clm.py", line 575, in <module>
    main()
  File "run_clm.py", line 523, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1320, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1634, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1805, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1964, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2425, in _rotate_checkpoints
    shutil.rmtree(checkpoint)
  File "/usr/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.8/shutil.py", line 659, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 657, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'global_step1'
  4%|██▌                                                         | 3/70 [21:59<8:11:00, 439.71s/it]
[2022-05-15 18:27:50,264] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 78507
[2022-05-15 18:27:50,265] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 78508
[2022-05-15 18:27:50,265] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 78509
[2022-05-15 18:27:50,266] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 78510
[2022-05-15 18:27:50,267] [ERROR] [launch.py:184:sigkill_handler] ['/usr/bin/python3', '-u', 'run_clm.py', '--local_rank=3', '--deepspeed', 'ds_config_gptj6b.json', '--model_name_or_path', 'EleutherAI/gpt-j-6B', '--train_file', 'Jesus_sayings.txt', '--do_train', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', '/home/ubuntu/gpt-j/finetuned', '--num_train_epochs', '5', '--eval_steps', '1', '--gradient_accumulation_steps', '32', '--per_device_train_batch_size', '1', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10', '--save_total_limit', '2', '--save_steps', '1', '--save_strategy', 'steps', '--tokenizer_name', 'gpt2'] exits with return code = 1

Expected behavior

The old checkpoint should be deleted without error.

Workaround:
I changed trainer.py line 2425 to:

shutil.rmtree(checkpoint, ignore_errors=True)

This lets the program run without error, but it leaves behind empty "ghost" checkpoint directories, though these are gradually pruned.
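A middle ground would be a small retry wrapper around rmtree, which tolerates a transient race but still surfaces persistent failures instead of silently leaving empty directories behind. This is a sketch of an alternative workaround, not transformers code; the function name and retry parameters are made up:

import shutil
import time

def rmtree_with_retries(path, retries=3, delay=1.0):
    # Retry a few times so a transient race (e.g. another process
    # still writing into the checkpoint dir) does not kill training.
    # If the directory still cannot be removed, re-raise rather than
    # leave an empty "ghost" directory behind silently.
    for attempt in range(retries):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)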

sgugger commented 2 years ago

Thanks for the report! That sounds like a reasonable fix. Do you want to make a PR with it?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

lboxell commented 2 years ago

What's the status of this? Is there a workaround without editing the source?

sgugger commented 2 years ago

No PR was raised to fix it, you should go ahead if you want to contribute :-)

njbrake commented 2 months ago

I hit this issue recently while training. The error is very confusing to me, because

OSError: [Errno 39] Directory not empty:

should not apply to shutil.rmtree, right? shutil.rmtree is designed to remove directories that have content in them (I just tested it in a Python shell to make sure).
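For reference, a quick version of that shell check (a minimal sketch using throwaway temp paths):

import os, shutil, tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "file.txt"), "w").close()  # directory is now non-empty
shutil.rmtree(d)  # succeeds: rmtree is meant for non-empty directories
print(os.path.exists(d))  # False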

So shutil.rmtree(checkpoint) throwing an error about the directory not being empty makes no sense to me. Can someone help explain what's going on (or agree that we are all confused about the behavior 😆)?

If we are OK with ignoring whatever error this is, I think we also need to add the ignore_errors flag to https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/trainer.py#L1969, since PR #20984 only changed the call inside _rotate_checkpoints, not the one in _inner_training_loop, which I assume was an accident. A sketch of what I mean follows.
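For illustration only (a hypothetical helper, not the actual transformers diff; the name and signature are made up):

import shutil

def delete_checkpoint(checkpoint_dir):
    # Mirror what PR #20984 did in _rotate_checkpoints: tolerate
    # deletion races instead of letting a failed rmtree abort training.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)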

amyeroberts commented 2 months ago

> Should not apply to shutil.rmtree, right?

@njbrake When you say "should not apply", do you mean it should not apply in this case, or that it can't apply (ever)? If you look at the traceback above, you can see that the error is triggered inside the shutil.rmtree call. There could be several reasons for this happening, e.g. files in the tree being open in other processes, or perhaps issues with file permissions.

The linked line is from v4.31.0, but it is still the same in the current release. If you think this should be changed, feel free to open a PR and we'd be happy to review!
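One concrete way this can happen: shutil.rmtree lists a directory's entries, unlinks them, and only then calls os.rmdir. If another process (for example, another DeepSpeed rank still writing checkpoint shards) creates a file in that window, os.rmdir fails with ENOTEMPTY, which is exactly the "[Errno 39] Directory not empty" in the traceback. A minimal, nondeterministic sketch of that race (all names are throwaway; it may take many attempts to trigger):

import errno
import os
import shutil
import tempfile
import threading

root = tempfile.mkdtemp()
sub = os.path.join(root, "global_step1")
stop = threading.Event()

def writer():
    # Keep creating files in the subdirectory, as a second process
    # writing checkpoint files might.
    i = 0
    while not stop.is_set():
        try:
            open(os.path.join(sub, f"f{i}"), "w").close()
        except FileNotFoundError:
            pass  # directory was just deleted; keep trying
        i += 1

t = threading.Thread(target=writer)
t.start()
try:
    for attempt in range(1000):
        os.makedirs(sub, exist_ok=True)
        try:
            shutil.rmtree(root)  # races against writer()
        except OSError as e:
            if e.errno == errno.ENOTEMPTY:
                print(f"reproduced on attempt {attempt}: {e}")
                break
            raise
finally:
    stop.set()
    t.join()
    shutil.rmtree(root, ignore_errors=True)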

njbrake commented 2 months ago

> When you say "should not apply", do you mean it should not apply in this case, or that it can't apply (ever)?

I appreciate the question! My main confusion is that the error message from the rmtree call is about a directory not being empty, yet the whole point of rmtree in this context is to remove a directory that is not empty, so the message makes no sense to me. I'll put up a PR to add ignore_errors to the other rmtree call; I just wanted to draw attention to the fact that something else is happening behind the scenes that is still unexplained.