Closed randywreed closed 2 years ago
Thanks for the report! That sounds like a reasonable fix. Do you want to make a PR with it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What's the status of this? Is there a workaround without editing the source?
No PR was raised to fix it, you should go ahead if you want to contribute :-)
I hit this issue recently when training. This error is very confusing to me because the error OSError: [Errno 39] Directory not empty: should not apply to shutil.rmtree, right? shutil.rmtree is designed to remove directories that have content in them (I just tested it out in a Python shell to make sure). So in this case, shutil.rmtree(checkpoint) throwing an error about the directory not being empty makes no sense to me. Can someone help explain what's going on (or agree that we are all confused about the behavior 😆)?
If we are OK with ignoring whatever error this is, I think we also need to add the ignore_errors flag to https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/trainer.py#L1969, since PR #20984 only changed the rmtree call inside the rotate_checkpoints function, and not the one in the inner_training_loop, which I assume was an oversight.
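For anyone following along, here is a minimal sketch of what passing ignore_errors=True to shutil.rmtree does. The helper name remove_checkpoint and the directory layout are hypothetical and only for illustration; this is not the Trainer code itself.

```python
import shutil
import tempfile
from pathlib import Path


def remove_checkpoint(checkpoint_dir, ignore_errors=True):
    """Hypothetical helper: delete a checkpoint directory.

    With ignore_errors=True, shutil.rmtree swallows failures
    (including "Directory not empty") instead of raising.
    """
    shutil.rmtree(checkpoint_dir, ignore_errors=ignore_errors)


with tempfile.TemporaryDirectory() as root:
    ckpt = Path(root) / "checkpoint-1"
    ckpt.mkdir()
    (ckpt / "weights.bin").write_text("dummy")

    remove_checkpoint(ckpt)  # removes the directory and its content
    remove_checkpoint(ckpt)  # already gone: no exception is raised
    still_exists = ckpt.exists()
```

The trade-off is that a failed deletion goes unnoticed, so a partially removed directory can be left on disk.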
Should not apply to shutil.rmtree, right?
@njbrake When you say "should not apply", do you mean it should not apply in this case, or that it can't apply (ever)? If you look at the traceback above, you can see that the error is being triggered in the shutil.rmtree call. There could be several reasons for this happening, e.g. files in the tree being open in other processes, or perhaps issues with file permissions.
The linked-to line is for v4.31.0, but it's still the same for the current release. If you think this should be changed, feel free to open a PR and we'd be happy to review!
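To illustrate how the "Directory not empty" error can surface at all: shutil.rmtree first deletes the entries it finds and then removes the directory itself with an rmdir call. If something adds a file in between, that final rmdir fails. The sketch below only reproduces the final step in isolation; whether another process writing into the checkpoint is what happens in the reported run is an assumption, not something confirmed in this thread.

```python
import errno
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    ckpt = os.path.join(root, "checkpoint-1")
    os.mkdir(ckpt)

    # Stand-in for a file that appears after rmtree has already
    # listed and deleted the directory's original entries.
    with open(os.path.join(ckpt, "late_file.bin"), "w") as f:
        f.write("written late")

    try:
        os.rmdir(ckpt)  # the kind of call rmtree ends with
        caught = None
    except OSError as exc:
        caught = exc.errno  # ENOTEMPTY on Linux (errno 39)

    print(caught == errno.ENOTEMPTY)
```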
I appreciate the question! I think my main confusion is that the error message the rmtree command is giving is about a directory not being empty, but the goal of the rmtree command in this context is to remove a directory that's not empty, so the error message makes no sense to me. I'll put up a PR to add ignore_errors to the other rmtree call; I just wanted to draw attention to the fact that there's something else happening behind the scenes that is still kind of unexplained.
System Info
Who can help?
@sgugger
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Create a txt file of sentences.
Run run_clm.py with the following parameters:
deepspeed --num_gpus=4 run_clm.py --deepspeed ds_config_gptj6b.json --model_name_or_path EleutherAI/gpt-j-6B --train_file Jesus_sayings.txt --do_train --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir ~/gpt-j/finetuned --num_train_epochs 5 --eval_steps 1 --gradient_accumulation_steps 32 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 2 --save_steps 1 --save_strategy steps --tokenizer_name gpt2
Error traceback:
Expected behavior
This causes the program to run without error but leaves behind ghost checkpoint directories with no content, though these are gradually pruned.
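If the leftover empty directories are a nuisance, they can be cleaned up after training. The following is a rough sketch, assuming the checkpoints are named checkpoint-<step> inside the output directory; prune_empty_checkpoints is a hypothetical helper, not part of transformers.

```python
import tempfile
from pathlib import Path


def prune_empty_checkpoints(output_dir):
    """Hypothetical helper: remove checkpoint-* directories with no content."""
    removed = []
    for path in sorted(Path(output_dir).glob("checkpoint-*")):
        if path.is_dir() and not any(path.iterdir()):
            try:
                path.rmdir()  # only succeeds on an empty directory
            except OSError:
                continue  # something wrote into it meanwhile; leave it alone
            removed.append(path.name)
    return removed


with tempfile.TemporaryDirectory() as root:
    output_dir = Path(root)
    (output_dir / "checkpoint-1").mkdir()  # ghost directory
    (output_dir / "checkpoint-2").mkdir()  # real checkpoint
    (output_dir / "checkpoint-2" / "weights.bin").write_text("dummy")

    removed = prune_empty_checkpoints(output_dir)
    kept = (output_dir / "checkpoint-2").exists()
```

Using rmdir rather than rmtree here means a directory that still holds files is never deleted by mistake.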