unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Missing `_unsloth_temporary_saved_buffers` #319

Open jlin816 opened 5 months ago

jlin816 commented 5 months ago

I'm getting the following error. I think it's probably because I'm running two training runs on the same machine, and both might try to create and delete the temporary folder at around the same time, so the run that lags slightly behind can no longer find it. I haven't validated that this is what's happening, but I hope the detail is helpful!

Traceback (most recent call last):
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/pdb.py", line 1723, in main
    pdb._runscript(mainpyfile)
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/pdb.py", line 1583, in _runscript
    self.run(statement)
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/bdb.py", line 598, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/jessy/projects/llm-objectives/sandbox/multistep_exp_faster.py", line 459, in <module>
    multistep_exp(args)
  File "/home/jessy/projects/llm-objectives/sandbox/multistep_exp_faster.py", line 238, in multistep_exp
    model = train_model(args, model, tokenizer, next_dataset, logdir, model_save_path)
  File "/home/jessy/projects/llm-objectives/sandbox/multistep_exp_faster.py", line 426, in train_model
    model.save_pretrained_merged(model_save_path, tokenizer, save_method = "merged_16bit")
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/site-packages/unsloth/save.py", line 980, in unsloth_save_pretrained_merged
    unsloth_save_model(**arguments)
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/site-packages/unsloth/save.py", line 632, in unsloth_save_model
    shutil.rmtree(temporary_location)
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/jessy/.miniconda3/envs/unsloth/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '_unsloth_temporary_saved_buffers'

danielhanchen commented 5 months ago

@jlin816 Oh thanks for that!! Hmm I might leave the folder as is then! I'll add a check to not randomly delete the folder :)
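
For illustration, a check like the one described could look roughly like the sketch below. It is not the actual unsloth change. The helper name `remove_tree_if_present` is hypothetical, and the sketch assumes the only failure to tolerate is the directory already being gone, which is the `FileNotFoundError` shown in the traceback.

```python
import os
import shutil
import tempfile


def remove_tree_if_present(path: str) -> bool:
    """Delete a directory tree, tolerating the case where it is already gone.

    Hypothetical helper, not part of unsloth. Returns True if this call
    completed the removal, and False if shutil.rmtree raised
    FileNotFoundError (for example because another process deleted the
    directory first).
    """
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    # Demo in a scratch location so nothing real is touched.
    base = tempfile.mkdtemp()
    target = os.path.join(base, "_unsloth_temporary_saved_buffers")
    os.makedirs(target)
    print(remove_tree_if_present(target))  # True: the directory existed
    print(remove_tree_if_present(target))  # False: already removed, no error
    shutil.rmtree(base)
```

Catching the exception, rather than testing `os.path.exists` first, avoids a second race in which the directory disappears between the check and the delete.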

jlin816 commented 5 months ago

Thanks! Does this potentially cause any issues with running two jobs on the same machine (e.g., mixing up checkpoint data somehow)?
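
Whether two jobs can actually mix up data is not settled in this thread. One general way to rule it out is to give each process its own uniquely named temporary directory instead of a shared relative path. The sketch below shows that pattern with the standard library only. The `private_buffer_dir` helper is hypothetical, and whether unsloth lets callers override the location is an assumption: the traceback only shows an internal `temporary_location` variable.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def private_buffer_dir(prefix: str = "_unsloth_temporary_saved_buffers_",
                       parent: str = "."):
    """Yield a directory that no other process will share.

    Hypothetical helper, not part of unsloth. tempfile.mkdtemp appends a
    random suffix to the prefix, so two concurrent jobs get two distinct
    directories even when they start in the same working directory.
    """
    path = tempfile.mkdtemp(prefix=prefix, dir=parent)
    try:
        yield path
    finally:
        # ignore_errors=True keeps cleanup from raising if the directory
        # has already been removed.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with private_buffer_dir(parent=tempfile.gettempdir()) as job_a, \
         private_buffer_dir(parent=tempfile.gettempdir()) as job_b:
        print(job_a != job_b)  # True: each job has its own directory
```

If the save call does accept a custom temporary location, the yielded path could be passed there. Otherwise, running each job from a different working directory has a similar effect, because the default path in the traceback is relative.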