Missing support for `resume_from_checkpoint` in `unsloth_train`

OrigamiDream commented 6 days ago

It seems that `unsloth_train` currently does not support resuming training from a checkpoint. This argument is needed to quickly restore the training state after an unexpected machine failure (especially on vast.ai).

Do you plan to add this argument to the function soon? Thanks!

danielhanchen commented 3 days ago

@OrigamiDream Apologies for the delay! We added the fix to Hugging Face transformers, so for now please use the nightly version of transformers, i.e.:

```bash
pip install unsloth
# Also get the latest nightly Unsloth!
pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"
```

`unsloth_train` will now function the same as `trainer.train()`.
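For the machine-failure scenario in the original report, a run typically needs to locate the most recent checkpoint before resuming. Below is a minimal sketch of a hypothetical helper, `find_latest_checkpoint` (not part of Unsloth or transformers), assuming the Hugging Face Trainer convention of saving checkpoints as `<output_dir>/checkpoint-<global_step>`. It compares step numbers as integers, since a plain string sort would rank `checkpoint-999` above `checkpoint-1000`.

```python
import os
import re
from typing import Optional

# Assumption: checkpoints follow the Hugging Face Trainer naming scheme
# "<output_dir>/checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the checkpoint directory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Only consider directories whose name matches checkpoint-<N>.
        if match and os.path.isdir(path):
            step = int(match.group(1))  # numeric, not lexicographic, comparison
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Hypothetical usage with an already-configured `trainer`, assuming
# unsloth_train forwards resume_from_checkpoint just like trainer.train():
#   from unsloth import unsloth_train
#   ckpt = find_latest_checkpoint(trainer.args.output_dir)
#   unsloth_train(trainer, resume_from_checkpoint=ckpt)  # None starts fresh
```

Note that the Hugging Face Trainer already accepts `resume_from_checkpoint=True`, which loads the last checkpoint in `args.output_dir`; a helper like this is mainly useful when you want to log or validate the path before resuming.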

OrigamiDream commented 3 days ago

I got it, thank you!

unslothai/unsloth-zoo, issue #3