Open aeltorio opened 1 week ago
@aeltorio, @SunMarc Unfortunately, restarting from a checkpoint does not work. Despite multiple attempts, I was unable to successfully restart a halted job.
@aeltorio, in light of this issue, I would like to recommend that you consider using a Virtual Machine (VM) with guaranteed compute resources instead of a preemptible VM. This approach may help mitigate the problem and ensure a more stable computing environment.
@herokukms ,
using a Virtual Machine (VM) with guaranteed compute resources instead of a preemptible VM
@herokukms thank you for your message regarding the use of a Virtual Machine (VM) with guaranteed compute resources as an alternative to a preemptible VM. However, I'm afraid this solution may not be the most suitable for my current needs.
As I am a self-employed individual working on a research project, I have to be mindful of my expenses. Unfortunately, guaranteed V100 VMs are three times more expensive than preemptible VMs. Given that each run lasts approximately 20 hours and I anticipate having to make adjustments after the initial attempt, I had budgeted for three fine-tuning runs. Unless I receive a donation of 60 hours of guaranteed V100 VM 😁 (you ? 😉), I still require a more cost-effective solution.
Furthermore, I would like to reiterate the importance of finding a solution for restarting a failed job from the last checkpoint. I would appreciate it if you could provide me with an update on this matter.
Ronan
cc @SunMarc @muellerzr and @BenjaminBossan: it seems like the trainer only saves the adapter weights, and therefore fails to reload the checkpoint. It would be great to look into it if your bandwidth allows :)
@LysandreJik
Yes, it only saves the adapter weights, which is fine once the training is complete. However, this approach removes the ability to restart the training process.
@SunMarc @muellerzr @BenjaminBossan It might be beneficial to introduce an option for saving checkpoints during training. This would consume more space, so making it optional would be ideal.
Best, Ronan
It might be beneficial to introduce an option for saving checkpoints during training. This would consume more space, so making it optional would be ideal.
I think instead, we should try to detect if it's a PEFT checkpoint. Then we can use the adapter_config.json to detect and load the base model first, then add the PEFT adapter.
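For illustration only, here is a rough sketch of what that detection could look like (this is not the actual transformers implementation; the helper name is made up):

```python
import json
import os


def find_peft_base_model(checkpoint_dir):
    """Return the base model id if checkpoint_dir is a PEFT checkpoint, else None."""
    adapter_config_path = os.path.join(checkpoint_dir, "adapter_config.json")
    if not os.path.isfile(adapter_config_path):
        return None  # no adapter_config.json, so not a PEFT checkpoint
    with open(adapter_config_path) as f:
        adapter_config = json.load(f)
    # PEFT records the model the adapter was trained on under this key.
    return adapter_config.get("base_model_name_or_path")
```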
In the meantime, instead of resuming from the model checkpoint, could you try loading the base model, then loading the trained LoRA adapter using model.load_adapter(<path>), and see if that works?
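Roughly something like this, as a minimal sketch (the model id and checkpoint path below are taken from elsewhere in this thread and may need adjusting):

```python
from transformers import AutoModelForVision2Seq

# Load the base model first (assumed to be the IDEFICS3 checkpoint used in the notebook).
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

# Then attach the trained LoRA adapter from the last saved checkpoint folder.
model.load_adapter("/workspace/IDEFICS3_ROCO/checkpoint-2350")
```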
@BenjaminBossan
First of all, thank you for your help. Restarting is needed for me because I don't have access to reliable GPU VMs; I only use some free GPU time…
I've just created a notebook that loads the adapter. You can try it on any CUDA device; it runs (very slowly on my poor local RTX 2060). It does not work, but maybe it is not exactly the test you wanted?
Ronan
my env is:
transformers version: 4.47.0.dev0
The exact environment used is a Docker image run with:
docker run --gpus all --user=42420:42420 -p 8080:8080 -e HF_TOKEN=hf_TOKEN -it sctg/roco-idefics3:0.0.5 bash -i /start.sh sleep infinity
Simply browse to http://local_or_distant_host:8080 and you'll find the notebook…
The Dockerfile is here
@aeltorio I'll investigate fixing the checkpoint issue for PEFT models in transformers; it'll probably take a bit of time.
Meanwhile, my suggestion was that if you want to resume training, don't use resume_from_checkpoint=True. Instead, load the PEFT model manually from the last checkpoint, pass it to the Trainer, and continue training from there. I know it's not the same thing as fully resuming from a checkpoint, but it might still unblock you for the time being.
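A hedged sketch of that manual-resume idea, assuming the same base model and checkpoint path as above and a training_args / train_dataset already defined in the notebook:

```python
from peft import PeftModel
from transformers import AutoModelForVision2Seq, Trainer

# Reload the base model, then wrap it with the adapter saved at the last checkpoint.
base = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")
model = PeftModel.from_pretrained(
    base,
    "/workspace/IDEFICS3_ROCO/checkpoint-2350",
    is_trainable=True,  # keep the LoRA weights trainable after loading
)

# Hand the reloaded PEFT model to a fresh Trainer and continue training.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()  # a new run starting from the reloaded adapter weights
```

Note that, unlike a true resume, the optimizer and scheduler state from the interrupted run are not restored this way.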
@BenjaminBossan
Thanks a lot for your work.
To finish my proof-of-concept model, I ran multiple trainings, each time starting from the previous model with a subset of the dataset.
Ronan
System Info
transformers version: 4.47.0.dev0
Who can help?
@muellerzr @SunMarc
I tried to fine-tune a model. Since I use a preemptible VM, I set resume_from_checkpoint = True and push_to_hub = True in my TrainingArguments. Predictably, the VM stopped after ≈2350 of 12k steps… After restarting it, I'd like to continue my training, so I reran my notebook cells, with trainer.train() replaced by trainer.train(resume_from_checkpoint = True).
Unfortunately, the training process does not restart and fails with the error:
I also tried to specify the exact path of the working directory (/workspace/IDEFICS3_ROCO) or the latest checkpoint path (/IDEFICS3_ROCO/checkpoint-2350).
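Roughly, the relevant pieces of my setup look like this (simplified sketch; the exact cells are in the full notebook linked below, and model / train_dataset come from earlier cells):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/workspace/IDEFICS3_ROCO",
    push_to_hub=True,
    resume_from_checkpoint=True,
)

# model and train_dataset are defined earlier in the notebook
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# First run, interrupted by the preemptible VM after ≈2350 of 12k steps:
# trainer.train()

# After restarting the VM, the calls that fail:
trainer.train(resume_from_checkpoint=True)
# or, with an explicit checkpoint path:
# trainer.train(resume_from_checkpoint="/workspace/IDEFICS3_ROCO/checkpoint-2350")
```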
On my Hugging Face repo the trainer committed 235 commits named "Training in progress, step xxx0".
When I look at the content of the VM directory I have:
my full notebook is:
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Full Colab notebook
https://colab.research.google.com/#fileId=https://huggingface.co/eltorio/IDEFICS3_ROCO/blob/main/ROCO-idefics3.ipynb
Expected behavior
Trainer should create restartable checkpoints