huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Training won't resume from checkpoint ( model = Idefics3ForConditionalGeneration.from_pretrained() ) #34660

Open aeltorio opened 1 week ago

aeltorio commented 1 week ago

System Info

Who can help?

@muellerzr @SunMarc

I tried to fine-tune a model. Since I use a preemptible VM, I decided to set the resume_from_checkpoint = True and push_to_hub = True TrainingArguments. As expected, the VM was stopped after ≈2350 of 12k steps… After restarting it, I wanted to continue training, so I reran my notebook cells, with trainer.train() replaced by trainer.train(resume_from_checkpoint = True).
Unfortunately, the training process refuses to restart and fails with this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 trainer.train(resume_from_checkpoint = True)

File ~/.miniconda3/lib/python3.11/site-packages/transformers/trainer.py:2113, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2111 if resume_from_checkpoint is not None:
   2112     if not is_sagemaker_mp_enabled() and not self.is_deepspeed_enabled and not self.is_fsdp_enabled:
-> 2113         self._load_from_checkpoint(resume_from_checkpoint)
   2114     # In case of repeating the find_executable_batch_size, set `self._train_batch_size` properly
   2115     state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))

File ~/.miniconda3/lib/python3.11/site-packages/transformers/trainer.py:2836, in Trainer._load_from_checkpoint(self, resume_from_checkpoint, model)
   2833         logger.warning("Could not load adapter model, make sure to have `peft>=0.3.0` installed")
   2834 else:
   2835     # We load the sharded checkpoint
-> 2836     load_result = load_sharded_checkpoint(
   2837         model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled(), prefer_safe=self.args.save_safetensors
   2838     )
   2839     if not is_sagemaker_mp_enabled():
   2840         self._issue_warnings_after_load(load_result)

File ~/.miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py:504, in load_sharded_checkpoint(model, folder, strict, prefer_safe)
    500 if not index_present and not (safe_index_present and is_safetensors_available()):
    501     filenames = (
    502         (WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME) if is_safetensors_available() else (WEIGHTS_INDEX_NAME,)
    503     )
--> 504     raise ValueError(f"Can't find a checkpoint index ({' or '.join(filenames)}) in {folder}.")
    506 load_safe = False
    507 if safe_index_present:

ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in /workspace/IDEFICS3_ROCO/checkpoint-2350.

I also tried specifying the exact path of the working directory ( /workspace/IDEFICS3_ROCO ) as well as the latest checkpoint path ( /IDEFICS3_ROCO/checkpoint-2350 ).

On my Hugging Face repo, the trainer committed 235 commits named "Training in progress, step xxx0".

When I look at the content of the VM directory, I have:

(base) ovh@job-5876edc4-c078-4de9-b7e3-4ebb320c0908:~/IDEFICS3_ROCO$ ls -la
total 82444
drwxr-xr-x  7 ovh ovh       11 Nov  8 17:21 .
drwxr-x--- 17 ovh ovh       21 Nov  8 17:19 ..
drwxr-xr-x  9 ovh ovh       12 Nov  8 17:19 .git
-rw-r--r--  1 ovh ovh     1519 Nov  8 17:19 .gitattributes
drwxr-xr-x  2 ovh ovh        0 Nov  8 17:21 .ipynb_checkpoints
-rw-r--r--  1 ovh ovh     3973 Nov  8 17:19 README.md
-rw-r--r--  1 ovh ovh   459271 Nov  8 17:19 ROCO-idefics3.ipynb
-rw-r--r--  1 ovh ovh      741 Nov  8 17:19 adapter_config.json
-rw-r--r--  1 ovh ovh 83950224 Nov  8 17:19 adapter_model.safetensors
drwxr-xr-x  2 ovh ovh        8 Nov  8 15:56 checkpoint-2330
drwxr-xr-x  2 ovh ovh        8 Nov  8 15:57 checkpoint-2340
drwxr-xr-x  2 ovh ovh        8 Nov  8 15:58 checkpoint-2350
-rw-r--r--  1 ovh ovh     5368 Nov  8 17:19 training_args.bin

(base) ovh@job-5876edc4-c078-4de9-b7e3-4ebb320c0908:~/IDEFICS3_ROCO$ ls -la checkpoint-2350/
total 124204
drwxr-xr-x 2 ovh ovh        8 Nov  8 15:58 .
drwxr-xr-x 7 ovh ovh       11 Nov  8 17:21 ..
-rw-r--r-- 1 ovh ovh      741 Nov  8 15:58 adapter_config.json
-rw-r--r-- 1 ovh ovh 83950224 Nov  8 15:58 adapter_model.safetensors
-rw-r--r-- 1 ovh ovh      198 Nov  8 15:58 generation_config.json
-rw-r--r-- 1 ovh ovh 43128276 Nov  8 15:58 optimizer.pt
-rw-r--r-- 1 ovh ovh    14244 Nov  8 15:58 rng_state.pth
-rw-r--r-- 1 ovh ovh     1064 Nov  8 15:58 scheduler.pt
-rw-r--r-- 1 ovh ovh    82570 Nov  8 15:58 trainer_state.json
-rw-r--r-- 1 ovh ovh     5368 Nov  8 15:58 training_args.bin
# other checkpoint folders are similar
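
For clarity, these listings show the mismatch behind the ValueError above: the checkpoint folders only contain the PEFT adapter files, while the Trainer's load_sharded_checkpoint() looks for a full-model weight index. A quick sanity-check sketch (paths taken from this report) makes it visible:

import os

ckpt = "/workspace/IDEFICS3_ROCO/checkpoint-2350"

# index files load_sharded_checkpoint() expects for a full-model checkpoint
expected = ["pytorch_model.bin.index.json", "model.safetensors.index.json"]
# files the PEFT/LoRA run actually saved
adapter_files = ["adapter_config.json", "adapter_model.safetensors"]

for name in expected + adapter_files:
    present = os.path.isfile(os.path.join(ckpt, name))
    print(f"{name:40} {'found' if present else 'missing'}")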

My full notebook: Open In Colab

Information

Tasks

Reproduction

Full Colab notebook

https://colab.research.google.com/#fileId=https://huggingface.co/eltorio/IDEFICS3_ROCO/blob/main/ROCO-idefics3.ipynb

import torch
from peft import LoraConfig
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          Idefics3ForConditionalGeneration, Trainer, TrainingArguments)

# source_model_id, output_dir, data_collator and train_dataset are defined in
# earlier cells of the notebook linked above.

DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = True

processor = AutoProcessor.from_pretrained(
    source_model_id,
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics3ForConditionalGeneration.from_pretrained(
        source_model_id,
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        source_model_id,
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2", # This works for A100 or H100
    ).to(DEVICE)

training_args = TrainingArguments(
    output_dir = output_dir,
    overwrite_output_dir = False,
    auto_find_batch_size = True,
    learning_rate = 2e-4,
    fp16 = True,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 2,
    gradient_accumulation_steps = 8,
    dataloader_pin_memory = False,
    save_total_limit = 3,
    evaluation_strategy = None,
    save_strategy = "steps",
    eval_steps = 100,
    save_steps = 10, # save a checkpoint every 10 steps
    resume_from_checkpoint = True,
    logging_steps = 5,
    remove_unused_columns = False,
    push_to_hub = True,
    label_names = ["labels"],
    load_best_model_at_end = False,
    report_to = "none",
    optim = "paged_adamw_8bit",
)
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
)
trainer.train(resume_from_checkpoint = True)

Expected behavior

The Trainer should create restartable checkpoints.

herokukms commented 1 week ago

@aeltorio, @SunMarc Unfortunately, restarting from a checkpoint does not work. Despite multiple attempts, I was unable to restart a halted job.

In light of this issue, I would recommend, @aeltorio, that you consider using a Virtual Machine (VM) with guaranteed compute resources instead of a preemptible VM. This may help mitigate the problem and ensure a more stable computing environment.

aeltorio commented 1 week ago

@herokukms ,

using a Virtual Machine (VM) with guaranteed compute resources instead of a preemptible VM

@herokukms thank you for your message regarding the use of a Virtual Machine (VM) with guaranteed compute resources as an alternative to a preemptible VM. However, I'm afraid this solution may not be the most suitable for my current needs.

As I am a self-employed individual working on a research project, I have to be mindful of my expenses. Unfortunately, guaranteed V100 VMs are three times more expensive than preemptible VMs. Given that each run lasts approximately 20 hours and I anticipate having to make adjustments after the initial attempt, I had budgeted for three fine-tuning runs. Unless I receive a donation of 60 hours of guaranteed V100 VM time 😁 (you? 😉), I still need a more cost-effective solution.

Furthermore, I would like to reiterate the importance of finding a solution for restarting a failed job from the last checkpoint. I would appreciate it if you could provide me with an update on this matter.

Ronan

LysandreJik commented 4 days ago

cc @SunMarc @muellerzr and @BenjaminBossan it seems like the trainer is only saving the adapter weights, and it is therefore failing to reload the checkpoint. It would be great to look into it if your bandwidth allows it :)

aeltorio commented 4 days ago

@LysandreJik

Yes, it only saves the adapter weights, which is fine once the training is complete. However, this approach removes the ability to restart the training process.

@SunMarc @muellerzr @BenjaminBossan It might be beneficial to introduce an option for saving full-model checkpoints during training. This would consume more space, so making it optional would be ideal.

Best, Ronan

BenjaminBossan commented 4 days ago

It might be beneficial to introduce an option for saving full-model checkpoints during training. This would consume more space, so making it optional would be ideal.

I think instead, we should try to detect whether it's a PEFT checkpoint. Then we can use the adapter_config.json to identify and load the base model first, then add the PEFT adapter.
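
For illustration, that detection could look roughly like this (untested sketch; the helper names are made up and this is not the actual fix):

import os
from peft import PeftConfig

def is_peft_checkpoint(checkpoint_dir):
    # a PEFT/LoRA checkpoint is identified by its adapter_config.json
    return os.path.isfile(os.path.join(checkpoint_dir, "adapter_config.json"))

def peft_base_model_id(checkpoint_dir):
    # adapter_config.json records the base model the adapter was trained on
    return PeftConfig.from_pretrained(checkpoint_dir).base_model_name_or_path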

In the meantime, instead of resuming from the model checkpoint, could you try loading the base model, then loading the trained LoRA adapter using model.load_adapter(<path>), and see if that works?
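
A minimal sketch of that suggestion, reusing the QLoRA setup and checkpoint path from this thread (untested; source_model_id comes from the notebook, and the checkpoint path may differ on your machine):

import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

# rebuild the quantized base model exactly as in the original training code
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    source_model_id,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

# attach the LoRA weights saved in the last checkpoint (transformers PEFT integration)
model.load_adapter("/workspace/IDEFICS3_ROCO/checkpoint-2350")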

aeltorio commented 4 days ago

@BenjaminBossan First of all, thank you for your help. Restarting is needed for me because I don't have access to reliable GPU VMs; I only use some free GPU time… I've just created a notebook (Open In Colab) that loads the adapter.
You can try it on any CUDA device; it runs (very slowly on my poor local RTX 2060). It does not work, but perhaps it is not exactly the test you wanted?
Ronan

My environment:

The exact environment used is a Docker image run with:

docker run --gpus all --user=42420:42420 -p 8080:8080 -e HF_TOKEN=hf_TOKEN -it sctg/roco-idefics3:0.0.5 bash -i /start.sh sleep infinity

Simply browse to http://local_or_distant_host:8080 and you'll find the notebook…
The Dockerfile is here

BenjaminBossan commented 1 day ago

@aeltorio I'll investigate fixing the checkpoint issue for PEFT models in transformers; it'll probably take a bit of time.

Meanwhile, my suggestion was that if you want to resume training, don't use resume_from_checkpoint=True. Instead, load the PEFT model manually from the last checkpoint, pass it to the Trainer and continue training from there. I know it's not the same thing as fully resuming from checkpoint, but it might still unblock you for the time being.
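
For reference, that workaround could look roughly like this (untested sketch; base_model is the quantized Idefics3 base model built as in the reproduction above but without add_adapter(), and training_args, data_collator and train_dataset are the same objects as before):

from peft import PeftModel
from transformers import Trainer

last_checkpoint = "/workspace/IDEFICS3_ROCO/checkpoint-2350"

# wrap the freshly built base model with the trained adapter and keep it trainable
model = PeftModel.from_pretrained(base_model, last_checkpoint, is_trainable=True)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
# plain train(): the optimizer/scheduler state and the global step are NOT restored this way
trainer.train()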

eltorio commented 1 day ago

@BenjaminBossan Thanks a lot for your work.
To finish my proof-of-concept model, I started multiple training runs, each time from the previous model and with a subset of the dataset.

Ronan
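
For reference, the iterative workaround described in this last comment could look roughly like the following (untested sketch; the repo id, the number of shards and run_index are assumptions for illustration, and train_dataset, training_args and data_collator are the objects from the reproduction code above):

import torch
from transformers import Idefics3ForConditionalGeneration, Trainer

# each run starts from the model pushed by the previous run...
previous_run_id = "eltorio/IDEFICS3_ROCO"  # assumed: the Hub repo the previous run pushed to
model = Idefics3ForConditionalGeneration.from_pretrained(previous_run_id, torch_dtype=torch.float16)

# ...and only sees one slice of the training data (datasets.Dataset.shard)
subset = train_dataset.shard(num_shards=3, index=run_index)  # hypothetical run_index: 0, 1, 2

trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=subset)
trainer.train()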