unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

About UnslothTrainer #1145

Open Vital1162 opened 1 month ago

Vital1162 commented 1 month ago

After the gradient accumulation fix, I tried to continue pre-training the Llama 3.2 3B model on my datasets, but I hit the following error during training. Does anyone have a solution?

from unsloth import unsloth_train
trainer_stats = unsloth_train(trainer)
/usr/local/lib/python3.10/dist-packages/unsloth_zoo/training_utils.py:196: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  float16_scaler = torch.cuda.amp.GradScaler()
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
    \   /|    Num examples = 2,500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 312
 "-____-"     Number of trainable parameters = 982,515,712
  0%|          | 1/312 [00:25<2:13:19, 25.72s/it]0, 2.5521
  4%|▎         | 11/312 [02:27<1:00:48, 12.12s/it]10, 2.2398
  7%|▋         | 21/312 [04:28<58:10, 12.00s/it]20, 2.2377
 10%|▉         | 31/312 [06:29<55:54, 11.94s/it]30, 2.1703
 13%|█▎        | 41/312 [08:33<55:08, 12.21s/it]40, 2.2717
 16%|█▋        | 51/312 [10:34<53:04, 12.20s/it]50, 2.1472
 20%|█▉        | 61/312 [12:37<51:23, 12.29s/it]60, 2.099
 23%|██▎       | 71/312 [14:40<49:20, 12.28s/it]70, 2.2203
 26%|██▌       | 81/312 [16:42<47:00, 12.21s/it]80, 2.0887
 29%|██▉       | 91/312 [18:46<45:33, 12.37s/it]90, 2.1831
 32%|███▏      | 101/312 [20:48<42:44, 12.16s/it]100, 2.1756
 36%|███▌      | 111/312 [22:51<41:39, 12.43s/it]110, 2.1127
 39%|███▉      | 121/312 [24:52<37:45, 11.86s/it]120, 2.0503
 42%|████▏     | 131/312 [26:54<36:43, 12.17s/it]130, 2.1249
 45%|████▌     | 141/312 [28:56<35:01, 12.29s/it]140, 2.1993
 48%|████▊     | 151/312 [30:58<32:48, 12.23s/it]150, 1.9419
 50%|█████     | 156/312 [31:59<31:53, 12.26s/it]
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-11-7a9b0bd5dae9> in <cell line: 4>()
      2 
      3 # trainer_stats = trainer.train()
----> 4 trainer_stats = unsloth_train(trainer)

4 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_index(self)
    618 
    619     def _next_index(self):
--> 620         return next(self._sampler_iter)  # may raise StopIteration
    621 
    622     def _next_data(self):

StopIteration:
CurtiusSimplus commented 1 month ago

Oh, I had not seen this new trainer code yet... Okay, let's try that. Thanks, buddy.

CurtiusSimplus commented 1 month ago

Observation: it is just as slow as `trainer` -- a 5-6 hour ETA for a 280-step train.

danielhanchen commented 1 month ago

@Vital1162 We worked with the Hugging Face team to add the fix into transformers!

You'll have to use the latest transformers temporarily (you can continue using unsloth_train or just use trainer.train()):

!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"
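
After the upgrade, either entry point should work again; here is a minimal sketch, assuming `trainer` is the same SFTTrainer instance built earlier in the notebook:

    from unsloth import unsloth_train

    # Unsloth's training loop with the gradient accumulation fix
    trainer_stats = unsloth_train(trainer)

    # or, once the patched transformers build above is installed:
    # trainer_stats = trainer.train()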
CurtiusSimplus commented 1 month ago

Thanks GUYS ... You guys do good work.

Vital1162 commented 1 month ago

@danielhanchen Thank you for your reply. I haven't rechecked yet, but when I run the trainer it asks me for a wandb API key.

Is it still ok if I disable it?

wandb.init(mode="disabled")
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
danielhanchen commented 1 month ago

@Vital1162 Oh sorry, just fixed it - see https://github.com/unslothai/unsloth/issues/1153, i.e.

I updated all training notebooks - please edit the TrainingArguments part by adding report_to = "none". For example:


    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        ...
    ),

should be edited to:


    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        ...
        report_to = "none", # Use this for WandB etc
    ),
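
For context, here is a minimal sketch of how that argument sits inside a full trainer setup, modeled on the Unsloth notebooks; the `model`, `tokenizer`, and `dataset` variables and the other hyperparameter values are illustrative assumptions, not taken from this thread:

    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model = model,                        # FastLanguageModel loaded earlier in the notebook
        tokenizer = tokenizer,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = 2048,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            max_steps = 60,                   # illustrative value
            learning_rate = 2e-4,
            output_dir = "outputs",
            report_to = "none",               # no WandB API key prompt
        ),
    )

    trainer_stats = trainer.train()

With report_to = "none", the trainer never calls wandb.init() itself, so the API-key prompt and the run_name warning above should no longer appear.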