lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

AssertionError: No inf checks were recorded for this optimizer. #522

Open samarthsarin opened 1 year ago

samarthsarin commented 1 year ago

While fine-tuning I am getting the following error: `AssertionError: No inf checks were recorded for this optimizer.`

Can anyone help me with this? Here are my training arguments: per_device_train_batch_size=2, warmup_steps=100, num_train_epochs=3, fp16=True, logging_steps=1, output_dir='llama_output/', gradient_accumulation_steps = 2, evaluation_strategy = "no", save_strategy = "no", save_steps = 1200, learning_rate = 2e-5, weight_decay = 0., warmup_ratio = 0.03, lr_scheduler_type = "cosine", tf32 = False, gradient_checkpointing = False
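For reference, the flags above map onto a `transformers.TrainingArguments` configuration roughly like the sketch below. This is an illustration reconstructed from the listed arguments, not the exact code from the attached notebook.

```python
# Sketch of the reported configuration as a TrainingArguments object
# (reconstructed from the flags listed above, not copied from the notebook).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama_output/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_steps=100,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=1,
    evaluation_strategy="no",
    save_strategy="no",
    save_steps=1200,
    fp16=True,  # mixed precision: the Trainer then steps through a GradScaler, which is where the assertion in this issue is raised
    tf32=False,
    gradient_checkpointing=False,
)
```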

samarthsarin commented 1 year ago

@merrymercy any solution for this? Please help

zhisbug commented 1 year ago

@samarthsarin Please provide more information (full stack trace); it is hard to help by only seeing an assertion error.

samarthsarin commented 1 year ago

Hi @zhisbug @merrymercy, here are all the steps I followed for the fine-tuning. I am using the train.py file with the dummy.json file provided in the README. I have limited GPU memory (16 GB), so I cannot load the full model; I have slightly modified the code to convert and load the model in 8-bit using peft and bitsandbytes. The full error is as follows:

    Traceback (most recent call last):
      <string>:1 in <module>
      <string>:253 in train

      /opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1661 in train
        1658     inner_training_loop = find_executable_batch_size(
        1659         self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
        1660     )
        1661 ❱   return inner_training_loop(
        1662         args=args,
        1663         resume_from_checkpoint=resume_from_checkpoint,
        1664         trial=trial,

      /opt/conda/lib/python3.9/site-packages/transformers/trainer.py:1990 in _inner_training_loop
        1987                     xm.optimizer_step(self.optimizer)
        1988                 elif self.do_grad_scaling:
        1989                     scale_before = self.scaler.get_scale()
        1990 ❱                   self.scaler.step(self.optimizer)
        1991                     self.scaler.update()
        1992                     scale_after = self.scaler.get_scale()
        1993                     optimizer_was_run = scale_before <= scale_after

      /opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:339 in step
        336     if optimizer_state["stage"] is OptState.READY:
        337         self.unscale_(optimizer)
        338
        339 ❱   assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
        340
        341     retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
        342

    AssertionError: No inf checks were recorded for this optimizer.

I am also attaching the Jupyter notebook if you want to have a look. My environment: Python 3.9, CUDA 11.7.

Vicu 1.1 Finetuning.zip
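Not a confirmed diagnosis for this particular notebook, but the assertion in the traceback above is raised when `GradScaler.step` finds that no inf checks were recorded for the optimizer, which typically means none of the optimizer's parameters actually produced gradients. With an 8-bit base model, that can happen if the frozen quantized model is handed to the Trainer without trainable adapter parameters attached. Below is a minimal sketch of the kind of peft + bitsandbytes setup described in the comment above, assuming a LLaMA-style checkpoint and the peft API of that era (`prepare_model_for_int8_training`); names such as `model_path` are placeholders, not details from the issue.

```python
# Minimal sketch (not the poster's notebook) of loading a model in 8-bit with
# bitsandbytes and attaching LoRA adapters with peft before fine-tuning.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_path = "path/to/llama-7b"  # placeholder; the issue does not name the checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,          # bitsandbytes int8 weights, to fit a 16 GB GPU
    torch_dtype=torch.float16,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)  # peft helper: freezes base weights, casts norm layers, enables input grads

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical LoRA targets for LLaMA-style models
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should report a non-zero number of trainable parameters
```

If the trainable-parameter count is zero, or the adapter parameters never receive gradients, the AMP GradScaler used under `fp16=True` has nothing to check and fails with exactly the `No inf checks were recorded for this optimizer` assertion shown in the traceback.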