unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Bitsandbytes issue #1080

Open StrangeTcy opened 1 month ago

StrangeTcy commented 1 month ago

I'm using a slightly modified notebook (like https://colab.research.google.com/drive/1mvwsIQWDs2EdZxZQF9pRGnnOvE86MVvR?usp=sharing) to finetune a Qwen2 model. My installation instructions are:

#%%capture

!mamba install --force-reinstall aiohttp -y
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install --upgrade "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"

# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip3 install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0

!pip3 install -U wandb

When I try to resume training, the trainer_stats = trainer.train(resume_from_checkpoint=True) cell runs into the following error:

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,233,923 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 6,350,883
 "-____-"     Number of trainable parameters = 40,370,176

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[24], line 1
----> 1 trainer_stats = trainer.train(resume_from_checkpoint=True)

File <string>:140, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)

File <string>:404, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)

File /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:2387, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
   2385             if parameters == [p for p in model.parameters()]:
   2386                 return model.clip_grad_norm_(max_norm, norm_type)
-> 2387 self.unscale_gradients()
   2388 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)

File /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:2331, in Accelerator.unscale_gradients(self, optimizer)
   2329 while isinstance(opt, AcceleratedOptimizer):
   2330     opt = opt.optimizer
-> 2331 self.scaler.unscale_(opt)

File /opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:325, in unscale_(self, optimizer)
      0 <Error retrieving source code with stack_data see ipython/ipython#13598>

RuntimeError: unscale_() has already been called on this optimizer since the last update().

I think that something's gone wrong compatibility-wise. I've tried using different versions of pytorch, accelerate, transformers and trl, but the issue persists.

Please advise
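
For readers unfamiliar with the message: PyTorch's GradScaler keeps a per-optimizer stage and rejects a second unscale_() before the next update(). The toy class below is a simplified stand-in written for illustration only (ToyScaler, Stage, and second_unscale_fails are hypothetical names, not PyTorch code); it mimics just the bookkeeping behind this RuntimeError.

```python
from enum import Enum


class Stage(Enum):
    READY = 0
    UNSCALED = 1
    STEPPED = 2


class ToyScaler:
    """Hypothetical stand-in that mimics GradScaler's per-optimizer stage check."""

    def __init__(self):
        self._stage = {}  # id(optimizer) -> Stage

    def unscale_(self, optimizer):
        stage = self._stage.get(id(optimizer), Stage.READY)
        if stage is Stage.UNSCALED:
            raise RuntimeError(
                "unscale_() has already been called on this optimizer "
                "since the last update()."
            )
        if stage is Stage.STEPPED:
            raise RuntimeError("unscale_() is being called after step().")
        self._stage[id(optimizer)] = Stage.UNSCALED

    def step(self, optimizer):
        # Unscale implicitly if the caller has not done so yet.
        if self._stage.get(id(optimizer), Stage.READY) is Stage.READY:
            self.unscale_(optimizer)
        self._stage[id(optimizer)] = Stage.STEPPED

    def update(self):
        # Resets every optimizer back to READY for the next iteration.
        self._stage.clear()


def second_unscale_fails(call_update_between):
    """Call unscale_() twice, optionally with step()/update() in between."""
    scaler, opt = ToyScaler(), object()
    scaler.unscale_(opt)
    if call_update_between:
        scaler.step(opt)
        scaler.update()
    try:
        scaler.unscale_(opt)
    except RuntimeError:
        return True
    return False
```

In the traceback above, the failing call is reached through clip_grad_norm_ and unscale_gradients(). Why the optimizer is already in the unscaled stage when resuming is not established here.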

danielhanchen commented 1 month ago

@StrangeTcy Did you set fp16 = True or bf16 = True in the trainer args?

PS if these are Kaggle install instructions - there are updated ones here: https://www.kaggle.com/danielhanchen/kaggle-llama-3-2-1b-3b-unsloth-notebook

StrangeTcy commented 1 month ago

@danielhanchen

  1. trainer = SFTTrainer(
         model = model,
         tokenizer = tokenizer,
         train_dataset = dataset,
         # ...
         args = TrainingArguments(
             fp16 = not is_bfloat16_supported(),
             bf16 = is_bfloat16_supported(),
             # ...
         ),
     )

    -- yes, I did

  2. I'll try that & get back to you, thanks

ETA:

/opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 trainer_stats = trainer.train(resume_from_checkpoint = True)

File <string>:140, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)

File <string>:425, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)

File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:159, in AcceleratedOptimizer.step(self, closure)
    156 if self.scaler is not None:
    157     self.optimizer.step = self._optimizer_patched_step_method
--> 159     self.scaler.step(self.optimizer, closure)
    160     self.scaler.update()
    162     if not self._accelerate_step_called:
    163         # If the optimizer step was skipped, gradient overflow was detected.

File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:454, in GradScaler.step(self, optimizer, *args, **kwargs)
    448     self.unscale_(optimizer)
    450 assert (
    451     len(optimizer_state["found_inf_per_device"]) > 0
    452 ), "No inf checks were recorded for this optimizer."
--> 454 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
    456 optimizer_state["stage"] = OptState.STEPPED
    458 return retval

File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:352, in GradScaler._maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs)
    350 retval: Optional[float] = None
    351 if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
--> 352     retval = optimizer.step(*args, **kwargs)
    353 return retval

File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:214, in patch_optimizer_step.<locals>.patched_step(*args, **kwargs)
    212 def patched_step(*args, **kwargs):
    213     accelerated_optimizer._accelerate_step_called = True
--> 214     return method(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:130, in LRScheduler.__init__.<locals>.patch_track_step_called.<locals>.wrap_step.<locals>.wrapper(*args, **kwargs)
    128 opt = opt_ref()
    129 opt._opt_called = True  # type: ignore[union-attr]
--> 130 return func.__get__(opt, opt.__class__)(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:484, in Optimizer.profile_hook_step.<locals>.wrapper(*args, **kwargs)
    479         else:
    480             raise RuntimeError(
    481                 f"{func} must return None or a tuple of (new_args, new_kwargs), but got {result}."
    482             )
--> 484 out = func(*args, **kwargs)
    485 self._optimizer_step_code()
    487 # call optimizer step post hooks

File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:291, in Optimizer8bit.step(self, closure)
    288             self.init_state(group, p, gindex, pindex)
    290         self.prefetch_state(p)
--> 291         self.update_step(group, p, gindex, pindex)
    292         torch.cuda.synchronize()
    293 if self.is_paged:
    294     # all paged operation are asynchronous, we need
    295     # to sync to make sure all tensors are in the right state

File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:569, in Optimizer2State.update_step(self, group, p, gindex, pindex)
    567     state["max2"], state["new_max2"] = state["new_max2"], state["max2"]
    568 elif state["state1"].dtype == torch.uint8 and config["block_wise"]:
--> 569     F.optimizer_update_8bit_blockwise(
    570         self.optimizer_name,
    571         grad,
    572         p,
    573         state["state1"],
    574         state["state2"],
    575         config["betas"][0],
    576         config["betas"][1],
    577         config["betas"][2] if len(config["betas"]) >= 3 else 0.0,
    578         config["alpha"],
    579         config["eps"],
    580         step,
    581         config["lr"],
    582         state["qmap1"],
    583         state["qmap2"],
    584         state["absmax1"],
    585         state["absmax2"],
    586         config["weight_decay"],
    587         gnorm_scale=gnorm_scale,
    588         skip_zeros=config["skip_zeros"],
    589     )

File /opt/conda/lib/python3.10/site-packages/bitsandbytes/functional.py:1843, in optimizer_update_8bit_blockwise(optimizer_name, g, p, state1, state2, beta1, beta2, beta3, alpha, eps, step, lr, qmap1, qmap2, absmax1, absmax2, weight_decay, gnorm_scale, skip_zeros)
   1832 is_on_gpu([p, g, state1, state2, qmap1, qmap2, absmax1, absmax2])
   1834 prev_device = pre_call(g.device)
   1835 optim_func(
   1836     get_ptr(p),
   1837     get_ptr(g),
   1838     get_ptr(state1),
   1839     get_ptr(state2),
   1840     ct.c_float(beta1),
   1841     ct.c_float(beta2),
   1842     ct.c_float(beta3),
-> 1843     ct.c_float(alpha),
   1844     ct.c_float(eps),
   1845     ct.c_int32(step),
   1846     ct.c_float(lr),
   1847     get_ptr(qmap1),
   1848     get_ptr(qmap2),
   1849     get_ptr(absmax1),
   1850     get_ptr(absmax2),
   1851     ct.c_float(weight_decay),
   1852     ct.c_float(gnorm_scale),
   1853     ct.c_bool(skip_zeros),
   1854     ct.c_int32(g.numel()),
   1855 )
   1856 post_call(prev_device)

TypeError: must be real number, not NoneType

-- that's the error I'm getting now, with unsloth installed the new way. My guess is that something's wrong with the quantization, but it's hard to debug from within Kaggle.
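
The last frame can be reproduced in isolation. The sketch below assumes, without confirmation from the traceback alone, that one of the optimizer config values passed to ct.c_float (the arrow points at alpha) is None. The config dict and the to_c_float helper are made up for illustration and are not bitsandbytes code.

```python
import ctypes as ct

# Hypothetical optimizer config, standing in for one that lacks an "alpha" entry.
config = {"betas": (0.9, 0.999), "eps": 1e-8, "lr": 2e-4}


def to_c_float(value):
    """Convert to ctypes.c_float, returning the error text instead of raising."""
    try:
        return ct.c_float(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


ok = to_c_float(config["eps"])          # a real number converts fine
bad = to_c_float(config.get("alpha"))   # .get() yields None for the missing key
print(bad)
```

If that assumption holds, the problem would lie in the restored optimizer state rather than in the model's quantization.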

danielhanchen commented 1 month ago

@StrangeTcy Ok that looks like a bitsandbytes issue - will investigate

matthewdouglas commented 1 month ago

@StrangeTcy If you initially trained with bitsandbytes < 0.44 and then tried to resume training with 0.44+, this can happen. I would recommend trying again with bitsandbytes==0.43.3.

@danielhanchen I also saw this question come up on Discord. Will try to have that fixed in the next bitsandbytes patch release.
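
A minimal sketch of a pre-resume check based on the advice above. The helper names (parse_release, crosses_0_44, installed_bitsandbytes) are hypothetical, and the version that wrote the checkpoint has to be supplied by hand; "0.43.3" below is only a placeholder.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Return the leading numeric components, e.g. '0.43.3' -> (0, 43, 3)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def crosses_0_44(trained_with, resuming_with):
    """True if the checkpoint predates 0.44 but the resuming install does not."""
    boundary = (0, 44)
    return parse_release(trained_with) < boundary <= parse_release(resuming_with)


def installed_bitsandbytes():
    """Return the installed bitsandbytes version string, or None if absent."""
    try:
        return version("bitsandbytes")
    except PackageNotFoundError:
        return None


current = installed_bitsandbytes()
# "0.43.3" is a placeholder for whichever version wrote the checkpoint.
if current is not None and crosses_0_44("0.43.3", current):
    print(
        f"bitsandbytes {current} may fail to resume a pre-0.44 checkpoint; "
        "consider pinning bitsandbytes==0.43.3"
    )
```

Running this before trainer.train(resume_from_checkpoint=True) only flags the version mismatch described above; it does not inspect the checkpoint itself.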

StrangeTcy commented 1 month ago

@matthewdouglas interesting; I'll try that & report back, thanks

ETA: yes, apparently that works, thanks!

danielhanchen commented 1 month ago

Thanks @matthewdouglas ! :) Sorry about the issue @StrangeTcy