Open · StrangeTcy opened this issue 1 month ago
@StrangeTcy Did you set fp16 = True or bf16 = True in the trainer args?
PS if these are Kaggle install instructions - there are updated ones here: https://www.kaggle.com/danielhanchen/kaggle-llama-3-2-1b-3b-unsloth-notebook
@danielhanchen
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = TrainingArguments(
        # ...other training arguments trimmed...
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
    ),
)
-- yes, I did
I'll try that & get back to you, thanks
ETA:
/opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint_rng_state = torch.load(rng_file)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 trainer_stats = trainer.train(resume_from_checkpoint = True)
File <string>:140, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
File <string>:425, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:159, in AcceleratedOptimizer.step(self, closure)
156 if self.scaler is not None:
157 self.optimizer.step = self._optimizer_patched_step_method
--> 159 self.scaler.step(self.optimizer, closure)
160 self.scaler.update()
162 if not self._accelerate_step_called:
163 # If the optimizer step was skipped, gradient overflow was detected.
File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:454, in GradScaler.step(self, optimizer, *args, **kwargs)
448 self.unscale_(optimizer)
450 assert (
451 len(optimizer_state["found_inf_per_device"]) > 0
452 ), "No inf checks were recorded for this optimizer."
--> 454 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
456 optimizer_state["stage"] = OptState.STEPPED
458 return retval
File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:352, in GradScaler._maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs)
350 retval: Optional[float] = None
351 if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
--> 352 retval = optimizer.step(*args, **kwargs)
353 return retval
File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:214, in patch_optimizer_step.<locals>.patched_step(*args, **kwargs)
212 def patched_step(*args, **kwargs):
213 accelerated_optimizer._accelerate_step_called = True
--> 214 return method(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:130, in LRScheduler.__init__.<locals>.patch_track_step_called.<locals>.wrap_step.<locals>.wrapper(*args, **kwargs)
128 opt = opt_ref()
129 opt._opt_called = True # type: ignore[union-attr]
--> 130 return func.__get__(opt, opt.__class__)(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:484, in Optimizer.profile_hook_step.<locals>.wrapper(*args, **kwargs)
479 else:
480 raise RuntimeError(
481 f"{func} must return None or a tuple of (new_args, new_kwargs), but got {result}."
482 )
--> 484 out = func(*args, **kwargs)
485 self._optimizer_step_code()
487 # call optimizer step post hooks
File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:291, in Optimizer8bit.step(self, closure)
288 self.init_state(group, p, gindex, pindex)
290 self.prefetch_state(p)
--> 291 self.update_step(group, p, gindex, pindex)
292 torch.cuda.synchronize()
293 if self.is_paged:
294 # all paged operation are asynchronous, we need
295 # to sync to make sure all tensors are in the right state
File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:569, in Optimizer2State.update_step(self, group, p, gindex, pindex)
567 state["max2"], state["new_max2"] = state["new_max2"], state["max2"]
568 elif state["state1"].dtype == torch.uint8 and config["block_wise"]:
--> 569 F.optimizer_update_8bit_blockwise(
570 self.optimizer_name,
571 grad,
572 p,
573 state["state1"],
574 state["state2"],
575 config["betas"][0],
576 config["betas"][1],
577 config["betas"][2] if len(config["betas"]) >= 3 else 0.0,
578 config["alpha"],
579 config["eps"],
580 step,
581 config["lr"],
582 state["qmap1"],
583 state["qmap2"],
584 state["absmax1"],
585 state["absmax2"],
586 config["weight_decay"],
587 gnorm_scale=gnorm_scale,
588 skip_zeros=config["skip_zeros"],
589 )
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/functional.py:1843, in optimizer_update_8bit_blockwise(optimizer_name, g, p, state1, state2, beta1, beta2, beta3, alpha, eps, step, lr, qmap1, qmap2, absmax1, absmax2, weight_decay, gnorm_scale, skip_zeros)
1832 is_on_gpu([p, g, state1, state2, qmap1, qmap2, absmax1, absmax2])
1834 prev_device = pre_call(g.device)
1835 optim_func(
1836 get_ptr(p),
1837 get_ptr(g),
1838 get_ptr(state1),
1839 get_ptr(state2),
1840 ct.c_float(beta1),
1841 ct.c_float(beta2),
1842 ct.c_float(beta3),
-> 1843 ct.c_float(alpha),
1844 ct.c_float(eps),
1845 ct.c_int32(step),
1846 ct.c_float(lr),
1847 get_ptr(qmap1),
1848 get_ptr(qmap2),
1849 get_ptr(absmax1),
1850 get_ptr(absmax2),
1851 ct.c_float(weight_decay),
1852 ct.c_float(gnorm_scale),
1853 ct.c_bool(skip_zeros),
1854 ct.c_int32(g.numel()),
1855 )
1856 post_call(prev_device)
TypeError: must be real number, not NoneType
-- that's the error I'm getting now, with Unsloth installed the new way. My guess is that something's wrong with the quantization, but it's hard to debug from within Kaggle.
@StrangeTcy Ok that looks like a bitsandbytes issue - will investigate
@StrangeTcy If you initially trained with bitsandbytes < 0.44 and then tried to resume training with 0.44+, this can happen. I would recommend trying again with bitsandbytes==0.43.3.
@danielhanchen I also saw this question come up on Discord. Will try to have that fixed in the next bitsandbytes patch release.
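For anyone else hitting this traceback: whatever the exact code path inside bitsandbytes, the final TypeError is simply what ctypes raises when asked to build a c_float from None, which fits an optimizer config restored from an older checkpoint that never saved the newer alpha entry (an inferred reading of the version mismatch described above, not a confirmed root cause). A minimal sketch of just that failure mode, using a made-up config dict as a stand-in for the restored state:

import ctypes as ct

# Made-up stand-in for an optimizer config restored from a checkpoint written by
# an older bitsandbytes: the newer "alpha" entry was never saved, so it reads as None.
config = {"lr": 2e-4, "weight_decay": 0.01}
alpha = config.get("alpha")   # -> None

# Raises the same error shown at the bottom of the traceback:
# TypeError: must be real number, not NoneType
ct.c_float(alpha)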
@matthewdouglas interesting; I'll try that & report back, thanks
ETA: yes, apparently that works, thanks!
Thanks @matthewdouglas! :) Sorry about the issue @StrangeTcy
I'm using a slightly modified notebook (like https://colab.research.google.com/drive/1mvwsIQWDs2EdZxZQF9pRGnnOvE86MVvR?usp=sharing) to finetune a Qwen2 model; specifically, my installation instructions are:
When it comes to resuming my training, the
trainer_stats = trainer.train(resume_from_checkpoint=True)
cell runs into the following error:

I think something's gone wrong compatibility-wise. I've tried using different versions of pytorch, accelerate, transformers and trl, but the issue persists.

Please advise.
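For completeness, here is a minimal sketch of the resume setup being described, mirroring the Unsloth/TRL API used in the linked notebooks; the model name, toy dataset, and hyperparameters below are placeholders, not the notebook's actual settings:

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported

# Placeholder 4-bit Qwen2 checkpoint and a toy dataset; the real notebook loads
# its own model and a full instruction dataset instead.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2-1.5B-bnb-4bit",
    max_seq_length = 1024,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
)
dataset = Dataset.from_dict({"text": ["### Question: hi\n### Answer: hello"] * 64})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 1024,
    args = TrainingArguments(
        output_dir = "outputs",           # resume_from_checkpoint=True looks for checkpoint-* folders here
        save_steps = 10,                  # at least one checkpoint must exist before resuming
        max_steps = 30,
        per_device_train_batch_size = 2,
        optim = "adamw_8bit",             # the bitsandbytes 8-bit optimizer that appears in the traceback
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
    ),
)

# Continues training from the most recent checkpoint saved under output_dir.
trainer_stats = trainer.train(resume_from_checkpoint = True)

The resume call only works if a previous run with the same output_dir already saved a checkpoint, and, per the discussion above, resuming a checkpoint written by bitsandbytes < 0.44 with 0.44+ is what triggered the TypeError here.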