unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Using CPU when resuming training from checkpoint @ patch 2024.7 #729

Open · avcode-exe opened this issue 1 week ago

avcode-exe commented 1 week ago

Hi guys! I got the following error when using Unsloth patch 2024.7 to resume training from a checkpoint.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I did not encounter this error with the older version. Just curious, is it possible to install and use an older version of Unsloth?
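One workaround on my side might be pinning whatever revision worked before straight from git; a rough sketch (the ref below is just a placeholder, I have not checked which commit or tag is the right one):

# Kaggle notebook cell; replace <old-ref> with the commit or tag to pin (placeholder).
%pip install --force-reinstall "unsloth @ git+https://github.com/unslothai/unsloth.git@<old-ref>"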


Edit: Here is the full error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[13], line 1
----> 1 trainer_stats = trainer.train("/kaggle/working/outputs/checkpoint-525")
      2 # trainer_stats = trainer.train()

File <string>:123, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)

File <string>:422, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)

File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:157, in AcceleratedOptimizer.step(self, closure)
    154 if self.scaler is not None:
    155     self.optimizer.step = self._optimizer_patched_step_method
--> 157     self.scaler.step(self.optimizer, closure)
    158     self.scaler.update()
    160     if not self._accelerate_step_called:
    161         # If the optimizer step was skipped, gradient overflow was detected.

File /opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:452, in GradScaler.step(self, optimizer, *args, **kwargs)
    446     self.unscale_(optimizer)
    448 assert (
    449     len(optimizer_state["found_inf_per_device"]) > 0
    450 ), "No inf checks were recorded for this optimizer."
--> 452 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
    454 optimizer_state["stage"] = OptState.STEPPED
    456 return retval

File /opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:350, in GradScaler._maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs)
    348 retval: Optional[float] = None
    349 if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
--> 350     retval = optimizer.step(*args, **kwargs)
    351 return retval

File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:212, in patch_optimizer_step.<locals>.patched_step(*args, **kwargs)
    210 def patched_step(*args, **kwargs):
    211     accelerated_optimizer._accelerate_step_called = True
--> 212     return method(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:75, in LRScheduler.__init__.<locals>.with_counter.<locals>.wrapper(*args, **kwargs)
     73 instance._step_count += 1
     74 wrapped = func.__get__(instance, cls)
---> 75 return wrapped(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:385, in Optimizer.profile_hook_step.<locals>.wrapper(*args, **kwargs)
    380         else:
    381             raise RuntimeError(
    382                 f"{func} must return None or a tuple of (new_args, new_kwargs), but got {result}."
    383             )
--> 385 out = func(*args, **kwargs)
    386 self._optimizer_step_code()
    388 # call optimizer step post hooks

File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/galore_torch/adamw8bit.py:52, in AdamW8bit.step(self, closure)
     49     group['weight_decay_saved'] = group['weight_decay']
     50     group['weight_decay'] = 0
---> 52 grad = state["projector"].project(p.grad, state["step"])
     54 # suboptimal implementation
     55 p.saved_data = p.data.clone()

File /opt/conda/lib/python3.10/site-packages/galore_torch/galore_projector.py:22, in GaLoreProjector.project(self, full_rank_grad, iter)
     20         if self.ortho_matrix is None or iter % self.update_proj_gap == 0:
     21             self.ortho_matrix = self.get_orthogonal_matrix(full_rank_grad, self.rank, type='left')
---> 22         low_rank_grad = torch.matmul(self.ortho_matrix.t(), full_rank_grad)
     23 elif self.proj_type == 'reverse_std':
     24     if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
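
From the last frame it looks like the projector's ortho_matrix is still on the CPU after the checkpoint restore while the gradient is on cuda:0. As a rough, untested sketch, something like this could move every tensor in the restored optimizer state (including tensors hanging off non-tensor objects such as the projector) back onto the GPU; where it would have to be called depends on when the patched training loop loads the optimizer state:

import torch

def move_optimizer_state_to(optimizer, device):
    # Move every tensor held in the optimizer state to `device`.
    # Plain entries (exp_avg, exp_avg_sq, step, ...) are moved directly;
    # for non-tensor state objects (e.g. a GaLore projector) any tensor
    # attributes such as ortho_matrix are moved as well.
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            if torch.is_tensor(value):
                param_state[key] = value.to(device)
            elif hasattr(value, "__dict__"):
                for attr, attr_value in vars(value).items():
                    if torch.is_tensor(attr_value):
                        setattr(value, attr, attr_value.to(device))

A TrainerCallback might be one place to call it, since on_train_begin receives the optimizer in its kwargs, but I have not checked at which point the optimizer state is actually restored here.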

danielhanchen commented 1 week ago

Oh is this for Galore?

avcode-exe commented 1 week ago

Yeah