unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead #1184

Open Brightatkmitl opened 4 weeks ago

Brightatkmitl commented 4 weeks ago

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors: 100% 5.70G/5.70G [00:15<00:00, 514MB/s]
generation_config.json: 100% 198/198 [00:00<00:00, 17.2kB/s]
tokenizer_config.json: 100% 50.6k/50.6k [00:00<00:00, 240kB/s]
tokenizer.json: 100% 9.09M/9.09M [00:01<00:00, 5.98MB/s]
special_tokens_map.json: 100% 350/350 [00:00<00:00, 29.3kB/s]
Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version! Please update transformers, TRL and unsloth via:
pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 82,314 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git

RuntimeError                              Traceback (most recent call last)
in <cell line: 2>()
      1 # Train the model
----> 2 trainer.train()
      3
      4 # Evaluate the model on the validation dataset
      5 val_results = trainer.evaluate(eval_dataset=val_ds)

14 frames
/usr/local/lib/python3.10/dist-packages/unsloth/kernels/fast_lora.py in backward(ctx, dY)
    134         # dX += matmul_lora(de, gateW.t(), gateW_quant, gateB, gateA, gateS)
    135         upW = fast_dequantize(upW.t(), upW_quant)
--> 136         dX = torch.matmul(df, upW.t(), out = X if ctx.inplace else None)
    137         del upW
    138         dX += df @ upB.to(dtype).t() @ (upS * upA.to(dtype).t())

RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead

This is my notebook: https://colab.research.google.com/drive/11YeMfZmm9HKtNNIqC9WzXAfyO4LQEoqg?usp=sharing
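For context on where this blows up: line 136 of fast_lora.py writes a matmul result into a reused buffer via out=. Below is a minimal sketch of the same dtype mismatch outside of Unsloth; the shapes and names are illustrative only, not the real LoRA buffers.

```python
import torch

# Illustrative shapes/names only; these are not the actual LoRA tensors.
df  = torch.randn(4, 8,  dtype=torch.bfloat16)   # gradient in bfloat16
upW = torch.randn(16, 8, dtype=torch.bfloat16)   # dequantized weight in bfloat16
X   = torch.empty(4, 16, dtype=torch.float32)    # reused "out" buffer, but float32

# The matmul result is bfloat16, yet it is asked to write into a float32 buffer,
# which raises an error of the form:
#   RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
torch.matmul(df, upW.t(), out=X)
```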

danielhanchen commented 4 weeks ago

@Brightatkmitl Your notebook looks fine?

Brightatkmitl commented 3 weeks ago

@danielhanchen It is still not working. Any suggestions to help? TT

Brightatkmitl commented 3 weeks ago

RuntimeError                              Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

14 frames
/usr/local/lib/python3.10/dist-packages/unsloth/tokenizer_utils.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)

/usr/local/lib/python3.10/dist-packages/unsloth/models/llama.py in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in training_step(failed resolving arguments)
   3347                 scaled_loss.backward()
   3348         else:
-> 3349             self.accelerator.backward(loss, **kwargs)
   3350
   3351         return loss.detach() / self.args.gradient_accumulation_steps

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py in backward(self, loss, **kwargs)
   2194             self.lomo_backward(loss, learning_rate)
   2195         else:
-> 2196             loss.backward(**kwargs)
   2197
   2198     def set_trigger(self):

/usr/local/lib/python3.10/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    579             inputs=inputs,
    580         )
--> 581         torch.autograd.backward(
    582             self, gradient, retain_graph, create_graph, inputs=inputs
    583         )

/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    345     # some Python versions print out the first line of a multi-line function
    346     # calls in the traceback and some print out the last line
--> 347     _engine_run_backward(
    348         tensors,
    349         grad_tensors,

/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py in _engine_run_backward(t_outputs, *args, **kwargs)
    823         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    824     try:
--> 825         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    826             t_outputs, *args, **kwargs
    827         )  # Calls into the C++ engine to run the backward pass

/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py in apply(self, *args)
    305             )
    306         user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 307         return user_fn(self, *args)
    308
    309     def apply_jvp(self, *args):

/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py in decorate_bwd(*args, **kwargs)
    509             dtype=args[0]._dtype,
    510         ):
--> 511             return bwd(*args, **kwargs)
    512
    513     return decorate_bwd

/usr/local/lib/python3.10/dist-packages/unsloth/models/_utils.py in backward(ctx, dY)
    820         with torch.enable_grad():
    821             (output,) = ctx.forward_function(hidden_states, *ctx.args)
--> 822         torch.autograd.backward(output, dY)
    823         return (None, hidden_states.grad,) + (None,)*len(ctx.args)
    824     pass

/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    345     # some Python versions print out the first line of a multi-line function
    346     # calls in the traceback and some print out the last line
--> 347     _engine_run_backward(
    348         tensors,
    349         grad_tensors,

/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py in _engine_run_backward(t_outputs, *args, **kwargs)
    823         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    824     try:
--> 825         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    826             t_outputs, *args, **kwargs
    827         )  # Calls into the C++ engine to run the backward pass

/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py in apply(self, *args)
    305             )
    306         user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn
--> 307         return user_fn(self, *args)
    308
    309     def apply_jvp(self, *args):

/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py in decorate_bwd(*args, **kwargs)
    509             dtype=args[0]._dtype,
    510         ):
--> 511             return bwd(*args, **kwargs)
    512
    513     return decorate_bwd

/usr/local/lib/python3.10/dist-packages/unsloth/kernels/fast_lora.py in backward(ctx, dY)
    134         # dX += matmul_lora(de, gateW.t(), gateW_quant, gateB, gateA, gateS)
    135         upW = fast_dequantize(upW.t(), upW_quant)
--> 136         dX = torch.matmul(df, upW.t(), out = X if ctx.inplace else None)
    137         del upW
    138         dX += df @ upB.to(dtype).t() @ (upS * upA.to(dtype).t())

RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead

danielhanchen commented 3 weeks ago

@Brightatkmitl Did you set bf16 = True in the training args?
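For reference, a sketch of what that looks like, mirroring the standard Unsloth notebooks; the batch/step values below are just the ones from the log above, not a recommendation:

```python
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Sketch only: keep fp16/bf16 consistent with the dtype the model was loaded in.
# On a bfloat16-capable GPU such as the A100, this selects bf16=True.
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=60,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    output_dir="outputs",
)
```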

Erland366 commented 3 weeks ago

This issue happens because you use dtype = torch.float16 while your GPU supports bfloat16. I think somewhere in the Unsloth codebase there is dtype autodetection, so bfloat16 ends up being used instead of the float16 the user passed in. I am thinking of fixing this, but I am not sure what the use case is for float16 when your GPU supports bfloat16. As a workaround, loading the model with dtype = None (auto-detect) or torch.bfloat16 should avoid the mismatch; see the sketch below.
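A minimal sketch of that workaround (the model name, sequence length, and 4-bit flag below are placeholders, not values from the user's notebook):

```python
import torch
from unsloth import FastLanguageModel

# Placeholder model/settings for illustration. dtype=None lets Unsloth pick the
# dtype (bfloat16 on Ampere GPUs such as the A100); torch.bfloat16 also works.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # placeholder
    max_seq_length=2048,
    dtype=None,            # avoid forcing torch.float16 on a bfloat16-capable GPU
    load_in_4bit=True,
)
```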

wdyt @danielhanchen