unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Error in triton while running unsloth/mistral instruct v0.2 #440

Open xlar-sanjeet opened 2 months ago

xlar-sanjeet commented 2 months ago

I used the following commands to create the environment:


conda create --name unsloth_env python=3.10
conda activate unsloth_env

conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps trl peft accelerate bitsandbytes
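
(Note: the pytorch-cuda=<12.1/11.8> placeholder above has to be collapsed to a single concrete version before running the command. As a sketch, assuming a machine with CUDA 12.1 drivers, that line would become:)

conda install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers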


Here is the full traceback of the error:



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 156,533 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 117,399
 "-____-"     Number of trainable parameters = 41,943,040

/tmp/tmpl4se9f_s/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpl4se9f_s/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpl4se9f_s/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpl4se9f_s/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpl4se9f_s/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^

CalledProcessError                        Traceback (most recent call last)
Cell In[26], line 1
----> 1 trainer.train()

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:361, in SFTTrainer.train(self, *args, **kwargs)
    358 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    359     self.model = self._trl_activate_neftune(self.model)
--> 361 output = super().train(*args, **kwargs)
    363 # After training we make sure to retrieve back the original forward pass method
    364 # for the embedding layer by removing the forward post hook.
    365 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py:1859, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1857     hf_hub_utils.enable_progress_bars()
   1858 else:
-> 1859     return inner_training_loop(
   1860         args=args,
   1861         resume_from_checkpoint=resume_from_checkpoint,
   1862         trial=trial,
   1863         ignore_keys_for_eval=ignore_keys_for_eval,
   1864     )

File <string>:361, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py:3138, in Trainer.training_step(self, model, inputs)
   3135     return loss_mb.reduce_mean().detach().to(self.args.device)
   3137 with self.compute_loss_context_manager():
-> 3138     loss = self.compute_loss(model, inputs)
   3140 if self.args.n_gpu > 1:
   3141     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py:3161, in Trainer.compute_loss(self, model, inputs, return_outputs)
   3159 else:
   3160     labels = None
-> 3161 outputs = model(**inputs)
   3162 # Save past state if it exists
   3163 # TODO: this needs to be fixed and made cleaner later.
   3164 if self.args.past_index >= 0:

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/accelerate/utils/operations.py:822, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    821 def forward(*args, **kwargs):
--> 822     return model_forward(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/accelerate/utils/operations.py:810, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    809 def __call__(self, *args, **kwargs):
--> 810     return convert_to_fp32(self.model_forward(*args, **kwargs))

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     13 @functools.wraps(func)
     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py:882, in PeftModelForCausalLM_fast_forward(self, input_ids, causal_mask, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
    869 def PeftModelForCausalLM_fast_forward(
    870     self,
    871     input_ids=None,
   (...)
    880     **kwargs,
    881 ):
--> 882     return self.base_model(
    883         input_ids=input_ids,
    884         causal_mask=causal_mask,
    885         attention_mask=attention_mask,
    886         inputs_embeds=inputs_embeds,
    887         labels=labels,
    888         output_attentions=output_attentions,
    889         output_hidden_states=output_hidden_states,
    890         return_dict=return_dict,
    891         **kwargs,
    892     )

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:161, in BaseTuner.forward(self, *args, **kwargs)
    160 def forward(self, *args: Any, **kwargs: Any):
--> 161     return self.model.forward(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164     output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/mistral.py:213, in MistralForCausalLM_fast_forward(self, input_ids, causal_mask, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, *args, **kwargs)
    205     outputs = LlamaModel_fast_forward_inference(
    206         self,
    207         input_ids,
   (...)
    210         attention_mask = attention_mask,
    211     )
    212 else:
--> 213     outputs = self.model(
    214         input_ids=input_ids,
    215         causal_mask=causal_mask,
    216         attention_mask=attention_mask,
    217         position_ids=position_ids,
    218         past_key_values=past_key_values,
    219         inputs_embeds=inputs_embeds,
    220         use_cache=use_cache,
    221         output_attentions=output_attentions,
    222         output_hidden_states=output_hidden_states,
    223         return_dict=return_dict,
    224     )
    225 pass
    227 hidden_states = outputs[0]

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164     output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py:650, in LlamaModel_fast_forward(self, input_ids, causal_mask, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, *args, **kwargs)
    647 past_key_value = past_key_values[idx] if past_key_values is not None else None
    649 if offloaded_gradient_checkpointing:
--> 650     hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
    651         decoder_layer,
    652         hidden_states,
    653         causal_mask,
    654         attention_mask,
    655         position_ids,
    656         past_key_values,
    657         output_attentions,
    658         use_cache,
    659     )
    661 elif gradient_checkpointing:
    662     def create_custom_forward(module):

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/autograd/function.py:598, in Function.apply(cls, *args, **kwargs)
    595 if not torch._C._are_functorch_transforms_active():
    596     # See NOTE: [functorch vjp and autograd interaction]
    597     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 598     return super().apply(*args, **kwargs)  # type: ignore[misc]
    600 if not is_setup_ctx_defined:
    601     raise RuntimeError(
    602         "In order to use an autograd.Function with functorch transforms "
    603         "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
    604         "staticmethod. For more details, please see "
    605         "https://pytorch.org/docs/master/notes/extending.func.html"
    606     )

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py:115, in custom_fwd.<locals>.decorate_fwd(*args, **kwargs)
    113 if cast_inputs is None:
    114     args[0]._fwd_used_autocast = torch.is_autocast_enabled()
--> 115     return fwd(*args, **kwargs)
    116 else:
    117     autocast_context = torch.is_autocast_enabled()

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/_utils.py:369, in Unsloth_Offloaded_Gradient_Checkpointer.forward(ctx, forward_function, hidden_states, *args)
    367 saved_hidden_states = hidden_states.to("cpu", non_blocking = True)
    368 with torch.no_grad():
--> 369     (output,) = forward_function(hidden_states, *args)
    370 ctx.save_for_backward(saved_hidden_states)
    371 ctx.forward_function = forward_function

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164     output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/models/llama.py:432, in LlamaDecoderLayer_fast_forward(self, hidden_states, causal_mask, attention_mask, position_ids, past_key_value, output_attentions, use_cache, padding_mask, *args, **kwargs)
    430 else:
    431     residual = hidden_states
--> 432     hidden_states = fast_rms_layernorm(self.input_layernorm, hidden_states)
    433     hidden_states, self_attn_weights, present_key_value = self.self_attn(
    434         hidden_states=hidden_states,
    435         causal_mask=causal_mask,
   (...)
    441         padding_mask=padding_mask,
    442     )
    443     hidden_states = residual + hidden_states

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/kernels/rms_layernorm.py:190, in fast_rms_layernorm(layernorm, X, gemma)
    188 W = layernorm.weight
    189 eps = layernorm.variance_epsilon
--> 190 out = Fast_RMS_Layernorm.apply(X, W, eps, gemma)
    191 return out

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/torch/autograd/function.py:598, in Function.apply(cls, *args, **kwargs)
    595 if not torch._C._are_functorch_transforms_active():
    596     # See NOTE: [functorch vjp and autograd interaction]
    597     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 598     return super().apply(*args, **kwargs)  # type: ignore[misc]
    600 if not is_setup_ctx_defined:
    601     raise RuntimeError(
    602         "In order to use an autograd.Function with functorch transforms "
    603         "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
    604         "staticmethod. For more details, please see "
    605         "https://pytorch.org/docs/master/notes/extending.func.html"
    606     )

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/unsloth/kernels/rms_layernorm.py:144, in Fast_RMS_Layernorm.forward(ctx, X, W, eps, gemma)
    141 r = torch.empty(n_rows, dtype = torch.float32, device = "cuda")
    143 fx = _gemma_rms_layernorm_forward if gemma else _rms_layernorm_forward
--> 144 fx[(n_rows,)](
    145     Y, Y.stride(0),
    146     X, X.stride(0),
    147     W, W.stride(0),
    148     r, r.stride(0),
    149     n_cols, eps,
    150     BLOCK_SIZE = BLOCK_SIZE,
    151     num_warps = num_warps,
    152 )
    153 ctx.eps = eps
    154 ctx.BLOCK_SIZE = BLOCK_SIZE

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/jit.py:167, in KernelInterface.__getitem__.<locals>.<lambda>(*args, **kwargs)
    161 def __getitem__(self, grid) -> T:
    162     """
    163     A JIT function is launched with: fn[grid](*args, **kwargs).
    164     Hence JITFunction.__getitem__ returns a callable proxy that
    165     memorizes the grid.
    166     """
--> 167     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/jit.py:363, in JITFunction.run(self, grid, warmup, *args, **kwargs)
    361 assert "stream" not in kwargs, "stream option is deprecated; current stream will be used"
    362 # parse options
--> 363 device = driver.get_current_device()
    364 stream = driver.get_current_stream(device)
    365 target = driver.get_current_target()

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/driver.py:209, in LazyProxy.__getattr__(self, name)
    208 def __getattr__(self, name):
--> 209     self._initialize_obj()
    210     return getattr(self._obj, name)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/driver.py:206, in LazyProxy._initialize_obj(self)
    204 def _initialize_obj(self):
    205     if self._obj is None:
--> 206         self._obj = self._init_fn()

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/driver.py:239, in initialize_driver()
    237     return HIPDriver()
    238 elif torch.cuda.is_available():
--> 239     return CudaDriver()
    240 else:
    241     return UnsupportedDriver()

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/driver.py:102, in CudaDriver.__init__(self)
    101 def __init__(self):
--> 102     self.utils = CudaUtils()
    103     self.backend = self.CUDA
    104     self.binary_ext = "cubin"

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/runtime/driver.py:49, in CudaUtils.__init__(self)
     47 with open(src_path, "w") as f:
     48     f.write(src)
---> 49 so = _build("cuda_utils", src_path, tmpdir)
     50 with open(so, "rb") as f:
     51     cache_path = cache.put(f.read(), fname, binary=True)

File ~/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/common/build.py:106, in _build(name, src, srcdir)
    101 cc_cmd = [
    102     cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda",
    103     "-o", so
    104 ]
    105 cc_cmd += [f"-L{dir}" for dir in cuda_lib_dirs]
--> 106 ret = subprocess.check_call(cc_cmd)
    108 if ret == 0:
    109     return so

File ~/.conda/envs/unsloth_env/lib/python3.10/subprocess.py:369, in check_call(*popenargs, **kwargs)
    367     if cmd is None:
    368         cmd = popenargs[0]
--> 369     raise CalledProcessError(retcode, cmd)
    370 return 0

CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpl4se9f_s/main.c', '-O3', '-I/home/chemical/phd/chz208394/.conda/envs/unsloth_env/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/home/chemical/phd/chz208394/.conda/envs/unsloth_env/include/python3.10', '-I/tmp/tmpl4se9f_s', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpl4se9f_s/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.



Any help will be greatly appreciated.

Thank you

danielhanchen commented 2 months ago

Oh, that's a weird error - I will try Conda installs and get back to you.

xlar-sanjeet commented 2 months ago

Thank you for the reply. A few hours ago the issue was resolved by loading a newer gcc compiler. No need to worry about it. Thanks.
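
For anyone else landing on this issue: the gcc messages at the top of the traceback ("'for' loop initial declarations are only allowed in C99 mode") mean Triton is JIT-compiling its small cuda_utils C helper with a system gcc whose default standard predates C99 (typical of gcc 4.x, e.g. the default /usr/bin/gcc on older distros). A minimal sketch of checking and switching to a newer compiler, assuming either an HPC module system or conda-forge is available (the module name and package version below are illustrative, not prescribed by this thread):

# 1. Check which gcc Triton will pick up; anything older than gcc 5 defaults to a pre-C99 standard
gcc --version

# 2a. On a cluster with environment modules: load a newer gcc before training
module load compiler/gcc/9.1.0
which gcc && gcc --version

# 2b. On a conda-only machine: install a newer gcc into the env and point CC at it
#     (the compiler binary name can differ per platform; putting the newer gcc first on PATH also works)
conda install -c conda-forge gcc_linux-64
export CC=$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-cc

# 3. Re-run trainer.train(); the Triton cuda_utils build step should now succeed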

vincent775 commented 2 months ago

Hello, I also encountered the same problem. Can you explain in detail how to solve it? Thank you so much

xlar-sanjeet commented 2 months ago

Are you working on a local system?


vincent775 commented 2 months ago

On an AWS server. (screenshot of the environment attached)

vincent775 commented 2 months ago

The environment I just created

xlar-sanjeet commented 2 months ago

I was also working on a remote server.

My issue was resolved after loading the following modules into the environment (the equivalent commands are spelled out after the list):

1) lib/isl/0.18/gnu
2) compiler/gcc/9.1.0
3) compiler/gcc/9.1/openmpi/4.1.2
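
In command form, on a cluster that uses environment modules, that is roughly the following (module names are site-specific, so check what module avail gcc lists on your system):

module load lib/isl/0.18/gnu
module load compiler/gcc/9.1.0
module load compiler/gcc/9.1/openmpi/4.1.2
gcc --version   # should now report 9.1.0, which is new enough for Triton's C99 compile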


IanSmith123 commented 1 week ago

https://github.com/QwenLM/Qwen/issues/1199#issuecomment-2046492182

pip uninstall triton