unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536. #19

Closed FlatMapIO closed 8 months ago

FlatMapIO commented 9 months ago

Env:

Traceback:

RuntimeError                              Traceback (most recent call last)
/workspaces/unsloth-train-playground/train.ipynb Cell 6 line 1
----> 1 trainer_stats = trainer.train()

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:280, in SFTTrainer.train(self, *args, **kwargs)
    277 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    278     self.model = self._trl_activate_neftune(self.model)
--> 280 output = super().train(*args, **kwargs)
    282 # After training we make sure to retrieve back the original forward pass method
    283 # for the embedding layer by removing the forward post hook.
    284 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553         hf_hub_utils.enable_progress_bars()
   1554 else:
-> 1555     return inner_training_loop(
   1556         args=args,
   1557         resume_from_checkpoint=resume_from_checkpoint,
   1558         trial=trial,
   1559         ignore_keys_for_eval=ignore_keys_for_eval,
   1560     )

File /workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1857     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
...
     24                        f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
     25 num_warps = 4
     26 if   BLOCK_SIZE >= 32768: num_warps = 32

RuntimeError: Cannot launch Triton kernel since n = 102400 exceeds the maximum CUDA blocksize = 65536.
danielhanchen commented 9 months ago

@FlatMapIO Sadly, this is a known limitation; I forgot to document it as one, and I'm working on it.

The issue is that the maximum number of CUDA threads is 2^16 = 65536, while Deepseek's vocab_size is a bit larger (i.e. 102400). The next power of 2 after 102400 is 2^17 = 131072, which will not work with the current Unsloth implementation.
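
For reference, a minimal illustration of the size check from the (abridged) kernel code above, assuming Triton's `next_power_of_2` helper:

```python
# Minimal illustration of the block-size check, not the full kernel code
import triton

MAX_FUSED_SIZE = 65536                  # 2**16, the kernel's maximum block size
n = 102400                              # Deepseek's vocab_size
BLOCK_SIZE = triton.next_power_of_2(n)  # -> 131072
print(BLOCK_SIZE > MAX_FUSED_SIZE)      # True, hence the RuntimeError above
```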

I'll have to rewrite the cross_entropy_loss code to support splitting the calculation into multiple grids, then do a final reduction step for logsumexp.
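
Roughly, the math behind that split (a minimal PyTorch sketch of the idea, not the actual Triton kernel): compute a partial logsumexp over each vocab chunk that fits in one block, then reduce the partials with a final logsumexp.

```python
# Sketch only: chunked logsumexp, equivalent to logsumexp over the full vocab
import torch

def chunked_logsumexp(logits: torch.Tensor, chunk_size: int = 65536) -> torch.Tensor:
    # logits: (batch, vocab_size); each chunk is small enough for one block
    partials = [
        torch.logsumexp(logits[:, i : i + chunk_size], dim=-1)
        for i in range(0, logits.shape[-1], chunk_size)
    ]
    # Final reduction: logsumexp of the per-chunk partial results
    return torch.logsumexp(torch.stack(partials, dim=-1), dim=-1)

# Cross entropy loss then follows as logsumexp(logits) - logits[label]
```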

If this is a more popular request - I will implement it!! So more likes and stars would be helpful!!

alexconstant9108 commented 9 months ago

@danielhanchen 100% worth the effort! Deepseek-Coder-33B is an amazing coder, and then there is Deepseek-LLM-67B, which is even better (although twice as big). NO OTHER open model comes close to these two at coding tasks. In fact, maybe only GPT-4 is better at such tasks, but its knowledge cutoff is from some time ago, so it's less useful with newer APIs. So fine-tuning those two Deepseek models on one's code base would be a game changer. Another game changer would be even faster inference than what exllamaV2 can achieve, but that's another topic :))

danielhanchen commented 9 months ago

@FlatMapIO Ohh ok ok I'll move this up the priority stack!!

AIlaowong commented 8 months ago

I used this acceleration solution after converting Qwen to the Llama format. The device is a 3090. The same error still appears in 2024.1.

danielhanchen commented 8 months ago

@AIlaowong I haven't gotten around to supporting larger vocab sizes yet; 2024.1 still only supports a maximum of 2^16 (65536). I probably won't get to it until DPO and some other things are resolved first.

danielhanchen commented 8 months ago

So what I can do temporarily, since I can see demand for Qwen specifically (and that means Deepseek too), is to fall back to PyTorch's CrossEntropyLoss whenever the vocab size exceeds 2^16, and then implement larger vocab sizes properly in a future release. What do you all think?
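
Something along these lines (a rough sketch of the proposed dispatch, not actual Unsloth code; `fast_cross_entropy_loss` is a hypothetical stand-in for the fused Triton path):

```python
import torch
import torch.nn.functional as F

MAX_FUSED_SIZE = 65536  # 2**16, the current Triton block-size limit

def cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    vocab_size = logits.shape[-1]
    if vocab_size > MAX_FUSED_SIZE:
        # Large vocabularies (e.g. Deepseek's 102400): fall back to PyTorch
        return F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
    # Otherwise keep the fast fused Triton kernel (hypothetical name)
    return fast_cross_entropy_loss(logits, labels)
```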

AIlaowong commented 8 months ago

That's up to you! I'm genuinely excited to see the results of your updates. It's truly uplifting to know that the training speed of LLMs is being improved.

danielhanchen commented 8 months ago

Added preliminary support for ALL vocab sizes! I.e. Qwen (llamified), Deepseek, etc. are all supported now!

danielhanchen commented 8 months ago

Closing now since it's supported!