unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

load_in_4bit should be False by default. #588

Open ronakk-google opened 5 months ago

ronakk-google commented 5 months ago

All other language-model libraries load a model in its default precision unless quantization is explicitly requested. https://github.com/unslothai/unsloth/blob/27fa021a7bb959a53667dd4e7cdb9598c207aa0d/unsloth/models/loader.py#L73C9-L73C21 shows that load_in_4bit is set to True even when the user never explicitly asks for or expects it. Either the default should be False, or the documentation should be updated to call this out. See the sketch at the end of this comment for the behavior in question.

I'm willing to review any PRs for this issue.
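
To illustrate, here is roughly how the call behaves today (the model name and max_seq_length are illustrative, not prescribed values):

```python
from unsloth import FastLanguageModel

# With the current default, this loads the weights 4-bit quantized
# even though load_in_4bit is never mentioned.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",  # illustrative model name
    max_seq_length=2048,
)

# Full precision has to be requested explicitly:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",  # illustrative model name
    max_seq_length=2048,
    load_in_4bit=False,
)
```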

danielhanchen commented 5 months ago

Oh, we set it to True because people might not set it and then load in 16-bit, which might overflow memory on their GPU

ronakk-google commented 5 months ago

I sort of see where you're going with that, but my concern is that since it's not documented, the assumption was that the model would only be set up in 4-bit when you explicitly specify load_in_4bit=True. I believe that's how HF transformers, PyTorch, DeepSpeed, and all the other libraries do it. IMO the flag should either be made consistent with how others handle it, or this should be made explicitly clear in a warning or in the documentation.
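
For comparison, HF transformers only quantizes when the caller opts in, roughly like this (a minimal sketch; the model name is illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Default: no quantization unless the caller asks for it.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# 4-bit loading is opt-in via an explicit quantization config.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
)
```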

danielhanchen commented 5 months ago

Fair points - might add a warning!
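
A hypothetical sketch of what such a warning could look like (the sentinel-default pattern and the helper name are illustrative assumptions, not the actual loader code):

```python
import warnings

def resolve_load_in_4bit(load_in_4bit=None):
    """Hypothetical helper: treat an unset flag as 'use the 4-bit
    default', but tell the user that the default kicked in."""
    if load_in_4bit is None:
        warnings.warn(
            "Unsloth: load_in_4bit defaults to True, so the model will be "
            "loaded in 4-bit. Pass load_in_4bit=False for 16-bit weights.",
            UserWarning,
        )
        return True
    return load_in_4bit
```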