For workloads such as QLoRA, we can save and upload pre-quantized model weights (or reuse existing ones), which would have a couple of benefits (a sketch of this flow follows the list):
- Allow users to save disk space by working only with 4-bit precision checkpoints.
- Avoid the overhead of running QLoRA-style quantization every time before training can start.
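Below is a minimal sketch of what producing such a checkpoint could look like. It uses simple per-block absmax int4 quantization as a stand-in for the NF4 scheme QLoRA actually uses, and the helper names, block size, and file path are all illustrative, not an existing API:

```python
import torch

BLOCK_SIZE = 64  # assumption: weights are quantized in flat blocks of 64 values

def quantize_4bit(w: torch.Tensor) -> dict:
    """Quantize a weight to packed 4-bit codes plus per-block fp32 scales.
    Assumes w.numel() is divisible by BLOCK_SIZE."""
    flat = w.reshape(-1, BLOCK_SIZE).float()
    scales = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    # Map each value to an unsigned 4-bit code (signed [-8, 7] shifted by +8).
    codes = ((flat / scales * 7).round().clamp(-8, 7) + 8).to(torch.uint8)
    # Pack two 4-bit codes into each uint8 to halve storage.
    packed = (codes[:, ::2] << 4) | codes[:, 1::2]
    return {"packed": packed, "scales": scales, "shape": torch.tensor(w.shape)}

def dequantize_4bit(q: dict) -> torch.Tensor:
    """Recover a bf16 approximation of the original weight."""
    hi = (q["packed"] >> 4).to(torch.int8) - 8
    lo = (q["packed"] & 0xF).to(torch.int8) - 8
    codes = torch.stack([hi, lo], dim=2).reshape(q["packed"].shape[0], -1)
    flat = codes.float() / 7 * q["scales"]
    return flat.reshape(tuple(q["shape"].tolist())).to(torch.bfloat16)

# Only the frozen base weights get the 4-bit treatment; LoRA adapter params
# would be saved in their original dtype alongside these payloads.
state_dict = {"layer.weight": torch.randn(256, 256, dtype=torch.bfloat16)}
quantized_sd = {k: quantize_4bit(v) for k, v in state_dict.items()}
torch.save(quantized_sd, "model-4bit.pt")  # hypothetical path
```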
This would of course come with the downside of reduced interoperability for these particular checkpoints (since offramp paths typically consume bf16 checkpoints), but users would still have the option to save them in bf16 after training, mitigating this concern.
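That bf16 offramp is cheap to provide: reusing the `dequantize_4bit` helper from the sketch above, a standard bf16 checkpoint can be materialized after training (paths again hypothetical):

```python
# Expand the 4-bit payloads back to bf16 so downstream tooling that expects
# bf16 checkpoints can consume the result.
quantized_sd = torch.load("model-4bit.pt")
bf16_sd = {k: dequantize_4bit(v) for k, v in quantized_sd.items()}
torch.save(bf16_sd, "model-bf16.pt")  # hypothetical path
```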
This also has the added benefit of lowering peak memory during model initialization, since we never need to allocate bf16 tensors for the quantized portions of the model.
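One way to realize that init-time saving is to construct the model on the meta device and assign the pre-quantized payloads directly. The `Int4Linear` module here is hypothetical (its on-the-fly dequantizing `forward` is elided), and the snippet reuses the file saved in the first sketch:

```python
import torch
import torch.nn as nn

class Int4Linear(nn.Module):
    """Hypothetical linear layer that holds only the packed 4-bit payload.
    It never owns a bf16 copy of the weight; dequantization would happen
    on the fly in forward() (elided here)."""
    def __init__(self, out_features: int, in_features: int):
        super().__init__()
        n_blocks = out_features * in_features // 64
        # Registered as buffers so load_state_dict can populate them.
        self.register_buffer("packed", torch.empty(n_blocks, 32, dtype=torch.uint8))
        self.register_buffer("scales", torch.empty(n_blocks, 1, dtype=torch.float32))

# Construct on the meta device: no real storage is allocated at init, so peak
# memory never includes a bf16 copy of the quantized weights.
with torch.device("meta"):
    model = nn.Sequential(Int4Linear(256, 256))

# Load the pre-quantized payload straight into the meta-initialized model.
# assign=True swaps the meta buffers for the loaded tensors instead of
# copying into (nonexistent) meta storage.
quantized_sd = torch.load("model-4bit.pt")  # from the earlier sketch
flat_sd = {"0.packed": quantized_sd["layer.weight"]["packed"],
           "0.scales": quantized_sd["layer.weight"]["scales"]}
model.load_state_dict(flat_sd, assign=True)
```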