For workloads such as QLoRA, we can save and upload pre-quantized model weights (or reuse existing ones), which would have a couple of benefits (a sketch of this flow follows the list):
- Allow users to save disk space by working only with 4-bit precision checkpoints.
- Avoid the overhead of running QLoRA-style quantization every time before training can start.
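Below is a minimal sketch of what producing such a checkpoint could look like. It uses simple per-block absmax int4 quantization as a stand-in for the NF4 scheme QLoRA actually uses, and the helper names, block size, and file path are all illustrative, not an existing API:

```python
import torch

BLOCK_SIZE = 64  # assumption: weights are quantized in flat blocks of 64 values

def quantize_4bit(w: torch.Tensor) -> dict:
    """Quantize a weight to packed 4-bit codes plus per-block fp32 scales.
    Assumes w.numel() is divisible by BLOCK_SIZE."""
    flat = w.reshape(-1, BLOCK_SIZE).float()
    scales = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    # Map each value to an unsigned 4-bit code (signed [-8, 7] shifted by +8).
    codes = ((flat / scales * 7).round().clamp(-8, 7) + 8).to(torch.uint8)
    # Pack two 4-bit codes into each uint8 to halve storage.
    packed = (codes[:, ::2] << 4) | codes[:, 1::2]
    return {"packed": packed, "scales": scales, "shape": torch.tensor(w.shape)}

def dequantize_4bit(q: dict) -> torch.Tensor:
    """Recover a bf16 approximation of the original weight."""
    hi = (q["packed"] >> 4).to(torch.int8) - 8
    lo = (q["packed"] & 0xF).to(torch.int8) - 8
    codes = torch.stack([hi, lo], dim=2).reshape(q["packed"].shape[0], -1)
    flat = codes.float() / 7 * q["scales"]
    return flat.reshape(tuple(q["shape"].tolist())).to(torch.bfloat16)

# Only the frozen base weights get the 4-bit treatment; LoRA adapter params
# would be saved in their original dtype alongside these payloads.
state_dict = {"layer.weight": torch.randn(256, 256, dtype=torch.bfloat16)}
quantized_sd = {k: quantize_4bit(v) for k, v in state_dict.items()}
torch.save(quantized_sd, "model-4bit.pt")  # hypothetical path
```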
This would of course come with the downside of reduced interoperability for these particular checkpoints (since offramp paths typically consume bf16 checkpoints), but users would still have the option to save them in bf16 after training, mitigating this concern.
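That bf16 offramp is cheap to provide: reusing the `dequantize_4bit` helper from the sketch above, a standard bf16 checkpoint can be materialized after training (paths again hypothetical):

```python
# Expand the 4-bit payloads back to bf16 so downstream tooling that expects
# bf16 checkpoints can consume the result.
quantized_sd = torch.load("model-4bit.pt")
bf16_sd = {k: dequantize_4bit(v) for k, v in quantized_sd.items()}
torch.save(bf16_sd, "model-bf16.pt")  # hypothetical path
```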
This also has the added benefit of lowering peak memory during model initialization, since we never need to allocate bf16 tensors for the quantized portions of the model.
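One way to realize that init-time saving is to construct the model on the meta device and assign the pre-quantized payloads directly. The `Int4Linear` module here is hypothetical (its on-the-fly dequantizing `forward` is elided), and the snippet reuses the file saved in the first sketch:

```python
import torch
import torch.nn as nn

class Int4Linear(nn.Module):
    """Hypothetical linear layer that holds only the packed 4-bit payload.
    It never owns a bf16 copy of the weight; dequantization would happen
    on the fly in forward() (elided here)."""
    def __init__(self, out_features: int, in_features: int):
        super().__init__()
        n_blocks = out_features * in_features // 64
        # Registered as buffers so load_state_dict can populate them.
        self.register_buffer("packed", torch.empty(n_blocks, 32, dtype=torch.uint8))
        self.register_buffer("scales", torch.empty(n_blocks, 1, dtype=torch.float32))

# Construct on the meta device: no real storage is allocated at init, so peak
# memory never includes a bf16 copy of the quantized weights.
with torch.device("meta"):
    model = nn.Sequential(Int4Linear(256, 256))

# Load the pre-quantized payload straight into the meta-initialized model.
# assign=True swaps the meta buffers for the loaded tensors instead of
# copying into (nonexistent) meta storage.
quantized_sd = torch.load("model-4bit.pt")  # from the earlier sketch
flat_sd = {"0.packed": quantized_sd["layer.weight"]["packed"],
           "0.scales": quantized_sd["layer.weight"]["scales"]}
model.load_state_dict(flat_sd, assign=True)
```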