But the output seems to be quant_storage_dtype = torch.bfloat16.
When I then try to merge the LoRA weights with the base model later:
from peft import AutoPeftModelForCausalLM
import torch

# Load the PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    '/my-checkpoint-40',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge the LoRA adapters into the base model
merged_model = model.merge_and_unload()

# Double-check whether quantization is still effective:
# print the dtype and shape of every parameter
for name, param in merged_model.named_parameters():
    print(name, param.dtype, param.shape)
Every layer is reported with quant_storage_dtype = torch.bfloat16.
The safetensors files of the fine-tuned model add up to 118 GB, while the base llama3-70b checkpoint is 127 GB, so only a ~7% reduction for the fine-tuned model.
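As a sanity check on those sizes, I did a back-of-the-envelope calculation (a sketch; the 70B parameter count is rounded, and real checkpoints carry extra metadata):

```python
# Rough on-disk sizes for a ~70B-parameter model at different precisions.
n_params = 70e9

fp16_gb = n_params * 2 / 1e9    # 16-bit floats: 2 bytes per weight
int4_gb = n_params * 0.5 / 1e9  # 4-bit weights: half a byte per weight

print(f"fp16/bf16: ~{fp16_gb:.0f} GB")  # ~140 GB
print(f"4-bit:     ~{int4_gb:.0f} GB")  # ~35 GB
```

118 GB is in the half-precision ballpark and nowhere near the ~35 GB a genuinely 4-bit checkpoint would be, which makes me suspect the merged model was saved dequantized.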
Maybe a weird question, but: is this model quantized? Is it semi-quantized? Should I quantize it further to reduce the size even more? (I need a smaller model because of the GPUs I have.)
The quant_storage_dtype = torch.bfloat16 confuses me a bit.
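One thing I'm considering for shrinking the artifact: reload the merged checkpoint with 4-bit quantization and save that (an untested sketch; "merged-model" and "merged-model-4bit" are placeholder paths, and saving 4-bit weights requires a recent bitsandbytes):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the merged fp16 weights to 4-bit NF4 on load.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "merged-model",                  # placeholder: path to the saved merge
    quantization_config=bnb_config,
    device_map="auto",
)

# Serialize the quantized weights, which should be roughly a quarter
# of the fp16 size.
model.save_pretrained("merged-model-4bit")  # placeholder output path
```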
For context: I'm fine-tuning llama3 70b on AWS GPUs, using BitsAndBytesConfig to quantize the model weights and load them in 4-bit.
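From what I can tell, bnb_4bit_quant_storage only sets the container dtype that the packed 4-bit values are stored in (so FSDP can shard them like ordinary bfloat16 tensors); it shouldn't mean the weights are bfloat16 precision. A config fragment along these lines (parameter values here are my assumptions, not the exact ones from my training script):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bit in memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_quant_storage=torch.bfloat16,  # container dtype for the packed 4-bit data
)
```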