Open SahilCarterr opened 3 months ago
Hi @SahilCarterr
Quantization actually slows down inference. For a better speed-up I would suggest using mixed precision training and the 16-bit version of the model for inference. It is better to use bitsandbytes quantization, since GPTQ quantization does not yet support merging the adapter weights into the base model when fine-tuning with QLoRA.
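For reference, a minimal sketch of loading a base model with bitsandbytes 4-bit quantization for QLoRA fine-tuning might look like this; the model id, target modules, and LoRA hyperparameters are placeholders, not the settings used in this repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

# 4-bit NF4 quantization config (QLoRA-style), computing in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach LoRA adapters; r/alpha/target_modules are illustrative values only
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```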
Can you make an example mixed_precision_training.ipynb for the same?
Hi! Right now I will not be able to contribute to the same. Please refer to this link: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one. Mixed precision training is a tradeoff between fast inference and memory optimization.
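As a rough illustration of the linked docs, enabling mixed precision with the Trainer API could look like the sketch below; `model` and `train_dataset` are assumed to come from a setup like the one above, and the output path and hyperparameters are placeholders:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-lora-out",        # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,                           # mixed precision; use fp16=True on GPUs without bf16 support
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # PEFT model from the earlier sketch
    args=training_args,
    train_dataset=train_dataset,         # placeholder tokenized dataset
)
trainer.train()
```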
https://github.com/swastikmaiti/Meta-Llama3-8B-Chat-Instruct-LoRA.git
The inference code in inference.ipynb is taking 3 minutes to run on a Colab L4 GPU. Is there any way to speed up inference? @swastikmaiti
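Following the advice above about using the 16-bit model for inference, one common approach (not taken from inference.ipynb) is to load the base model in fp16, merge the LoRA adapter, and generate with a capped token budget; the base model id, adapter path, and prompt below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder base model id
adapter_id = "path/or/hub-id-of-lora-adapter"     # placeholder adapter path

tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load the base model in 16-bit rather than a quantized format
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Merge the LoRA weights into the base model so the adapter adds no overhead at inference
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()
model.eval()

inputs = tokenizer(
    "Explain mixed precision training in one sentence.",  # placeholder prompt
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```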