Open SahilCarterr opened 3 months ago
Hi @SahilCarterr
Quantization actually slows down inference. For a better speed-up I would suggest using mixed precision training and the 16-bit version of the model for inference. It is better to use bitsandbytes quantization, since GPTQ quantization does not yet support merging the adapter weights into the base model when fine-tuning with QLoRA.
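For reference, a minimal sketch of loading a base model with bitsandbytes 4-bit quantization for QLoRA fine-tuning might look like this; the model id, target modules, and LoRA hyperparameters are placeholders, not the settings used in this repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

# 4-bit NF4 quantization config (QLoRA-style), computing in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach LoRA adapters; r/alpha/target_modules are illustrative values only
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```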
Can you make an example mixed_precision_training.ipynb for the same?
Hi! Right now I will not be able to contribute to the same. Please refer to this link: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one. Mixed precision training is a tradeoff between fast inference and memory optimization.
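As a rough illustration of the linked docs, enabling mixed precision with the Trainer API could look like the sketch below; `model` and `train_dataset` are assumed to come from a setup like the one above, and the output path and hyperparameters are placeholders:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-lora-out",        # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,                           # mixed precision; use fp16=True on GPUs without bf16 support
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # PEFT model from the earlier sketch
    args=training_args,
    train_dataset=train_dataset,         # placeholder tokenized dataset
)
trainer.train()
```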
https://github.com/swastikmaiti/Meta-Llama3-8B-Chat-Instruct-LoRA.git
The inference code in inference.ipynb is taking 3 minutes to run on a Colab L4 GPU. Is there any way to speed up inference? @swastikmaiti
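Following the advice above about using the 16-bit model for inference, one common approach (not taken from inference.ipynb) is to load the base model in fp16, merge the LoRA adapter, and generate with a capped token budget; the base model id, adapter path, and prompt below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder base model id
adapter_id = "path/or/hub-id-of-lora-adapter"     # placeholder adapter path

tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load the base model in 16-bit rather than a quantized format
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Merge the LoRA weights into the base model so the adapter adds no overhead at inference
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()
model.eval()

inputs = tokenizer(
    "Explain mixed precision training in one sentence.",  # placeholder prompt
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```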