I am trying to accelerate inference for Llama 3 8B on an RTX 4090 using quantization. I came across https://github.com/huggingface/optimum-nvidia, which should allow using fp8 and give large speed gains on a 4090.
I installed everything with pip and am using AutoModelForCausalLM:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,
)
```
but when I run this I get:
```
[05/14/2024-13:15:23] No engine file found in /home/philippe/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45, converting and building engines
[05/14/2024-13:15:23] Defined logits dtype to: float32
```
so it somehow falls back to float32.
I also tried with Llama 2, and in that case I get an error message:
```
[05/14/2024-18:49:09] Found pre-built engines at: [PosixPath('/home/philippe/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/engines')]
[...]
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][ERROR] 3: [runtime.cpp::deserializeCudaEngine::77] Error Code 3: API Usage Error (Parameter check failed at: runtime/rt/runtime.cpp::deserializeCudaEngine::77, condition: (blob) != nullptr )
```
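One thing I considered trying for the Llama 2 case is deleting the cached engines so they get rebuilt from scratch, since the deserialize error looks like a stale or corrupt pre-built engine. A minimal sketch, using the snapshot path from my log above (adjust for your own cache):

```python
# Sketch: remove the cached TensorRT engines so the next from_pretrained()
# call converts and rebuilds them instead of deserializing a stale blob.
# The path below is the one from my log output; adjust it for your machine.
import shutil
from pathlib import Path

engines = (
    Path.home()
    / ".cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf"
    / "snapshots/f5db02db724555f92da89c216ac04704f23d4590/engines"
)
if engines.is_dir():
    shutil.rmtree(engines)  # the engines are rebuilt on the next model load
```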
I also have:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```
and `torch.version.cuda` gives me 12.1.
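For completeness, this is how I sanity-check the GPU side (a minimal sketch; as far as I understand, fp8 needs compute capability 8.9 or newer, i.e. Ada or Hopper, and a 4090 should report 8.9):

```python
# Sketch: print the CUDA version PyTorch was built against, and check whether
# the GPU reports a compute capability that supports fp8 (>= 8.9).
import torch

print("torch built against CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")   # an RTX 4090 reports 8.9
    print("fp8-capable:", (major, minor) >= (8, 9))
```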
Does anyone have an idea how to make fp8 work? What am I missing here?
Thanks a lot!