I am trying to accelerate inference for Llama 3 8B on an RTX 4090 using quantization. I came across https://github.com/huggingface/optimum-nvidia, which should allow using fp8 and give large speed gains on a 4090.
I installed everything with pip and am using AutoModelForCausalLM:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_fp8=True,
)
```
but when I run this I get:
```
[05/14/2024-13:15:23] No engine file found in /home/philippe/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45, converting and building engines
[05/14/2024-13:15:23] Defined logits dtype to: float32
```
so it somehow falls back to float32.
I also tried with Llama 2, and in that case I get an error message:
```
[05/14/2024-18:49:09] Found pre-built engines at: [PosixPath('/home/philippe/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/engines')]
[...]
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][ERROR] 3: [runtime.cpp::deserializeCudaEngine::77] Error Code 3: API Usage Error (Parameter check failed at: runtime/rt/runtime.cpp::deserializeCudaEngine::77, condition: (blob) != nullptr )
```
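One thing I considered trying for the Llama 2 case is deleting the cached engines so they get rebuilt from scratch, since the deserialize error looks like a stale or corrupt pre-built engine. A minimal sketch, using the snapshot path from my log above (adjust for your own cache):

```python
# Sketch: remove the cached TensorRT engines so the next from_pretrained()
# call converts and rebuilds them instead of deserializing a stale blob.
# The path below is the one from my log output; adjust it for your machine.
import shutil
from pathlib import Path

engines = (
    Path.home()
    / ".cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf"
    / "snapshots/f5db02db724555f92da89c216ac04704f23d4590/engines"
)
if engines.is_dir():
    shutil.rmtree(engines)  # the engines are rebuilt on the next model load
```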
I also have:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```
and `torch.version.cuda` gives me 12.1.
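For completeness, this is how I sanity-check the GPU side (a minimal sketch; as far as I understand, fp8 needs compute capability 8.9 or newer, i.e. Ada or Hopper, and a 4090 should report 8.9):

```python
# Sketch: print the CUDA version PyTorch was built against, and check whether
# the GPU reports a compute capability that supports fp8 (>= 8.9).
import torch

print("torch built against CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")   # an RTX 4090 reports 8.9
    print("fp8-capable:", (major, minor) >= (8, 9))
```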
Does anyone have an idea how to make fp8 work? What am I missing here?
Thanks a lot!