Closed: jamesoneill12 closed this issue 7 months ago
Hi there, you're right that it's the same base model. The code on Hugging Face loads the model with dtype = torch.bfloat16,
which can speed up computation and reduce memory usage compared to the more common 32-bit format (torch.float32). This is likely why you're seeing faster inference with this code. Please let us know if this does not help!
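For reference, a minimal sketch of what that difference looks like when loading the checkpoint with transformers (the load settings below are illustrative assumptions, not the exact repo code):

```python
# Minimal sketch, not the exact repo code: loading the Hugging Face checkpoint
# in bfloat16 versus float32 to compare speed/memory. Requires transformers
# (and accelerate for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16: half the weight memory, runs on A100 tensor cores
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# float32 baseline for comparison
model_fp32 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32, device_map="auto"
)
```

On A100-class GPUs the bfloat16 weights take half the memory and use the tensor cores, which is usually where most of the latency gap against a float32 load comes from.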
Hi, can someone tell me what's the difference between this repo and the model on Hugging Face?
The Hugging Face meta-llama/LlamaGuard-7b model seems to be very fast at inference: ~0.08-0.10 seconds for a single sample on an A100 80GB GPU with roughly ~300 input tokens and a max generation length of 100 tokens. However, its base model Llama-2-7b isn't this fast, so I'm wondering whether any tricks were used to increase the speed, or what I'm missing from the Hugging Face repo here?
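For context, here is a rough sketch of how a number like that could be measured; the prompt string and generation settings below are placeholders matching the figures quoted above, not the actual benchmark:

```python
# Hypothetical timing sketch: single sample, ~300 input tokens,
# max_new_tokens=100, bfloat16 weights on one GPU.
# The prompt string is a stand-in, not the actual moderation prompt.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "User: please classify this message. " * 40  # roughly ~300 tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()  # make sure timing brackets only the generate call
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"single-sample latency: {time.perf_counter() - start:.3f} s")
```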