Closed: jamesoneill12 closed this issue 7 months ago
Hi there, you're right that it's the same base model. The code on Hugging Face loads the model with dtype = torch.bfloat16,
which can speed up computation and reduce memory usage compared to the more common 32-bit format (torch.float32). This is likely why you're seeing faster inference with this code. Please let us know if this does not help!
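For reference, a minimal sketch of what that difference looks like when loading the checkpoint with transformers (the load settings below are illustrative assumptions, not the exact repo code):

```python
# Minimal sketch, not the exact repo code: loading the Hugging Face checkpoint
# in bfloat16 versus float32 to compare speed/memory. Requires transformers
# (and accelerate for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16: half the weight memory, runs on A100 tensor cores
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# float32 baseline for comparison
model_fp32 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32, device_map="auto"
)
```

On A100-class GPUs the bfloat16 weights take half the memory and use the tensor cores, which is usually where most of the latency gap against a float32 load comes from.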
Hi, can someone tell me what's the difference between this repo and the model on Hugging Face?
The Hugging Face meta-llama/LlamaGuard-7b model seems to be very fast at inference: ~0.08-0.10 seconds for a single sample on an A100 80GB GPU with roughly ~300 input tokens and a max generation length of 100 tokens. However, its base model Llama-2-7b isn't this fast, so I'm wondering whether any tricks were used to increase the speed, or what I'm missing from the Hugging Face repo here?
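For context, here is a rough sketch of how a number like that could be measured; the prompt string and generation settings below are placeholders matching the figures quoted above, not the actual benchmark:

```python
# Hypothetical timing sketch: single sample, ~300 input tokens,
# max_new_tokens=100, bfloat16 weights on one GPU.
# The prompt string is a stand-in, not the actual moderation prompt.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "User: please classify this message. " * 40  # roughly ~300 tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()  # make sure timing brackets only the generate call
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"single-sample latency: {time.perf_counter() - start:.3f} s")
```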