🤗🦙Welcome! This repository contains minimal recipes to get started quickly with Llama 3.x models, including Llama 3.1 and Llama 3.2.
This repository is a work in progress, so you might see considerable changes in the coming days.
[!NOTE] To use Llama 3.x, you need to accept the license and request permission to access the models. Please visit the Hugging Face repos and submit your request. You only need to do this once per collection; you'll get access to all the repos in the collection if your request is approved.
The easiest way to quickly run a Llama 🦙 on your machine is with the 🤗 transformers library. Make sure you have the latest release installed.
$ pip install -U transformers
Let's chat with an instruction-tuned model.
import torch
from transformers import pipeline

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

llama_31 = "meta-llama/Llama-3.1-8B-Instruct" # <-- llama 3.1
llama_32 = "meta-llama/Llama-3.2-3B-Instruct" # <-- llama 3.2

# Chat-style prompt: a list of role/content messages.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

generator = pipeline(model=llama_32, device=device, torch_dtype=torch.bfloat16)
generation = generator(
    prompt,
    do_sample=False,  # greedy decoding
    temperature=1.0,  # neutral value; not used when do_sample=False
    top_p=1,  # neutral value; not used when do_sample=False
    max_new_tokens=50
)
print(f"Generation: {generation[0]['generated_text']}")
# Generation:
# [
# {'role': 'system', 'content': 'You are a helpful assistant, that responds as a pirate.'},
# {'role': 'user', 'content': "What's Deep Learning?"},
# {'role': 'assistant', 'content': "Yer lookin' fer a treasure trove o'
# knowledge on Deep Learnin', eh? Alright then, listen close and
# I'll tell ye about it.\n\nDeep Learnin' be a type o' machine
# learnin' that uses neural networks"}
# ]
Would you like to run inference on the Llama models locally? So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:
Model Size | Llama Variant | BF16/FP16 | FP8 | INT4 (AWQ/GPTQ/bnb) |
---|---|---|---|---|
1B | 3.2 | 2.5 GB | 1.25 GB | 0.75 GB |
3B | 3.2 | 6.5 GB | 3.2 GB | 1.75 GB |
8B | 3.1 | 16 GB | 8 GB | 4 GB |
70B | 3.1 | 140 GB | 70 GB | 35 GB |
405B | 3.1 | 810 GB | 405 GB | 204 GB |
[!NOTE] These are estimated values and may vary based on specific implementation details and optimizations.
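The figures in the table roughly follow "number of parameters times bytes per parameter". The sketch below reproduces that arithmetic for the weights alone. It assumes nominal parameter counts (for example, "8B" means 8e9 parameters) and 1 GB = 1e9 bytes, and it ignores activations, the KV cache, and any quantization metadata, so small differences from the table are expected. The function name `estimate_weight_memory_gb` is made up for this illustration.

```python
# Rough weights-only memory estimate: parameters x bytes per parameter.
# Assumption: nominal parameter counts and 1 GB = 1e9 bytes.
BITS_PER_PARAM = {"bf16": 16, "fp16": 16, "fp8": 8, "int4": 4}


def estimate_weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return the approximate memory needed for the weights alone, in GB."""
    bytes_per_param = BITS_PER_PARAM[precision] / 8
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


for size in (8, 70, 405):
    row = {p: estimate_weight_memory_gb(size, p) for p in ("bf16", "fp8", "int4")}
    print(f"{size}B: {row}")
```

Treat the result as a lower bound when choosing hardware: real usage also depends on sequence length, batch size, and the inference framework.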
Working with the capable Llama 3.1 8B models:
Working on the 🐘 big Llama 3.1 405B model:
Running inference is often not enough; many use cases require fine-tuning the model on a custom dataset. Here are some scripts showing how to fine-tune the models.
Fine-tune models on your custom dataset:
Do you want to use the smaller Llama 3.2 models to speed up text generation for bigger models? These notebooks showcase assisted decoding (speculative decoding), which gives you up to 2x speedups for text generation on Llama 3.1 70B (with greedy decoding).
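To make the idea concrete, here is a toy, pure-Python sketch of greedy assisted decoding. It is not the transformers implementation (there, an assistant is passed to `generate` through the `assistant_model` argument). The "models" `toy_target` and `toy_draft` are hypothetical stand-ins: plain functions that map a token sequence to the next token. The draft proposes a few tokens, the target checks them, and the longest matching prefix is kept together with one token from the target.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": sequence -> next token


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(model(seq))
    return seq


def assisted_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> Tuple[List[int], int]:
    """Greedy assisted decoding; returns the sequence and the number of target passes."""
    seq = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(seq) < limit:
        # 1. The cheap draft model proposes up to num_draft_tokens tokens.
        k = min(num_draft_tokens, limit - len(seq))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks all proposed positions. A real model does this
        #    in a single forward pass, so it is counted as one pass here.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)
            if expected != tok:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every proposal matched: the same pass also yields one extra token.
            if len(seq) + len(accepted) < limit:
                accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq, target_passes


def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    return toy_target(seq) if len(seq) % 5 else 0


reference = greedy_decode(toy_target, [1], 20)
assisted, passes = assisted_decode(toy_target, toy_draft, [1], 20)
print(assisted == reference, passes)
```

Because every kept token is the target's own greedy choice, the output is identical to plain greedy decoding; the speedup comes from needing fewer target passes whenever the draft guesses correctly.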
Let's optimize performance, shall we?
Are these models too large for you to run at home? Would you like to experiment with Llama 70B? Try out the following examples!
In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. Learn how to use them in the following notebooks:
With models ever hungrier for data, the need for synthetic data generation is on the rise. Here we show you how to build your very own synthetic dataset.
Seeking an entry-level RAG pipeline? This notebook guides you through building a very simple streamlined RAG experiment using Llama and Hugging Face.
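As a rough illustration of the retrieve-then-prompt pattern behind RAG, the sketch below ranks a few documents against a query and places the best match into a chat prompt. It is not the notebook's code: it assumes simple word-overlap (bag-of-words cosine) scoring instead of an embedding model, and the helper names `retrieve` and `build_messages` are invented for this example. The resulting messages use the same role/content format as the chat prompt shown earlier.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:top_k]


def build_messages(query: str, documents: List[str], top_k: int = 2) -> List[Dict[str, str]]:
    """Put the retrieved context into the system message of a chat prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return [
        {"role": "system", "content": "Answer using only the context below.\n" + context},
        {"role": "user", "content": query},
    ]


docs = [
    "Llama 3.2 includes 1B and 3B text models.",
    "Text Generation Inference serves models over HTTP.",
    "Assisted decoding uses a small draft model to speed up generation.",
]
messages = build_messages("Which sizes does Llama 3.2 include?", docs, top_k=1)
print(messages[0]["content"])
```

In a real pipeline, the scoring step would typically be replaced by an embedding model and a vector index, while the prompt-building step stays much the same.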
The Text Generation Inference (TGI) framework enables efficient and scalable deployment of Llama models. In this notebook we'll learn how to integrate TGI for fast text generation and to consume already deployed Llama models via the Inference API:
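For orientation, here is a minimal sketch of how a request to a TGI server's `/generate` route can be assembled with only the standard library. It assumes a server is already running at the hypothetical address `http://localhost:8080`; the helper name `build_tgi_request` is invented for this example, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_tgi_request(base_url: str, prompt: str, max_new_tokens: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local endpoint; replace with the address of your own deployment.
tgi_request = build_tgi_request("http://localhost:8080", "What's Deep Learning?")
print(tgi_request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(tgi_request) as response:
#     print(json.load(response)["generated_text"])
```

Client libraries wrap this same call with extras such as streaming and authentication, so a raw request like this is mostly useful for checking that a deployment responds.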
Would you like to build a chatbot with Llama models? Here's a simple example to get you started.
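At its core, a chatbot keeps a running list of messages and feeds the whole history back to the model on every turn. The sketch below shows only that bookkeeping, under the assumption that the model is wrapped in a function taking the message list and returning the reply text. `ChatSession` and `echo_backend` are hypothetical names used for illustration; the echo backend stands in for a real model so the example runs anywhere.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Keeps the conversation history and delegates generation to a backend."""

    def __init__(self, generate: Callable[[List[Message]], str], system_prompt: str):
        self.generate = generate
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]

    def send(self, user_text: str) -> str:
        # Append the user turn, ask the backend for a reply, and store it too.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.generate(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: List[Message]) -> str:
    # Stand-in for a real model: repeats the latest user message.
    return f"You said: {messages[-1]['content']}"


chat = ChatSession(echo_backend, "You are a helpful assistant.")
first = chat.send("Hello")
second = chat.send("How are you?")
print(first)
print(second)
```

To use a Llama model instead, the backend could wrap the `generator` pipeline from the quickstart, for example by returning `generator(messages, max_new_tokens=50)[0]["generated_text"][-1]["content"]`, which picks the assistant turn out of the output format shown there.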