[Buy hardware](https://tenstorrent.com/cards/) | [Install](./INSTALLING.md) | [Discord](https://discord.gg/tvhGzHQwaj) | [Join Us](https://boards.greenhouse.io/tenstorrent/jobs/4155609007)

**TT-NN** is a Python & C++ Neural Network OP library.

[API Reference](https://docs.tenstorrent.com/ttnn/latest/index.html) | [Model Demos](./models/demos/)

LLMs

| Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | Release |
|---|---|---|---|---|---|---|---|
| Falcon7B-decode | 32 | e150 | | 4.2 | 4.4 | 134.4 | |
| Falcon7B | 32 | n150 | 71 | 17.6 | 26 | 563.2 | v0.53.0-rc44 |
| Mistral-7B | 32 | n150 | | 9.9 | 25 | 316.8 | v0.51.0-rc28 |
| Mamba-2.8B | 32 | n150 | 48 | 12.3 | 41 | 393.6 | v0.51.0-rc26 |
| LLaMA-3.1-8B | 1 | n150 | 209 | 23.7 | 23 | 23.7 | v0.53.0-rc44 |
| LLaMA-3.2-1B | 1 | n150 | 72 | 86.4 | 160 | 86.4 | v0.53.0-rc44 |
| LLaMA-3.2-3B | 1 | n150 | 123 | 44.7 | 60 | 44.7 | v0.53.0-rc44 |
| Falcon7B (DP=8) | 256 | QuietBox | 97 | 14.6 | 26 | 3737.6 | v0.53.0-rc44 |
| LLaMA-3.1-70B (TP=8) | 32 | QuietBox | 190 | 15.1 | 20 | 483.2 | v0.53.0-rc36 |
| Falcon40B (TP=8) | 32 | QuietBox | | 5.3 | 36 | 169.6 | v0.53.0-rc39 |
| Mixtral7Bx8 (TP=8) | 32 | QuietBox | 230 | 14.6 | 33 | 467.2 | v0.53.0-rc44 |
| Falcon7B (DP=32) | 1024 | Galaxy | 242 | 4.4 | 26 | 4505.6 | v0.53.0-rc33 |
| LLaMA-3.1-70B (DP=4, TP=8) | 128 | Galaxy | 190 | 14.3 | 20 | 1835.5 | v0.52.0-rc31 |

Last Update: November 18, 2024

Notes:

TP = Tensor Parallel, DP = Data Parallel; these define the parallelization factors across multiple devices.

The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).

The reported t/s/u is the throughput of the first token generated after prefill, i.e. 1 / inter-token latency.
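As a rough illustration of how the LLM table columns relate, the sketch below converts t/s/u (read here as tokens per second per user) into inter-token latency using the 1 / inter-token latency relation from the note above. It also assumes that t/s is t/s/u multiplied by the batch size. That second relation is inferred from the table rows (for example, 17.6 × 32 = 563.2 for Falcon7B on n150) and is not stated in the notes. The function names are hypothetical helpers and are not part of TT-NN.

```python
def inter_token_latency_ms(tokens_per_sec_per_user: float) -> float:
    """Invert t/s/u to get the time between tokens, in milliseconds."""
    if tokens_per_sec_per_user <= 0:
        raise ValueError("t/s/u must be positive")
    return 1000.0 / tokens_per_sec_per_user


def tokens_per_second(tokens_per_sec_per_user: float, batch: int) -> float:
    """Aggregate throughput, ASSUMING t/s = t/s/u * batch (inferred from the table)."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return tokens_per_sec_per_user * batch


if __name__ == "__main__":
    # Falcon7B on n150: batch 32, 17.6 t/s/u (values taken from the table).
    print(f"inter-token latency: {inter_token_latency_ms(17.6):.1f} ms")
    print(f"aggregate throughput: {tokens_per_second(17.6, 32):.1f} t/s")
```

Not every row matches this product exactly: for LLaMA-3.1-70B on Galaxy, 14.3 × 128 gives 1830.4 against a reported 1835.5, possibly because the displayed t/s/u is rounded.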

CNNs

| Model | Batch | Hardware | fps | Target fps |
|---|---|---|---|---|
| ResNet-50 (224x224) | 20 | e150 | 5,100 | 10,000 |
| ResNet-50 (224x224) | 16 | n150 | 4,670 | 7,000 |
| ResNet-50 (224x224) (DP=2) | 32 | n300 | 8,200 | 14,000 |
| ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 32,250 | 56,000 |
| ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 95,900 | 224,000 |
| ResNet-50 (224x224) (DP=64) | 1024 | Two Galaxies | 145,000 | 448,000 |
| ViT (224x224) | 9 | e150 | 1,360 | 2,000 |
| ViT (224x224) | 8 | n150 | 912 | 1,600 |
| Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.167 | 0.3 |
| Yolo V4 (320x320) | 1 | n150 | 95 | 300 |
| Segformer Semantic Segmentation (512x512) | 1 | n150 | 90 | 300 |

NLPs

| Model | Batch | Hardware | sen/sec | Target sen/sec |
|---|---|---|---|---|
| BERT-Large | 12 | e150 | 370 | 410 |
| BERT-Large | 8 | n150 | 270 | 400 |
| T5 small | | e150 | 140 | |
| Bloom | | e150 | 70 | |

Model Updates

For the latest model updates and features, please see MODEL_UPDATES.md

TT-NN Tech Reports

Advanced Performance Optimizations for Models (updated Oct 24th)
Programming Mesh of Devices (updated Sept 9th)
ViT Implementation in TT-NN on GS (updated Sept 22nd)
LLMs Bring up in TT-NN (updated Oct 29th)
YOLOv4 Implementation in TT-NN on WH (updated November 8th)

Benchmarks

Matrix Multiply FLOPS on WH (updated November 13th)

**TT-Metalium** is our low-level programming model, enabling kernel development for Tenstorrent hardware.

[Programming Guide](./METALIUM_GUIDE.md) | [API Reference](https://docs.tenstorrent.com/tt-metalium/latest/tt_metal/apis/index.html)

Getting started

Get started with simple kernels.

TT-Metalium Tech Reports

Matrix Engine (updated Sept 6th)
Data Formats (updated Sept 7th)
Reconfiguring Data Formats (updated Oct 17th)
Handling special floating-point numbers (updated Oct 5th)
Allocator (updated Oct 30th)
Tensor Layouts (updated Sept 6th)
Saturating DRAM Bandwidth (updated Sept 6th)
Flash Attention on Wormhole (updated Sept 6th)
CNNs on TT Architectures (updated Sept 6th)
Ethernet and Multichip Basics (updated Sept 20th)
Collective Communication Library (CCL) (updated Sept 20th)
Blackhole Bring-Up Programming Guide (updated Oct 30th)

TT-Metalium Programming Examples

Hello World

Hello World! Compute Kernel
Hello World! Data Movement Kernel
Add Integers

Add 2 Integers in Baby RiscV
Add 2 Integers in Compute Kernel

Simple Tensor Manipulation

Sharding
Padding

DRAM Data Movement

DRAM Loopback Data Movement

Eltwise

Eltwise Unary OP in Vector Engine (SFPU)
Eltwise Binary OP in Matrix Engine (FPU)

Matmul

Matmul OP on a Single_core
Matmul OP on Multi_core (Basic)
Matmul Multi_core Reuse (Optimized)
Matmul Multi_core Multi-Cast (Optimized)
