Nimble is a deep learning execution engine that accelerates model inference and training by running GPU tasks (i.e., GPU kernels and memory operations) in parallel with minimal scheduling overhead.
Given a PyTorch DL model, Nimble automatically generates a GPU task schedule, which employs an optimal parallelization strategy for the model.
The schedule is wrapped in a Nimble object and can be seamlessly applied to PyTorch programs.
Nimble speeds up inference and training by up to 22.34× and 3.61× over PyTorch, respectively. Moreover, Nimble outperforms TensorRT by up to 2.81× in inference.
Training performance comparison on an NVIDIA V100 GPU (batch sizes 32, 64, and 128).
This version of Nimble is built on top of PyTorch v1.7.1 with CUDA 11.0. If you want the old version of Nimble used for the experiments in the paper, please check out the `main_pytorch_v1.4.1` branch.
Please refer to the installation instructions to install Nimble from source.
Nimble supports both inference and training of neural networks.
For inference, prepare the Nimble object with `training=False`:

```python
import torch
import torchvision

# Instantiate a PyTorch Module and move it to a GPU
model = torchvision.models.resnet50()
model = model.cuda()
model.eval()

# Prepare a dummy input
input_shape = [1, 3, 224, 224]
dummy_input = torch.randn(*input_shape).cuda()

# Create a Nimble object
nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=False)

# Execute the object
rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)
```
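To check the speedup on your own hardware, you can time the original module against the prepared Nimble object. The helper below is a minimal sketch using PyTorch's standard CUDA event API; `measure_gpu_ms` and its warm-up/iteration counts are illustrative choices, not part of Nimble's API.

```python
def measure_gpu_ms(fn, inp, warmup=10, iters=100):
    """Average GPU time per call in milliseconds."""
    with torch.no_grad():
        # Warm up to exclude one-time initialization costs
        for _ in range(warmup):
            fn(inp)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(inp)
        end.record()
        torch.cuda.synchronize()  # wait until all timed work finishes
    return start.elapsed_time(end) / iters

print("PyTorch:", measure_gpu_ms(model, rand_input), "ms")
print("Nimble :", measure_gpu_ms(nimble_model, rand_input), "ms")
```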
For training, pass `training=True` to `prepare` and use the Nimble object in an ordinary PyTorch training step:

```python
import torch
import torchvision

BATCH = 32

# Instantiate a PyTorch Module and move it to a GPU
model = torchvision.models.resnet50(num_classes=10)
model = model.cuda()
model.train()

# Define a loss function and an optimizer
loss_fn = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Prepare a dummy input
input_shape = [BATCH, 3, 32, 32]
dummy_input = torch.randn(*input_shape).cuda()

# Create a Nimble object
nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=True)

# Execute the forward pass
rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)

# Compute loss
label = torch.zeros(BATCH, dtype=torch.long).cuda()
loss = loss_fn(output, label)

# Execute the backward pass
loss.backward()

# Perform an optimization step
optimizer.step()
```
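The single step above extends naturally to a loop. The sketch below continues the snippet, substituting random tensors for a real data loader; since `prepare` captures a fixed GPU task schedule, we assume every batch keeps the shape of the dummy input.

```python
for step in range(100):
    # Random stand-ins for real training data (shape must match dummy_input)
    rand_input = torch.rand(*input_shape).cuda()
    label = torch.randint(0, 10, (BATCH,)).cuda()

    optimizer.zero_grad()              # clear gradients from the previous step
    output = nimble_model(rand_input)  # forward pass through the Nimble schedule
    loss = loss_fn(output, label)
    loss.backward()                    # backward pass
    optimizer.step()                   # parameter update
```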
Please refer to the evaluation instructions to reproduce the evaluation results.
Woosuk Kwon\*, Gyeong-In Yu\*, Eunji Jeong, and Byung-Gon Chun (\* equal contribution), Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning, 34th Conference on Neural Information Processing Systems (NeurIPS), Spotlight, December 2020.
```
@inproceedings{kwon2020nimble,
  title={Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning},
  author={Kwon, Woosuk and Yu, Gyeong-In and Jeong, Eunji and Chun, Byung-Gon},
  booktitle={NeurIPS},
  year={2020}
}
```
Create an issue for questions and bug reports.
We welcome your contributions to Nimble! We aim to build an open-source project driven by the open-source community. For general discussions about development, please subscribe to nimble-discuss@googlegroups.com.