huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Training efficiency compared to PyTorch #1383

Open zhoubin-me opened 10 months ago

zhoubin-me commented 10 months ago

I ran the training example `cargo run --example mnist-training --features="cuda" cnn` and measured the duration of each epoch, and got:

    Finished dev [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/examples/mnist-training cnn`
train-images: [60000, 784]
train-labels: [60000]
test-images: [10000, 784]
test-labels: [10000]
==>>Cuda(CudaDevice(DeviceId(1)))
   1 train loss  0.64516 test acc: 91.65% duration 7.579457078s
   2 train loss  0.40602 test acc: 94.74% duration 7.019699615s
   3 train loss  0.44834 test acc: 92.94% duration 7.448594919s

Compared to this PyTorch script with the same model:


import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import time

batch_size = 64
num_classes = 10
learning_rate = 0.001
num_epochs = 10

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Note: the test-set normalization stats below (0.1325/0.3105) differ
# slightly from the train-set stats (0.1307/0.3081).
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.1307,), std=(0.3081,))
]), download=True)

test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.1325,), std=(0.3105,))
]), download=True)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

class LeNet5(nn.Module):
    def __init__(self, num_classes):
        super(LeNet5, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=0),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=0),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc1 = nn.Linear(1024, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        out = self.dropout(out)  # note: dropout is applied to the final logits here
        return out

model = LeNet5(num_classes).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

total_step = len(train_loader)
for epoch in range(num_epochs):
    now = time.time()
    model.train()
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    duration = time.time() - now
    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))
    print(f"epoch {epoch}, duration {duration}")

I got the following:

Test Accuracy of the model on the 10000 test images: 98.1 %
epoch 0, duration 6.1264801025390625
Test Accuracy of the model on the 10000 test images: 98.62 %
epoch 1, duration 6.032856702804565
Test Accuracy of the model on the 10000 test images: 98.57 %
epoch 2, duration 5.995452880859375
Test Accuracy of the model on the 10000 test images: 98.28 %
epoch 3, duration 6.470236539840698
Test Accuracy of the model on the 10000 test images: 98.11 %
epoch 4, duration 5.9531824588775635
Test Accuracy of the model on the 10000 test images: 98.97 %
epoch 5, duration 5.9938647747039795

This means that neither the accuracy nor the runtime is decent compared to PyTorch. In addition, PyTorch only uses 20-25% GPU utilization, while candle uses 80-90%.

Any clues for improvement?

howard0su commented 10 months ago

Notice you are using a debug build. Add `--release` to your command line, i.e. `cargo run --release --example mnist-training --features="cuda" cnn`, and try again.

howard0su commented 10 months ago

Also try a smaller learning rate, like 1e-4; the default value is 1e-3.
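
A rough sketch of what that could look like when constructing the optimizer with candle-nn, assuming (as in the example) the trainable variables live in a `VarMap`; the `build_optimizer` helper here is just for illustration:

use candle_core::Result;
use candle_nn::{AdamW, Optimizer, ParamsAdamW, VarMap};

// Build AdamW over all trainable variables, with the learning rate
// lowered from the 1e-3 default to 1e-4 while keeping the other defaults.
fn build_optimizer(varmap: &VarMap) -> Result<AdamW> {
    let params = ParamsAdamW {
        lr: 1e-4,
        ..Default::default()
    };
    AdamW::new(varmap.all_vars(), params)
}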

zhoubin-me commented 10 months ago

> Also try a smaller learning rate, like 1e-4; the default value is 1e-3.

Thanks @howard0su, running in release mode does improve things quite a lot:

    Finished release [optimized] target(s) in 1.24s
     Running `target/release/examples/mnist-training cnn`
train-images: [60000, 784]
train-labels: [60000]
test-images: [10000, 784]
test-labels: [10000]
   1 train loss  0.48736 test acc: 95.24%, duration 3.841444833s
   2 train loss  0.21030 test acc: 96.69%, duration 3.068839951s
   3 train loss  0.18876 test acc: 96.20%, duration 3.133457776s
   4 train loss  0.17360 test acc: 96.54%, duration 3.069877785s
   5 train loss  0.16307 test acc: 96.64%, duration 3.127763016s
   6 train loss  0.16272 test acc: 97.08%, duration 3.082288287s
   7 train loss  0.15377 test acc: 96.73%, duration 3.131595679s

The test accuracy difference may be due to data preprocessing. However, the GPU utilization is still much higher than PyTorch's.
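
On the preprocessing point: the PyTorch script normalizes its inputs (mean 0.1307, std 0.3081 on the training set), so if the candle example feeds raw [0, 1]-scaled pixels, matching that normalization might close part of the gap. A rough sketch on the candle side (the `normalize` helper is hypothetical, not part of the example):

use candle_core::{Result, Tensor};

// Apply the PyTorch script's train-set normalization,
// (x - 0.1307) / 0.3081, to images already scaled to [0, 1].
fn normalize(images: &Tensor) -> Result<Tensor> {
    // (x - mean) / std  ==  x * (1 / std) + (-mean / std)
    images.affine(1.0 / 0.3081, -0.1307 / 0.3081)
}

Applying this to both the train and test images would let the two runs see the same inputs (up to the slightly different test-set stats in the script above).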