mlverse / torch

R Interface to Torch
https://torch.mlverse.org

Slow train loop compared to Python #694

Open mohamed-180 opened 2 years ago

mohamed-180 commented 2 years ago

Here is a trivial example of approximating a trigonometric function (sin) in Python:

## Imports
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
from tqdm import tqdm

## Data
x = torch.linspace(-6,6,1000).reshape(-1,1)
y = torch.sin(x)
ds = TensorDataset(x,y)
dls = DataLoader(ds, batch_size=64, shuffle=True)

## Model
model = nn.Sequential(nn.Linear(1, 16), nn.Sigmoid(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr = .01)

## Train Step
def train_step(model, dls):
    epoch_loss = 0
    for xtrn , ytrn in dls:
        loss = F.mse_loss(model(xtrn), ytrn)
        opt.zero_grad()
        loss.backward()
        opt.step()

        epoch_loss += loss.item()
    return epoch_loss

## Train Model
l = 0
for epoch in tqdm(range(100)):
    l += train_step(model, dls)
print(l/100)

and the same implementation in R:

library(torch)

x <- torch_linspace(-6,6,1000)$reshape(c(-1,1))
y <- torch_sin(x)

ds <- tensor_dataset(x,y)
dls <- dataloader(ds, 64L, shuffle = TRUE)

model <- nn_sequential(nn_linear(1,16), nn_sigmoid(), nn_linear(16,1))
opt <- optim_adam(model$parameters, lr=.01)

for (epoch in 1:100) {
  l <- 0
  coro::loop(for (b in dls) {
    loss <- nnf_mse_loss(model(b[[1]]), b[[2]])
    opt$zero_grad()
    loss$backward()
    opt$step()
    l <- l + loss$item()
  })
}

The difference in time is considerable, and I don't know why!
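For anyone who wants to reproduce the measurement, one way is to wrap the loop in a function and time it with base R's system.time() (a minimal sketch; timed_run is a hypothetical name, and the definitions above are assumed to be in scope):

timed_run <- function() {  # hypothetical helper, not part of the original code
  for (epoch in 1:100) {
    coro::loop(for (b in dls) {
      loss <- nnf_mse_loss(model(b[[1]]), b[[2]])
      opt$zero_grad()
      loss$backward()
      opt$step()
    })
  }
}
system.time(timed_run())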

[Screenshot (2021-09-24): timing comparison between the Python and R runs]
mohamed-180 commented 2 years ago

After dropping the dataloader and, as a workaround, using manual shuffling and batching:

library(torch)

x <- torch_linspace(-6,6,1000)$reshape(c(-1,1))
y <- torch_sin(x)
model <- nn_sequential(nn_linear(1,16), nn_sigmoid(), nn_linear(16,1))
opt <- optim_adam(model$parameters, lr=.01)
#--------------------------
#  shuffling and batching 👇 
#--------------------------
ind <- torch_randperm(length(x)) + 1L           # R indexing starts at 1, not 0
ind <- split(ind, ceiling(seq_along(ind) / 64)) # split into batches of 64

for (epoch in 1:100) {
  l <- 0
  for (b in ind) {
    loss <- nnf_mse_loss(model(x[b]), y[b])
    opt$zero_grad()
    loss$backward()
    opt$step()
    l <- l + loss$item()
  }
}

and it runs in half the time 😮

[Screenshot (2021-09-24): timing of the manual-batching run]
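One caveat with this workaround: the permutation is computed once, whereas dataloader(..., shuffle = TRUE) reshuffles every epoch. A closer, still dataloader-free sketch (my variant, assuming the definitions above) moves the shuffling inside the epoch loop:

for (epoch in 1:100) {
  # Recompute the permutation each epoch to match shuffle = TRUE.
  # sample.int() returns a plain 1-based R integer permutation.
  ind <- split(sample.int(length(x)), ceiling(seq_len(length(x)) / 64))
  for (b in ind) {
    loss <- nnf_mse_loss(model(x[b]), y[b])
    opt$zero_grad()
    loss$backward()
    opt$step()
  }
}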
dfalbel commented 2 years ago

Yes, this is still somewhat expected. We still need to make performance improvements on the R side, especially related to data loading and the optimizer code.

However, small examples are likely to show larger differences, because the code is probably spending more time in R/Python than in the efficient C++ libtorch code that both the Python and R packages share.
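A quick way to see this effect (a hypothetical illustration, not code from this issue): compare many small tensor operations against one big one. The per-call R overhead dominates only in the small case.

library(torch)

a_small <- torch_randn(10, 10)
a_large <- torch_randn(1000, 1000)

# 10,000 tiny matrix multiplies: most of the time is R-level
# dispatch overhead, not the C++ kernel.
system.time(for (i in 1:10000) torch_mm(a_small, a_small))

# One large matrix multiply: the time is spent almost entirely
# inside the C++ libtorch kernel.
system.time(torch_mm(a_large, a_large))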

mohamed-180 commented 2 years ago

That is true; profiling shows the optimizer's step function takes a lot of the time within the training loop 👇

[Screenshot (2021-09-25): profiling output for the training loop]
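For anyone who wants to reproduce this kind of profile, a minimal sketch (assuming the profvis package is installed and the definitions from the first comment are in scope):

library(profvis)

profvis({
  for (epoch in 1:10) {  # a few epochs are enough for a profile
    coro::loop(for (b in dls) {
      loss <- nnf_mse_loss(model(b[[1]]), b[[2]])
      opt$zero_grad()
      loss$backward()
      opt$step()
    })
  }
})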
cgorac commented 1 year ago

I'm new to Torch for R and haven't found a reference to a mailing list where it might be more appropriate to discuss this, so I'll leave my comment here.

I'm also rather surprised at how slow training is in Torch for R compared to Python. I have the following hello-world MNIST code in Python:

import torch

from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

DATADIR = "data"
BATCH_SIZE = 64
EPOCHS = 5

class Mnist(nn.Module):
    def __init__(self):
        super(Mnist, self).__init__()
        self.flatten = nn.Flatten()
        self.sequential = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.sequential(x)
        return logits

def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0, 0
    num_batches = 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)

            pred = model(X)
            loss = loss_fn(pred, y).item()

            test_loss += loss
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            num_batches += 1
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

if __name__ == "__main__":
    torch.manual_seed(1)

    device = "cuda" if torch.cuda.is_available() else "cpu"

    train_data = datasets.MNIST(
        root=DATADIR,
        train=True,
        download=True,
        transform=ToTensor()
    )

    test_data = datasets.MNIST(
        root=DATADIR,
        train=False,
        download=True,
        transform=ToTensor()
    )

    train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE)
    test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE)

    model = Mnist().to(device)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.RMSprop(model.parameters())

    for t in range(EPOCHS):
        print(f"Epoch {t+1}")
        print("-------------------------------")
        train(train_dataloader, model, loss_fn, optimizer, device)
        test(test_dataloader, model, loss_fn, device)
        print("Done!")

and then what I think is an equivalent in R:

library(magrittr)
library(torch)
library(luz)
library(torchvision)

DATADIR <- "data"
BATCH_SIZE <- 64
EPOCHS <- 5

train_ds <- mnist_dataset(
  DATADIR,
  download = TRUE,
  transform = transform_to_tensor
)

test_ds <- mnist_dataset(
  DATADIR,
  train = FALSE,
  transform = transform_to_tensor
)

train_dl <- dataloader(train_ds, batch_size = BATCH_SIZE)
test_dl <- dataloader(test_ds, batch_size = BATCH_SIZE)

model <- nn_module(
  "Mnist",

  initialize = function() {
    self$sequential <- nn_sequential(
      nn_linear(28 * 28, 512),
      nn_relu(),
      nn_linear(512, 10)
    )
  },

  forward = function(x) {
    x %>%
      torch_flatten(start_dim = 2) %>%
      self$sequential()
  }
)

fitted <- model %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_rmsprop,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  fit(train_dl, epochs = EPOCHS, valid_data = test_dl)

I ran both cases with the training/test data already downloaded. My GPU is a rather old Quadro P3000, but still adequate for this small problem. The Python code takes about 33s to train, while the R code takes about 2020s, so roughly 60x slower. Note that torch::cuda_is_available() returns TRUE, and judging from nvidia-smi output my GPU is busy while the R code runs, in the sense that I see increased GPU memory usage similar to when the Python code runs. The amount of GPU computation should be exactly the same in both cases: same amount of training/test data, same network, and thus the same number of coefficients to optimize. So I guess the difference comes from the R-level code in Torch for R.

I read the documentation and understand that, for example, the optimizers and other parts of Torch are implemented in R, but I don't understand why that is, i.e., why the corresponding code from libtorch is not also wrapped for use in R?
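In the meantime, a workaround along the lines of the manual-batching comment above may help here too. The following is only a sketch (it drops luz's fitting loop, assumes the dataset fits in memory, and names like x_all and net are mine): run through the dataloader once, cache everything as two large tensors on the device, and slice those tensors directly during training, bypassing the per-batch R dataset code.

device <- if (cuda_is_available()) "cuda" else "cpu"

# Materialize the whole training set as two big tensors, once.
xs <- list(); ys <- list()
coro::loop(for (b in train_dl) {
  xs[[length(xs) + 1]] <- b[[1]]
  ys[[length(ys) + 1]] <- b[[2]]
})
x_all <- torch_cat(xs)$to(device = device)
y_all <- torch_cat(ys)$to(device = device)

n <- x_all$size(1)
net <- model()$to(device = device)  # instantiate the nn_module defined above
opt <- optim_rmsprop(net$parameters)

for (epoch in 1:EPOCHS) {
  # Manual shuffling and batching over the cached tensors.
  batches <- split(sample.int(n), ceiling(seq_len(n) / BATCH_SIZE))
  for (b in batches) {
    loss <- nnf_cross_entropy(net(x_all[b, ..]), y_all[b])
    opt$zero_grad()
    loss$backward()
    opt$step()
  }
}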

For the record, C++ libtorch code completely equivalent to the above Python code takes 5s to train. Considering that most of the performance-critical code in PyTorch is actually shared with libtorch, the PyTorch performance is rather disappointing too. On the other hand, admittedly, the C++ code takes twice as long to compile as to run on my laptop: the source itself is about 120 lines, but because libtorch is heavily templated, the preprocessed file actually sent to the compiler is about 320k lines long.

(As an additional note, the equivalent Keras R code takes about 15s to train.)

cmcclellan commented 1 year ago

I want to +1 this comment. I am very interested in moving to this package, but the training times I'm experiencing are orders of magnitude longer than with Keras/TensorFlow. That moves the torch package from being my go-to package to something I might dust off every once in a while for an edge case. I hope this can be optimized soon.