rstudio / ai-blog

Repository for the RStudio AI Blog (formerly: TensorFlow for R Blog)
https://blogs.rstudio.com/ai/

Cannot run `torch` intro code, see traceback attached #148

Closed · gsgxnet closed this 3 years ago

gsgxnet commented 3 years ago

Trying to run the code chunks from the blog post
https://github.com/rstudio/ai-blog/tree/master/_posts/2020-09-29-introducing-torch-for-r
I get a reproducible failure at the central code chunk:

for (epoch in 1:5) {

  l <- c()

  for (b in enumerate(train_dl)) {
    # make sure each batch's gradient updates are calculated from a fresh start
    optimizer$zero_grad()
    # get model predictions
    output <- model(b[[1]]$to(device = "cuda"))
    # calculate loss
    loss <- nnf_cross_entropy(output, b[[2]]$to(device = "cuda"))
    # calculate gradient
    loss$backward()
    # apply weight updates
    optimizer$step()
    # track losses
    l <- c(l, loss$item())
  }

  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}

I am trying this with the torch and torchvision packages installed today from GitHub, as suggested at the beginning of the blog post.

Error in parent.env(x)[["batch"]][[name]] : object of type 'symbol' is not subsettable
40. `[[.enum_env`(b, 1)
39. b[[1]]
38. mget(x = c("input", "weight", "bias", "stride", "padding", "dilation", "groups"))
37. torch_conv2d(input = input, weight = weight, bias = bias, stride = stride, padding = padding, dilation = dilation, groups = groups)
36. nnf_conv2d(input, weight, self$bias, self$stride, self$padding, self$dilation, self$groups)
35. self$conv_forward_(input, self$weight)
34. self$conv1(.)
33. mget(x = c("self"))
32. torch_relu(input)
31. nnf_relu(.)
30. mget(x = c("input", "weight", "bias", "stride", "padding", "dilation", "groups"))
29. torch_conv2d(input = input, weight = weight, bias = bias, stride = stride, padding = padding, dilation = dilation, groups = groups)
28. nnf_conv2d(input, weight, self$bias, self$stride, self$padding, self$dilation, self$groups)
27. self$conv_forward_(input, self$weight)
26. self$conv2(.)
25. mget(x = c("self"))
24. torch_relu(input)
23. nnf_relu(.)
22. mget(x = c("self", "kernel_size", "stride", "padding", "dilation", "ceil_mode"))
21. torch_max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
20. nnf_max_pool2d(., 2)
19. mget(x = c("input", "p", "train"))
18. torch_feature_dropout(input, p, training)
17. nnf_dropout2d(input, self$p, self$training, self$inplace)
16. self$dropout1(.)
15. mget(x = c("self", "dims", "start_dim", "end_dim", "out_dim"))
14. torch_flatten(., start_dim = 2)
13. nnf_linear(input, self$weight, self$bias)
12. self$fc1(.)
11. mget(x = c("self"))
10. torch_relu(input)
9. nnf_relu(.)
8. mget(x = c("input", "p", "train"))
7. torch_feature_dropout(input, p, training)
6. nnf_dropout2d(input, self$p, self$training, self$inplace)
5. self$dropout2(.)
4. nnf_linear(input, self$weight, self$bias)
3. self$fc2(.)
2. x %>% self$conv1() %>% nnf_relu() %>% self$conv2() %>% nnf_relu() %>% nnf_max_pool2d(2) %>% self$dropout1() %>% torch_flatten(start_dim = 2) %>% self$fc1() %>% nnf_relu() %>% self$dropout2() %>% self$fc2()
1. model(b[[1]]$to(device = "cuda"))

R version:

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-suse-linux-gnu (64-bit)

RStudio 1.4.1623 (from the dailies)

That the rsession uses CUDA properly has been verified by:

torch::torch_tensor(1, device = "cuda")

and

/usr/local/cuda-10.2/extras/demo_suite> nvidia-smi
Sat Mar 13 14:29:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M1200        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    N/A /  N/A |   1054MiB /  4046MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     13898      C   .../lib/rstudio/bin/rsession      515MiB |
|    0   N/A  N/A     25740      C   .../lib/rstudio/bin/rsession      535MiB |
+-----------------------------------------------------------------------------+
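
For completeness, CUDA visibility can also be checked from torch itself; a quick sketch, assuming the helper functions cuda_is_available() and cuda_device_count() that I believe the package exports:

library(torch)

# TRUE if this torch build can see a CUDA device
cuda_is_available()

# number of GPUs torch can see (should be 1 on this machine)
cuda_device_count()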

I do not know enough about all this to debug the issue myself. I think this native torch package for R is a great way to get up-to-date NNs going in R. If my issue with the sample code is a general problem, it might hamper the success of the package, as it is a big hurdle at the start of a torch-in-R journey.

gsgxnet commented 3 years ago

To narrow down the error, I tried a shortened loop:

for (b in enumerate(train_dl)) {
  optimizer$zero_grad()
  output <- model(b[[1]]$to(device = "cuda"))
}

It fails in nearly the same way.

Error in mget(x = c("input", "weight", "bias", "stride", "padding", "dilation",  :
  attempt to apply non-function
38. mget(x = c("input", "weight", "bias", "stride", "padding", "dilation", "groups"))
37. torch_conv2d(input = input, weight = weight, bias = bias, stride = stride, padding = padding, dilation = dilation, groups = groups)
36. nnf_conv2d(input, weight, self$bias, self$stride, self$padding, self$dilation, self$groups)
35. self$conv_forward_(input, self$weight)
34. self$conv1(.)
33. mget(x = c("self"))
32. torch_relu(input)
31. nnf_relu(.)
...

The output of str(b) is:
Class 'enum_env' <environment: 0x55b8ff8ce788>
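
The most I could do was list what lives in that environment with base R tools (a sketch, assuming an enum_env can be inspected like a plain environment):

# list the bindings in the enum_env itself and in its parent
ls(envir = b)
ls(envir = parent.env(b))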

To me, this offers no pointer toward a solution. Anyone?

skeydan commented 3 years ago

Hi, sorry for that.

I will update this and a few other older posts. Please use coro::loop instead of enumerate, like so:

coro::loop(for (b in train_dl) {
  optimizer$zero_grad()
  output <- model(b[[1]]$to(device = "cuda"))
})

This is an instance of nondeterministic behavior that is not inherent to torch itself, and it does not happen when coro is used for iteration.
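
Alternatively, if you want batch-by-batch control, explicit iteration should work too (an untested sketch using dataloader_make_iter() and dataloader_next()):

# create an iterator over the dataloader and pull batches manually
it <- dataloader_make_iter(train_dl)
b <- dataloader_next(it)

# dataloader_next() returns NULL once the iterator is exhausted
while (!is.null(b)) {
  # ... training step on b ...
  b <- dataloader_next(it)
}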

gsgxnet commented 3 years ago

Thank you, I can confirm the modified code works fine. My adapted main epoch loop (with 2 extra epochs) now looks like this:

for (epoch in 1:7) {

  l <- c()

  coro::loop(for (b in train_dl) {
    # make sure each batch's gradient updates are calculated from a fresh start
    optimizer$zero_grad()
    # get model predictions
    output <- model(b[[1]]$to(device = "cuda"))
    # calculate loss
    loss <- nnf_cross_entropy(output, b[[2]]$to(device = "cuda"))
    # calculate gradient
    loss$backward()
    # apply weight updates
    optimizer$step()
    # track losses
    l <- c(l, loss$item())
  })

  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}

and when I run it, I get:

Loss at epoch 1: 0.410523
Loss at epoch 2: 0.205173
Loss at epoch 3: 0.154265
Loss at epoch 4: 0.130127
Loss at epoch 5: 0.106498
Loss at epoch 6: 0.092929
Loss at epoch 7: 0.081178
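
As a side note, the same loop should also run on machines without a GPU if the device is picked once up front; a sketch, assuming cuda_is_available() reports CUDA support correctly:

library(torch)

# fall back to the CPU when no CUDA device is present
device <- if (cuda_is_available()) "cuda" else "cpu"

# the model has to live on the same device as the batches
model$to(device = device)

coro::loop(for (b in train_dl) {
  optimizer$zero_grad()
  output <- model(b[[1]]$to(device = device))
  loss <- nnf_cross_entropy(output, b[[2]]$to(device = device))
  loss$backward()
  optimizer$step()
})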

gsgxnet commented 3 years ago

An extra comment: having modified the following code chunks as well, I get much better accuracy:

test_losses <- c()
total <- 0
correct <- 0

# see above
coro::loop(for (b in train_dl)  {
  output <- model(b[[1]]$to(device = "cuda"))
  labels <- b[[2]]$to(device = "cuda")
  loss <- nnf_cross_entropy(output, labels)
  test_losses <- c(test_losses, loss$item())
  # torch_max returns a list, with position 1 containing the values 
  # and position 2 containing the respective indices
  predicted <- torch_max(output$data(), dim = 2)[[2]]
  total <- total + labels$size(1)
  # add number of correct classifications in this batch to the aggregate
  correct <- correct + (predicted == labels)$sum()$item()
} )
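
To make sure I read the torch_max() return value correctly, I checked it on a tiny made-up tensor (values are arbitrary):

t <- torch_tensor(matrix(c(0.1, 0.9,
                           0.8, 0.2), nrow = 2, byrow = TRUE))
m <- torch_max(t, dim = 2)
m[[1]]  # the per-row maximum values: 0.9, 0.8
m[[2]]  # their (1-based) column indices: 2, 1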

mean(test_losses)

[1] 0.01363155

test_accuracy <- correct / total
test_accuracy

[1] 0.9961

As far as I understand your code, this is the accuracy on the test data set. If so, we now get nearly perfect accuracy, don't we?