mlverse / torch

R Interface to Torch
https://torch.mlverse.org

R crash #822

Open philgoucou opened 2 years ago

philgoucou commented 2 years ago

This is probably not the finest nor most precise question... but here we go.

In some applications we are bootstrapping a torch-based model (nothing too fancy, mostly some variant of a fully connected net), so basically estimating it many times (typically <100). On most computers, R ends up crashing or freezing at some point during some bootstrap iteration. Sometimes even the whole computer crashes. The iteration at which it fails is random: sometimes it is after 3 runs, sometimes after 60, but it always crashes if we run enough of them. Quite interestingly, we have noticed this never occurs on M1 Macs of any kind, nor on Colab. However, it fails on other Macs and on every PC we tried.

If there are any insights on what might be going on, we'll gladly take them, as this is a substantial obstacle to sharing our torch-based code in R.

dfalbel commented 2 years ago

hi @philgoucou !

A few questions that might help track down the problem:

philgoucou commented 2 years ago

Thanks for your quick answers.

  1. No
  2. Indeed, we are using autograd features without detach(). Where should that be included?
  3. Yes
dfalbel commented 2 years ago

So a simple example would be:

total_loss <- 0
for (i in 1:1000) {
    optimizer$zero_grad()
    output <- model(input)
    loss <- criterion(output)
    loss$backward()
    optimizer$step()
    total_loss <- total_loss + loss
}

In this case, total_loss is holding the entire history of computations, because you could still want to call total_loss$backward() at some point. One way of fixing it is total_loss <- total_loss + loss$detach(), or converting the loss to an R value (loss$item() or as.numeric(loss)).

The same happens if you are storing loss values in a list, e.g. if instead of accumulating in total_loss you were doing something like losses <- c(losses, loss) or similar.
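
For reference, here is the same loop with the history detached (only a sketch; it assumes model, input, criterion and optimizer are defined as above):

total_loss <- 0
losses <- numeric(0)
for (i in 1:1000) {
    optimizer$zero_grad()
    output <- model(input)
    loss <- criterion(output)
    loss$backward()
    optimizer$step()
    total_loss <- total_loss + loss$item()   # plain R numeric, no graph retained
    losses <- c(losses, as.numeric(loss))    # same idea for a running vector
}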

philgoucou commented 2 years ago

Unfortunately, we do nothing of the kind (except keeping the single best out-of-bag loss through loss$item() for early-stopping purposes), and yet memory usage keeps increasing. Is there any other possible cause for this? Thanks

dfalbel commented 2 years ago

Hard to tell without seeing the code. Maybe a closure is capturing the environment where the loss value is computed and not allowing it to be garbage collected?
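
For instance (a purely hypothetical pattern, not something I know is in your code), a logging or callback closure defined inside the training function keeps its enclosing environment alive, and with it every tensor bound there:

make_logger <- function(loss) {
    # the returned function captures the environment holding `loss`,
    # so the tensor (and the graph attached to it) cannot be garbage
    # collected while the logger itself is still referenced
    function() cat("last loss:", loss$item(), "\n")
}
logger <- make_logger(loss)   # `loss` now lives as long as `logger` does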

philgoucou commented 2 years ago

As I mentioned, it is rather plain. But here is a simplified version of the training loop:

patience = patience
wait = 0
oob_index <- c(1:x_train$size()[1])[-training_index]

best_epoch = 0
best_loss = NA

criterion = nn_mse_loss()
optimizer = optim_adam(model$parameters, lr = lr)

for (i in 1:epochs) {

  optimizer$zero_grad() # Start by setting the gradients to zero

  y_pred=model(x_train[training_index,])[[1]]
  loss=criterion(y_pred,y_train[training_index])

  y_pred_oob=model(x_train[oob_index,])[[1]]
  loss_oob=criterion(y_pred_oob,y_train[oob_index])

  percentChange <- ((best_loss - loss_oob$item())/loss_oob$item())

  # Early Stopping
  if(best_loss > loss_oob$item() | i == 1) { #best_loss > loss_oob$item()
    best_loss=loss_oob$item()
    best_epoch=i
    best_model=model

    if(percentChange > tol | i == 1) {
      wait=0
    }else {
      wait=wait+1
    }

  }else{

    wait=wait+1

  }

  if(show_train==1) {

    # Check Training
    if(i %% 1 == 0) {
      cat(" Epoch:", i, "Loss: ", loss$item(),", Val Loss: ",loss_oob$item(), "(PercentChange: ",round(percentChange,3),")", "\n")
    }

  }

  if(wait > patience) {
    if(show_train==1) {
      cat("Best Epoch at:", best_epoch, "\n")
    }
    break
  }

  loss$backward()  # Backpropagation step
  optimizer$step() # Update the parameters
}

return(best_model) # Return the model with the best val loss

Also, by doing some experiments, we found that memory usage indeed increases sharply with the number of epochs (but not with the number of bootstraps, and not that much with network width). Thus, it appears something is getting accumulated along the way. I can also share other bits of the code, but to my understanding this is the relevant part.
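
One way we could narrow this down further is to log the process memory every few epochs inside the loop (a rough sketch, assuming the ps package is available; tensor storage is allocated by libtorch and typically does not show up in R's own gc() figures, so the process RSS is more informative):

if (i %% 10 == 0) {
    gc(full = TRUE)                                    # force an R-level collection first
    rss_mb <- ps::ps_memory_info()[["rss"]] / 1024^2   # resident set size of the R process
    cat("epoch", i, "- RSS:", round(rss_mb), "MB\n")
}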

dfalbel commented 2 years ago

You might want to wrap the computation of the OOB loss in a with_no_grad context, so you don't keep the graph history for it.

with_no_grad({
  y_pred_oob=model(x_train[oob_index,])[[1]]
  loss_oob=criterion(y_pred_oob,y_train[oob_index])
})

Also, if you're storing y_pred_oob somewhere, it might also be leaking memory (if it's created inside the no-grad context, or $detach()ed, it's fine to store it).
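
For example, keeping per-epoch OOB predictions could look like this (just a sketch; oob_preds is a name I made up):

oob_preds <- list()   # created once, before the epoch loop

# inside the epoch loop:
with_no_grad({
  y_pred_oob <- model(x_train[oob_index,])[[1]]
  loss_oob <- criterion(y_pred_oob, y_train[oob_index])
})
oob_preds[[i]] <- y_pred_oob   # safe to keep: computed under no_grad, no graph attached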

Also the line:

best_model = model

won't work correctly. Since model is an R6-like object and is modified in place, best_model will always be identical to the current model. You would need to use something like:

state_best_model <- lapply(model$state_dict(), function(x) x$clone())

And reload when done, with model$load_state_dict(state_best_model).
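
Put together with your early-stopping check, that would look roughly like this (a sketch using the variable names from your simplified loop):

if(best_loss > loss_oob$item() | i == 1) {
  best_loss=loss_oob$item()
  best_epoch=i
  # snapshot the weights instead of best_model = model
  best_state <- lapply(model$state_dict(), function(x) x$clone())
}

# after the training loop, restore the best weights
model$load_state_dict(best_state)
return(model)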

philgoucou commented 2 years ago

So, we tried both of those fixes (mostly the first), but they did not solve the issue. Stranger still, we found that on M1 Macs memory usage does not increase, whereas it does on Windows machines and older Macs. So the occurrence of the memory leak appears to depend on the OS. Any idea what could be behind this difference?

dfalbel commented 2 years ago

I can't think of anything that could cause differences between old Macs and new Macs. I'm interested in the problem though; is there any way you could provide a reproducible example so I can try debugging on my side?

Thank you!

philgoucou commented 2 years ago

Sorry for the delay; we can finally share one such example, and we have also experimented with it a bit. The code is available here. Some observations we made:

  1. It runs in under 8 minutes on an M1 Mac, and memory usage by R stays under 400 MB the whole time.
  2. It typically fails well before 50% of the bootstraps are done on any Intel-based computer we tried. The memory usage of R keeps increasing until the session fails, even though we detach gradients and call gc() after every bootstrap.
  3. When failure occurs, we sometimes get

There is insufficient memory for the Java Runtime Environment to continue. Native memory allocation (malloc) failed to allocate 1143696 bytes for Chunk::new Possible reasons: The system is out of physical RAM or swap space [..........]

or the R session crashes, or sometimes the computer itself.

  4. The code runs without any issue on Colab.
  5. The code also runs without any issue on a custom-built computer with an AMD processor.