mlverse / torch

R Interface to Torch
https://torch.mlverse.org

GPU runtime + memory leakage #1184

Open · MaximilianPi opened this issue 3 months ago

MaximilianPi commented 3 months ago

Hi @dfalbel,

I am implementing an autoregressive model that has to run in a for loop, and I have encountered a problem when running it on the GPU (and on the CPU). For large data there is a threshold (at some iteration i of the loop) where the runtime suddenly increases many times over and memory usage starts to grow until it is exhausted. Here is a minimal example (reproducing it may depend on the data and the GPU):

library(torch)

device = "cuda:0"
B = torch::torch_rand(size = c(1000L, 500L, 100L), device = device)
A = torch_ones_like(B, device = device)
Parameter = torch_tensor(0.1, requires_grad = TRUE, device = device)

res = numeric(100) # per-iteration elapsed times
for (e in 1:100) {
  print(e)
  tt = system.time({
    pred = 1 - torch_sigmoid(B + (1.0 - Parameter * B))
    A = A + pred
    # A$add_(pred) # in-place does not help
  })
  res[e] = tt[3] # "elapsed" time of this iteration
}

plot(res, xlab = "epochs", ylab = "runtime" )

[Figure: per-iteration runtime over 100 epochs; after a threshold iteration the runtime jumps sharply]

Any ideas what might be happening? (The problem, i.e. memory leakage and a sharp slowdown, also occurs on the CPU, but less severely.)

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] torch_0.13.0

loaded via a namespace (and not attached):
 [1] processx_3.8.0  bit_4.0.5       FINN_0.0.900    compiler_4.2.3  R6_2.5.1        magrittr_2.0.3  cli_3.6.1       tools_4.2.3     rstudioapi_0.14
[10] Rcpp_1.0.10     bit64_4.0.5     coro_1.0.3      callr_3.7.3     ps_1.7.3        rlang_1.1.3    

GPU: NVIDIA A5000, CUDA: 11.7

dfalbel commented 3 months ago

Hi @MaximilianPi ,

I believe this is expected, unfortunately. When building autoregressive models, since tensors with requires_grad = TRUE take part in the computation, torch stores the full computation graph so that it can (at some point) compute the derivative of A with respect to Parameter. Because A is re-assigned through a new operation at every iteration, that graph, and with it the memory usage, keeps growing as the loop runs.

The problem might be more visible on the GPU because, once CUDA memory gets scarce, torch starts calling R's garbage collector at every iteration to try to free some more memory. You can read more about how to tune this here: https://torch.mlverse.org/docs/articles/memory-management#cuda
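For reference, a minimal sketch of doing that housekeeping by hand inside the loop above (gc() drops R handles to tensors that are no longer referenced; cuda_empty_cache() returns cached, unused CUDA blocks to the device):

for (e in 1:100) {
  pred = 1 - torch_sigmoid(B + (1.0 - Parameter * B))
  A = A + pred
  if (e %% 10 == 0) {
    gc()               # collect unreferenced R-side tensor handles
    cuda_empty_cache() # release cached CUDA memory held by the allocator
  }
}

Note that this won't help here as long as A keeps its full graph alive; it only controls when collection happens.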

Can you post how you are training your model? A common source of this issue is that you actually need to call A$detach() at some point to avoid holding the full graph of computations.
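For illustration, a minimal sketch of that pattern; the loss and optimizer below are hypothetical placeholders, since the real training code isn't shown:

library(torch)

device = if (cuda_is_available()) "cuda:0" else "cpu"
B = torch_rand(c(100L, 50L, 10L), device = device)
A = torch_ones_like(B)
Parameter = torch_tensor(0.1, requires_grad = TRUE, device = device)
opt = optim_adam(list(Parameter), lr = 0.01)

for (e in 1:100) {
  opt$zero_grad()
  pred = 1 - torch_sigmoid(B + (1.0 - Parameter * B))
  A = A + pred
  loss = A$mean()  # placeholder loss, for illustration only
  loss$backward()
  opt$step()
  A = A$detach()   # cut the graph so it doesn't accumulate across iterations
}

With the detach at the end of each iteration, backward() only has to traverse the graph built during that iteration, so runtime and memory stay flat.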