RStudio session aborted while trying to train CNN model

liu-zhiyang commented 1 year ago

Hi, I am trying to train a CNN model using keras in R. I followed the "Simple CNN on CIFAR10 dataset" example. Everything runs fine except the training step: after I run the model training code, RStudio crashes.

> model %>% fit(
+   x_train, y_train,
+   batch_size = batch_size,
+   epochs = epochs,
+   validation_data = list(x_test, y_test),
+   shuffle = TRUE
+ )
Epoch 1/50
2023-08-15 15:59:47.289580: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8903

I noticed that GPU memory usage went up to 100% before the RStudio session was terminated. Do you have any idea what causes this, and how I can solve it? Thank you very much!

Here is my session info:

> tensorflow::tf_config()
TensorFlow v2.10.1 (C:\PROGRA~3\MINICO~1\lib\site-packages\tensorflow\__init__.p)
Python v3.9 (C:/ProgramData/miniconda3/python.exe)
> tensorflow::tf_gpu_configured()
2023-08-15 15:53:14.462317: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-15 15:53:15.386498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /device:GPU:0 with 21348 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:73:00.0, compute capability: 8.9
TensorFlow built with CUDA:  TRUE 
2023-08-15 15:53:15.393138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /device:GPU:0 with 21348 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:73:00.0, compute capability: 8.9
GPU device name:  /device:GPU:0
[1] TRUE
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8  LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8 LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] keras_2.11.1

loaded via a namespace (and not attached):
 [1] R6_2.5.1          base64enc_0.1-3   Matrix_1.5-4.1    lattice_0.21-8    reticulate_1.31  
 [6] magrittr_2.0.3    png_0.1-8         generics_0.1.3    cli_3.6.1         tensorflow_2.11.0
[11] grid_4.3.1        withr_2.5.0       zeallot_0.1.0     tfruns_1.5.1      compiler_4.3.1   
[16] rstudioapi_0.15.0 tools_4.3.1       whisker_0.4.1     Rcpp_1.0.11       rlang_1.1.1      
[21] jsonlite_1.8.7    stringi_1.7.12

t-kalinowski commented 1 year ago

Can you reproduce the crash outside of RStudio (i.e., running R in cmd.exe)? Running outside the IDE sometimes produces a more informative error message. With what's provided, the crash could have a variety of causes, though my best guess is a driver or DLL version mismatch.
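The log line `Loaded cuDNN version 8903` from the original report can be checked against that guess. For cuDNN 8.x the number is encoded as MAJOR * 1000 + MINOR * 100 + PATCH, so 8903 means cuDNN 8.9.3, whereas TensorFlow 2.10's tested GPU build configuration is CUDA 11.2 with cuDNN 8.1. Below is a minimal R sketch that decodes the number and flags the difference. `decode_cudnn8_version()` is a hypothetical helper, it only assumes the cuDNN 8.x encoding, and the expected "8.1" is hard-coded from TensorFlow's published tested-configurations table rather than queried from the installation.

```r
# Hypothetical helper: decode the integer cuDNN version printed by TensorFlow.
# Assumes the cuDNN 8.x encoding MAJOR * 1000 + MINOR * 100 + PATCH;
# other major versions may use a different scheme, so only 8.x is accepted.
decode_cudnn8_version <- function(v) {
  v <- as.integer(v)
  stopifnot(v >= 8000L, v < 9000L)
  major <- v %/% 1000L
  minor <- (v %% 1000L) %/% 100L
  patch <- v %% 100L
  sprintf("%d.%d.%d", major, minor, patch)
}

# The log line from the report above.
log_line <- paste(
  "2023-08-15 15:59:47.289580: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384]",
  "Loaded cuDNN version 8903"
)

# Extract the trailing number and decode it.
loaded <- decode_cudnn8_version(sub(".*cuDNN version ([0-9]+).*", "\\1", log_line))

# Assumption: hard-coded from TF's tested build configurations (TF 2.10 -> cuDNN 8.1).
expected_major_minor <- "8.1"
loaded_major_minor <- sub("^([0-9]+\\.[0-9]+).*", "\\1", loaded)

if (loaded_major_minor != expected_major_minor) {
  message("cuDNN ", loaded, " was loaded, but TF 2.10 was built against cuDNN ",
          expected_major_minor, ": possible version mismatch")
}
```

A mismatch flagged this way does not prove it caused the crash, but it is consistent with the fix reported later in the thread (moving back to cudnn 8.1 with CUDA 11.2).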

Native GPU support on Windows has been dropped in recent versions of TensorFlow; 2.10 was the last release to include it (and TF 2.14 is around the corner now). I would suggest migrating to Linux soon if possible. If you remain on Windows, I would encourage moving the workflow to WSL, where GPU support continues to be officially supported.

This article may be helpful for using RStudio with WSL: https://support.posit.co/hc/en-us/articles/360049776974-Using-RStudio-Server-in-Windows-WSL2

liu-zhiyang commented 1 year ago

@t-kalinowski Thanks for your advice. Running the same code in the R GUI crashes too. I then downgraded TensorFlow to 2.6.0 and reinstalled cudatoolkit=11.2.2 and cudnn=8.1.0.77 in the conda environment (originally, cudatoolkit=11.8 and cudnn=8.9.3.28 had been installed manually from the .exe and .zip files). These changes worked. One small problem remains: GPU memory usage stays at 100% during training and even after training has finished. I will also try RStudio under WSL. Thanks again!
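GPU memory shown at 100% is not by itself a sign of a problem: by default TensorFlow reserves nearly all of the GPU's memory when it first initializes the device and keeps it reserved for the lifetime of the session, which would explain why usage stays at 100% even after training. If you prefer usage to reflect what the model actually needs, TensorFlow's memory-growth option can be enabled from R. This is a minimal sketch, assuming the `tensorflow` R package is installed; `enable_gpu_memory_growth()` is a hypothetical helper name, and this is not a confirmed fix for the original crash.

```r
# Hypothetical helper: ask TensorFlow to allocate GPU memory on demand instead of
# reserving (almost) all of it up front, which is TensorFlow's default behaviour.
# Must be called before any tensor or model touches the GPU; afterwards
# TensorFlow refuses to change the setting and raises a RuntimeError.
enable_gpu_memory_growth <- function() {
  tf <- tensorflow::tf
  gpus <- tf$config$list_physical_devices("GPU")
  for (gpu in gpus) {
    tf$config$experimental$set_memory_growth(gpu, TRUE)
  }
  # Return the number of GPUs configured, invisibly.
  invisible(length(gpus))
}
```

In a fresh session, call `enable_gpu_memory_growth()` right after `library(keras)` and before building the model; restarting R is needed if the GPU has already been initialized.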

rstudio/keras3 issue #1372