Open safalabolo opened 2 months ago
I think your issue is related to GPU memory. Based on the printout you provided, it looks like you are using an NVIDIA T400 GPU with 2 GB of memory. I would first recommend running the example with a smaller mini-batch size. In our classificationExampleLCAI.R example, we use a mini-batch size of 15 chips for both the training and validation sets. This is defined within torch::dataloader(). That mini-batch size may be too large for your hardware, so try a smaller one (maybe 2) to see if the code will execute.
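As a minimal sketch of where that setting lives, the snippet below builds a toy dataset and passes a small batch_size to torch::dataloader(). The dataset (30 random 3-band chips of 64 x 64 pixels) and the names chips, toy_ds, and small_dl are hypothetical stand-ins, not the objects used in classificationExampleLCAI.R; only the batch_size argument is the point.

```r
library(torch)

# Hypothetical stand-in data: 30 random 3-band chips of 64 x 64 pixels.
# The real example reads its chips from disk instead.
chips  <- torch_randn(30, 3, 64, 64)
toy_ds <- tensor_dataset(chips)

# batch_size is the argument to lower: 15 is the value used in the example
# script, 2 is the small value suggested above for a 2 GB GPU.
small_dl <- dataloader(toy_ds, batch_size = 2, shuffle = TRUE)

# Number of mini-batches the loader yields per epoch
length(small_dl)
```

A smaller batch means more iterations per epoch, so training is slower, but each step needs less GPU memory.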
Hi,
I'm encountering a "CUDA out of memory" error when running the classificationExampleLCAI.R script.
The only modification I've made to the script is changing the setwd to correctly point to my working directory.
From the R terminal:
Epoch 1/10
Error in (function (self, min_val, max_val) :
CUDA out of memory. Tried to allocate 360.00 MiB (GPU 0; 2.00 GiB total capacity; 1.62 GiB already allocated; 0 bytes free; 1.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\c10\cuda\CUDACachingAllocator.cpp:936 (most recent call first):
00007FF9B18BD242 00007FF9B18BD1E0 c10.dll!c10::Error::Error [ @ ]
00007FF9B1884DF5 00007FF9B1884D80 c10.dll!c10::OutOfMemoryError::OutOfMemoryError [ @ ]
00007FF9ABC5DDFC 00007FF9ABC5C490 c10_cuda.dll!c10::cuda::CUDAStream::id [ @ ]
00007FF9ABC5DE87 00007FF9ABC5C490 c10_cuda.dll!c10::cuda::CUDAStream::id [ @ ]
00007FF9ABC58294 00007FF9ABC51D60 c10_cuda.dll!c10::Fre
Calls: fit ... call_c_function -> do_call -> do.call -> **
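The numbers in this message can be checked with some quick arithmetic. The sketch below uses only figures quoted in this post (the error message and the nvidia-smi output further down) and assumes, as a simplification, that memory already held by other processes is not available to torch.

```r
# Figures copied from the error message and the nvidia-smi output in this post.
total_mib     <- 2.00 * 1024   # "2.00 GiB total capacity"
allocated_mib <- 1.62 * 1024   # "1.62 GiB already allocated"
requested_mib <- 360           # "Tried to allocate 360.00 MiB"
desktop_mib   <- 335           # "335MiB / 2048MiB" reported by nvidia-smi

# Memory torch would hold if the failed allocation had succeeded
needed_mib <- allocated_mib + requested_mib

# Assumption: memory used by other processes is not available to torch.
available_mib <- total_mib - desktop_mib

cat(sprintf("needed: %.0f MiB, available: about %.0f MiB\n",
            needed_mib, available_mib))
needed_mib > available_mib
```

Under these assumptions the request exceeds what the card can provide, which is consistent with the "0 bytes free" reported in the message.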
I consulted the Torch memory management documentation and tried several suggestions related to CUDA settings, but unfortunately, none of them helped resolve the issue.
Are there any additional suggestions for handling this memory issue, or a recommended way to skip or handle the error more gracefully without stopping execution entirely? Thanks!
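One possible way to handle such an error without aborting the whole script is to wrap the training call in tryCatch() and retry with a smaller batch size. The sketch below is only an illustration in base R: fit_with_batch_size() is a hypothetical stand-in for the code that builds the dataloaders and calls fit, and here it merely simulates an out-of-memory failure for batch sizes above 4.

```r
# Hypothetical stand-in for the code that builds the dataloaders and calls fit.
# It simulates a GPU that runs out of memory for batch sizes above 4.
fit_with_batch_size <- function(batch_size) {
  if (batch_size > 4) {
    stop("CUDA out of memory. Tried to allocate 360.00 MiB")
  }
  paste("fit finished with batch_size =", batch_size)
}

# Try a batch size; on an out-of-memory error, halve it and try again.
fit_with_fallback <- function(start_batch_size, min_batch_size = 1) {
  batch_size <- start_batch_size
  while (batch_size >= min_batch_size) {
    result <- tryCatch(
      fit_with_batch_size(batch_size),
      error = function(e) {
        # Re-raise anything that is not an out-of-memory error.
        if (!grepl("out of memory", conditionMessage(e), fixed = TRUE)) stop(e)
        message("Out of memory at batch_size = ", batch_size,
                "; retrying with a smaller batch")
        NULL
      }
    )
    if (!is.null(result)) {
      return(list(batch_size = batch_size, result = result))
    }
    batch_size <- batch_size %/% 2
  }
  stop("Out of memory even at batch_size = ", min_batch_size)
}

outcome <- fit_with_fallback(15)
outcome$batch_size
```

In a real session the GPU memory held by the failed attempt may also need to be released before retrying, so this pattern is a starting point rather than a guaranteed fix.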
P.S.
1) I followed the installation instructions provided here and installed CUDA 11.3 and cuDNN 8.4 as per the support matrix.
2) Running the following command: C:\Users\Utente>nvidia-smi
gives the following output:
Mon Sep 23 15:05:08 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.01       Driver Version: 516.01       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400        WDDM  | 00000000:01:00.0  On |                  N/A |
| 38%   49C    P8    N/A /  31W |    335MiB /  2048MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3548    C+G   ...Files\RStudio\rstudio.exe    N/A      |
|    0   N/A  N/A      5744    C+G   ...me\Application\chrome.exe    N/A      |
|    0   N/A  N/A      6428    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A      6628    C+G   ...wekyb3d8bbwe\Video.UI.exe    N/A      |
|    0   N/A  N/A      9120    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      9224    C+G   ...ge\Application\msedge.exe    N/A      |
|    0   N/A  N/A      9640    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     10088    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     10268    C+G   ...qxf38zg5c\Skype\Skype.exe    N/A      |
|    0   N/A  N/A     10716    C+G   ...me\Application\chrome.exe    N/A      |
|    0   N/A  N/A     11716    C+G   ...Spark\CiscoCollabHost.exe    N/A      |
|    0   N/A  N/A     12120    C+G   ...oft\OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     13640    C+G   ...lPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A     14116    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     14284    C+G   ...3d8bbwe\CalculatorApp.exe    N/A      |
|    0   N/A  N/A     15040    C+G   ...e\PhoneExperienceHost.exe    N/A      |
|    0   N/A  N/A     15760    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     16184    C+G   ...739.79\msedgewebview2.exe    N/A      |
+-----------------------------------------------------------------------------+
3) Running the nvcc --version command shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_19:00:59_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
4) I tried running the example provided here, and RStudio did not crash after running the fit step.
5) I am using CUDA, and the cuda_is_available() command returns TRUE.
6) Also, this simple code runs without issues:
torch_tensor(1, device = "cuda")
which outputs:
torch_tensor 1 [ CUDAFloatType{1} ]
7) Session Info:
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Italian_Italy.utf8 LC_CTYPE=Italian_Italy.utf8 LC_MONETARY=Italian_Italy.utf8 LC_NUMERIC=C LC_TIME=Italian_Italy.utf8
time zone: Europe/Rome
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.1 luz_0.4.0 torch_0.13.0 dplyr_1.1.4 geodl_0.2.0
loaded via a namespace (and not attached):
[1] crayon_1.5.3 terra_1.7-78 vctrs_0.6.5 cli_3.6.3 zeallot_0.1.0 rlang_1.1.4 processx_3.8.4 generics_0.1.3 coro_1.0.4 glue_1.7.0
[11] bit_4.5.0 prettyunits_1.2.0 colorspace_2.1-1 ps_1.8.0 hms_1.1.3 scales_1.3.0 fansi_1.0.6 grid_4.4.1 munsell_0.5.1 tibble_3.2.1
[21] progress_1.2.3 lifecycle_1.0.4 compiler_4.4.1 codetools_0.2-20 fs_1.6.4 Rcpp_1.0.13 pkgconfig_2.0.3 rstudioapi_0.16.0 R6_2.5.1 tidyselect_1.2.1
[31] utf8_1.2.4 pillar_1.9.0 callr_3.7.6 magrittr_2.0.3 withr_3.0.1 gtable_0.3.5 tools_4.4.1 bit64_4.0.5
8) Environment variables:
Sys.getenv()
ALLUSERSPROFILE C:\ProgramData APPDATA C:\Users\Utente\AppData\Roaming CLICOLOR_FORCE 1 CommonProgramFiles C:\Program Files\Common Files CommonProgramFiles(x86) C:\Program Files (x86)\Common Files CommonProgramW6432 C:\Program Files\Common Files COMPUTERNAME DESKTOP-R4FSQGB ComSpec C:\Windows\system32\cmd.exe CUDA_MODULE_LOADING LAZY CUDA_PATH C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7 CUDA_PATH_V11_7 C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7 CURL_CA_BUNDLE C:/PROGRA~1/R/R-44~1.1/etc/curl-ca-bundle.crt DISPLAY :0 DriverData C:\Windows\System32\Drivers\DriverData GFORTRAN_STDERR_UNIT -1 GFORTRAN_STDOUT_UNIT -1 GIT_ASKPASS rpostback-askpass HOME C:\Users\Utente\Documents HOMEDRIVE C: HOMEPATH \Users\Utente JD2_HOME C:\Users\Utente\AppData\Local\JDownloader 2.0 LOCALAPPDATA C:\Users\Utente\AppData\Local LOGONSERVER \DESKTOP-R4FSQGB MPLENGINE tkAgg MSYS2_ENV_CONV_EXCL R_ARCH NUMBER_OF_PROCESSORS 16 NVTOOLSEXT_PATH C:\Program Files\NVIDIA Corporation\NvToolsExt\ OneDrive C:\Users\Utente\OneDrive OneDriveConsumer C:\Users\Utente\OneDrive ORIGINAL_XDG_CURRENT_DESKTOP undefined OS Windows_NT PATH C:\rtools44\x86_64-w64-mingw32.static.posix\bin;C:\rtools44\usr\bin;C:\Program Files\R\R-4.4.1\bin\x64;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\libnvvp;C:\Program Files\NVIDIA\CUDNN\v9.4\bin;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\PuTTY\;C:\Program Files\dotnet\;C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2022.2.0\;C:\Program Files (x86)\NVIDIA 
Corporation\PhysX\Common;C:\Users\Utente\AppData\Local\Programs\Python\Python312\Scripts\;C:\Users\Utente\AppData\Local\Programs\Python\Python312\;C:\Users\Utente\AppData\Local\Programs\Python\Launcher\;C:\Users\Utente\AppData\Local\Microsoft\WindowsApps;C:\Users\Utente\anaconda3;C:\Users\Utente\anaconda3\Scripts;C:\Users\Utente\anaconda3\condabin;C:\Program Files\snap\bin;C:\Users\Utente.dotnet\tools;C:\Program Files\RStudio\resources\app\bin\quarto\bin;C:\Program Files\RStudio\resources\app\bin\postback PATHEXT .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC PROCESSOR_ARCHITECTURE AMD64 PROCESSOR_IDENTIFIER Intel64 Family 6 Model 167 Stepping 1, GenuineIntel PROCESSOR_LEVEL 6 PROCESSOR_REVISION a701 ProgramData C:\ProgramData ProgramFiles C:\Program Files ProgramFiles(x86) C:\Program Files (x86) ProgramW6432 C:\Program Files PSModulePath C:\Program Files\WindowsPowerShell\Modules;C:\Windows\system32\WindowsPowerShell\v1.0\Modules PUBLIC C:\Users\Public PYTHONIOENCODING utf-8 R_ARCH /x64 R_CLI_HAS_HYPERLINK_IDE_HELP true R_CLI_HAS_HYPERLINK_IDE_RUN true R_CLI_HAS_HYPERLINK_IDE_VIGNETTE true R_COMPILED_BY gcc 13.2.0 R_DOC_DIR C:/PROGRA~1/R/R-44~1.1/doc R_HOME C:/PROGRA~1/R/R-44~1.1 R_INCLUDE_DIR C:/PROGRA~1/R/R-44~1.1/include R_LIBS_SITE C:/PROGRA~1/R/R-44~1.1/site-library R_LIBS_USER C:\Users\Utente\AppData\Local/R/win-library/4.4 R_PLATFORM
R_RTOOLS44_PATH C:\rtools44/x86_64-w64-mingw32.static.posix/bin;C:\rtools44/usr/bin R_RUNTIME ucrt R_SHARE_DIR C:/PROGRA~1/R/R-44~1.1/share R_USER C:/Users/Utente/Documents RMARKDOWN_MATHJAX_PATH C:/Program Files/RStudio/resources/app/resources/mathjax-27 RS_LOCAL_PEER \.\pipe\29631-rsession RS_LOG_LEVEL WARN RS_RPOSTBACK_PATH C:/Program Files/RStudio/resources/app/bin/rpostback.exe RS_SHARED_SECRET 5115456a-4cea-468e-9afa-80183f5f6779 RSTUDIO 1 RSTUDIO_CLI_HYPERLINKS true RSTUDIO_CONSOLE_COLOR 256 RSTUDIO_CONSOLE_WIDTH 200 RSTUDIO_DESKTOP_EXE C:\Program Files\RStudio\rstudio.exe RSTUDIO_MSYS_SSH C:/Program Files/RStudio/resources/app/bin/msys-ssh-1000-18 RSTUDIO_PANDOC C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools RSTUDIO_PROGRAM_MODE desktop RSTUDIO_SESSION_PID 9196 RSTUDIO_SESSION_PORT 29631 RSTUDIO_USER_IDENTITY Utente RSTUDIO_WINUTILS C:/Program Files/RStudio/resources/app/bin/winutils RTOOLS40_HOME C:\RBuildTools\4.0 RTOOLS44_HOME C:\rtools44 SESSIONNAME Console SSH_ASKPASS rpostback-askpass SystemDrive C: SystemRoot C:\Windows TEMP C:\Users\Utente\AppData\Local\Temp TERM xterm-256color TMP C:\Users\Utente\AppData\Local\Temp USERDOMAIN DESKTOP-R4FSQGB USERDOMAIN_ROAMINGPROFILE DESKTOP-R4FSQGB USERNAME Utente USERPROFILE C:\Users\Utente windir C:\Windows
9) My nvcc path:
system("which nvcc", intern = TRUE)
returns:
[1] "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.7/bin/nvcc"