cuda_is_available() gives TRUE, and torch_tensor(1, device = "cuda") gives

torch_tensor
1
[ CUDAFloatType{1} ]
I tried a few things, including installing CUDA 11.1 (instead of 10.2) and cuDNN 8.1.1. I followed these instructions: https://docs.nvidia.com/cuda/archive/11.1.1/cuda-quick-start-guide/index.html.
I then installed the dev version of torch and forced the CUDA 11.1 build.
remotes::install_github("mlverse/torch")
Sys.setenv(CUDA="11.1")
library(torch)
trying URL 'https://download.pytorch.org/libtorch/cu111/libtorch-win-shared-with-deps-1.9.1%2Bcu111.zip'
Content type 'application/zip' length 3058035948 bytes (2916.4 MB)
downloaded 2916.4 MB
trying URL 'https://storage.googleapis.com/torch-lantern-builds/refs/heads/master/latest/Windows-gpu-111.zip'
Content type 'application/zip' length 1780277 bytes (1.7 MB)
downloaded 1.7 MB
I have an NVIDIA Tesla P100 and it doesn't seem to be in use when I try to train, e.g., TabNet.
Running the code below, the CPU goes to ~90% but the GPU stays at 0-1%. It's just the same speed as on my laptop.
library(torch)
library(tabnet)
library(tidymodels)
library(tidyverse)
library(modeldata)

data(credit_data)
credit_data <- credit_data %>% drop_na()

model_spec_tabnet_tune <- tabnet(
  mode = "regression",
  epochs = 20
) %>%
  set_engine("torch", verbose = TRUE) %>%
  fit(Price ~ ., data = credit_data)
Does anyone have any idea why the GPU is not in use?
Note that the GPU is working fine for other algorithms like XGBoost.
Hello @vidarsumo,
The parsnip tabnet() function currently does not pass any device= argument to tabnet_config(), so the device defaults to "auto". https://github.com/mlverse/tabnet/blob/f4f815f43e017ab3b6169e730b74037397041033/R/parsnip.R#L241
device = "auto" chooses "cuda" when CUDA is available, so your code should use the GPU as soon as it is correctly set up.
Could you confirm that the GPU is used as expected outside of a workflow(), by using the native tabnet_fit(Price ~ ., data = credit_data, device = "cuda") function for your model, where you can pass device = "cuda" explicitly as a parameter?
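For completeness, a self-contained version of that check could look like the sketch below, reusing the data preparation from your code above (the epochs value is only illustrative):

```r
library(torch)
library(tabnet)
library(tidyverse)
library(modeldata)

data(credit_data)
credit_data <- credit_data %>% drop_na()

# Call the native tabnet_fit() directly, with an explicit device
fit <- tabnet_fit(Price ~ ., data = credit_data, epochs = 20, device = "cuda")
```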
If not, then your infrastructure setup needs some fixing; if yes, I'll take a deeper look at the parsnip machinery.
Hope it helps,
Hi @cregouby,
If I use the native tabnet_fit(), the GPU load is constantly at ~2% and sometimes jumps to 20-30% for a fraction of a second, then goes back down to ~2% or even 0%.
How can I find out if something on my side needs to be fixed? I followed every step in the setup guide (https://cran.r-project.org/web/packages/torch/vignettes/installation.html) and cuda_is_available() gives TRUE.
Ok, so that means that the GPU setup is correct and is correctly used by the native tabnet_fit(). But batch_size is likely the limiting factor: in such a high-end setup, you should feed the GPU with much more data, and you should be able to configure it to 500e3 or 5e6. For performance reasons, you should align the virtual_batch_size accordingly (like in #79). The limit is the GPU RAM, so the rule of thumb is to keep increasing it until you get a CUDA OOM error. After fixing this, maybe you will see the GPU being used even with the workflow tabnet() method.
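On the native call this could look like the sketch below; the batch_size / virtual_batch_size values are only illustrative and only make sense on a dataset much larger than the small credit_data example:

```r
# Same native call, but with much larger batches so the GPU stays busy.
# The values below are illustrative; raise them until you hit a CUDA OOM error.
fit <- tabnet_fit(
  Price ~ ., data = credit_data,    # substitute your own (much larger) dataset
  epochs = 20,
  device = "cuda",
  batch_size = 2^19,                # ~500e3, the order of magnitude suggested above
  virtual_batch_size = 2^17         # kept aligned with batch_size (see #79)
)
```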
Hope it helps
Ok, so I tried setting batch_size to 2^20 and virtual_batch_size to 2^18. The GPU went up to 100% for 1-2 seconds while the first epoch finished, then the GPU load dropped to 0% for a few minutes until it went back up to 100% for 1-2 seconds, and then down again to 0% for a few minutes.
So the GPU is working for sure.
Hi, can you please share how you managed to use 100% of your GPU, with a reproducible example? I followed all the steps in this issue and updated batch_size and virtual_batch_size accordingly, but I'm still only using ~0-2% of my GPU.
I'm trying out tabnet in R, which has a torch backend. I'm using tidymodels to tune a set of hyperparameters, but I'm not sure if it's using the GPU or not. This is on an Azure VM with CUDA 10.2, cuDNN 7.6 and an NVIDIA V100 GPU.
Running this code, the GPU load is around 0%, sometimes going to 5% but straight back down to 0%, while the CPU is > 80%. Is there anything I need to do for the GPU to be used? I also tried adding dev = "cuda:0" to set_engine(), but without success, i.e. GPU load is still mostly around 0%.