mlverse / torch

R Interface to Torch
https://torch.mlverse.org

num_workers in data_loader from torch does not seem to parallelize batch loading #1207

Open D-Maar opened 2 weeks ago

D-Maar commented 2 weeks ago

I want to train CNNs on a big dataset via transfer learning, using torch in R. Since my dataset is too big to be loaded all at once, I have to load each sample from the SSD in the dataloader. But loading one batch from my SSD takes about 5-10x as long as processing it (forward pass, backprop, optimizer step). Asynchronous, parallel data loading therefore seems advisable.
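Just to make the expectation explicit (a rough back-of-envelope, nothing measured beyond the 5-10x above): if loading a batch takes t_load ≈ 5-10 × t_compute and w workers prefetch in parallel, the wait per batch should drop to roughly max(t_load / w - t_compute, 0), so with 10-15 workers the loading time should be almost completely hidden behind the compute.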

As far as I understand torch, this can be done in the dataloader via the num_workers parameter. But using it did not decrease the per-batch loading time in the training loop; it only introduced a large overhead before the first batch is gathered (probably the point where the workers are created). Now I need advice on whether this can be done in torch and whether I implemented anything wrong.
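To check whether worker processes are used at all, I would expect something like the following toy check to print PIDs different from the main session's (just a sketch; pid_ds is a made-up dataset that only returns the PID of the R process serving each item):

library(torch)

# toy dataset: each item is just the PID of the process that produced it
pid_ds <- dataset(
  name = "pid_ds",
  initialize = function(n) self$n <- n,
  .getitem = function(i) torch_tensor(Sys.getpid()),
  .length = function() self$n
)

dl_check <- dataloader(pid_ds(32), batch_size = 8L, num_workers = 2L)
# with working parallel loading, these PIDs should differ from Sys.getpid()
coro::loop(for (b in dl_check) print(as_array(b)))
Sys.getpid()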

Example:

library(torchvision)
library(torch)

dl <- torchvision::image_folder_dataset(
  root = "./data/processed/satalite_images/to_use",
  loader = function(path) {
    # I have images of size 299x299 with 13 channels.
    # Optimizing this loading step yielded no significant improvement.
    return(array(readRDS(path), dim = c(13, 299, 299)) * 1.0)
  },
  target_transform = function(x) {
    a <- c(0.0, 1.0)[x]
    dim(a) <- 1
    return(a)
  }
)
# Here I set num_workers to different values, but that did not change the loading time
dl2 <- torch::dataloader(dl, batch_size = 110L, shuffle = TRUE, num_workers = 15L, pin_memory = TRUE)
# just an arbitrary pretrained model for transfer learning
model_torch <- torchvision::model_alexnet(pretrained = TRUE)
# freeze all pretrained parameters
model_torch$parameters |>
  purrr::walk(function(param) param$requires_grad_(FALSE))

# replacing the last layer with my desired classifier
inFeat <- model_torch$classifier$`6`$in_features
model_torch$classifier$`6` <- nn_linear(inFeat, out_features = 1L)

# I have 13 input channels, therefore I replace the first conv layer with an
# equivalent one that has 13 input channels
conv1 <- torch::nn_conv2d(
  in_channels  = 13L,
  out_channels = model_torch[[1]]$`0`$out_channels,
  kernel_size  = model_torch[[1]]$`0`$kernel_size,
  stride       = model_torch[[1]]$`0`$stride,
  padding      = model_torch[[1]]$`0`$padding,
  dilation     = model_torch[[1]]$`0`$dilation,
  groups       = model_torch[[1]]$`0`$groups,
  bias         = TRUE
)
model_torch[[1]]$`0` <- conv1

model_torch <- model_torch$to(device = "cuda")
opt <- optim_adam(params = model_torch$parameters, lr = 0.01)

# training loop
for (e in 1:1) {
  losses <- c()
  # store the time the loop spends on computing vs. data loading
  end <- Sys.time()
  coro::loop(
    for (batch in dl2) {
      start <- Sys.time()
      # this is the time it took to load the batch
      print(start - end)
      print("computing")
      opt$zero_grad()
      pred <- model_torch(batch[[1]]$to(device = "cuda"))
      res <- batch[[2]]$to(device = "cuda")
      loss <- nnf_binary_cross_entropy(input = torch_sigmoid(pred), target = res)
      loss$backward()
      opt$step()
      losses <- c(losses, loss$item())
      end <- Sys.time()
      # this is the time it took to process the batch
      print(end - start)
      print("loading")
    }
  )
}

To my understanding, the time it takes to load a batch should (after the first few batches) decrease significantly with parallel batch loading through num_workers, compared to num_workers = 0.

But the printed loading time stays the same no matter how many workers I use.
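If it helps, here is a minimal, self-contained comparison that should isolate the question from my data (just a sketch: slow_ds, the 0.1 s Sys.sleep() and the random 13x299x299 tensors are stand-ins for my real readRDS() loading):

library(torch)

# toy dataset whose only purpose is to be slow to load, like my real data
slow_ds <- dataset(
  name = "slow_ds",
  initialize = function(n) self$n <- n,
  .getitem = function(i) {
    Sys.sleep(0.1)  # stand-in for reading one sample from the SSD
    list(x = torch_randn(13, 299, 299), y = torch_tensor(1))
  },
  .length = function() self$n
)

time_epoch <- function(workers) {
  dl <- dataloader(slow_ds(64), batch_size = 8L, num_workers = workers)
  t0 <- Sys.time()
  coro::loop(for (b in dl) invisible(b$x$sum()))
  difftime(Sys.time(), t0, units = "secs")
}

time_epoch(0L)  # sequential loading
time_epoch(4L)  # should be noticeably faster if workers parallelize loading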

I would be glad if anyone could help me!