rstudio / tensorflow

TensorFlow for R
https://tensorflow.rstudio.com
Apache License 2.0

Mirrored strategy not working with generator #575

Closed mr-francois closed 11 months ago

mr-francois commented 1 year ago

Whenever I try to train a model with multiple GPUs and a mirrored strategy, training freezes at the first validation step. If I don't use validation data, training freezes after the last epoch.

library(magrittr)
library(keras)
library(tensorflow)
library(reticulate)

in_shape <- 100
out_shape <- 3
batch_size <- 16

# Factory returning a generator that always yields the same all-zero batch
generator_dummy <- function(batch_size, in_shape, out_shape) {
  x <- array(0, dim = c(batch_size, in_shape))
  y <- array(0, dim = c(batch_size, out_shape)) 
  function() {
    return(list(x, y))
  }
}

# Create and compile the model inside the MirroredStrategy scope
mirrored_strategy <- tensorflow::tf$distribute$MirroredStrategy()
with(mirrored_strategy$scope(), {

  model <- keras::keras_model_sequential(input_shape = in_shape) %>%
    layer_dense(32) %>% 
    layer_dense(out_shape)

  model %>% compile(loss = "mse",
                    optimizer = "adam")
})

gen_train <- generator_dummy(batch_size, in_shape, out_shape)
gen_val <- generator_dummy(batch_size, in_shape, out_shape)

history <-
  model %>% keras::fit(
    x = gen_train,
    validation_data = gen_val,
    steps_per_epoch = 10,
    validation_steps = 3,
    epochs = 4)

The equivalent Python code runs without problems on the same machine and conda environment.

These are my current settings:


> reticulate::py_config()
python:         ~/anaconda3/envs/env_tf_gpu/bin/python
libpython:      ~/anaconda3/envs/env_tf_gpu/lib/libpython3.7m.so
pythonhome:     ~/anaconda3/envs/env_tf_gpu:~/anaconda3/envs/env_tf_gpu
version:        3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21)  [GCC 9.4.0]
numpy:          ~/anaconda3/envs/env_tf_gpu/lib/python3.7/site-packages/numpy
numpy_version:  1.21.6

NOTE: Python version was forced by RETICULATE_PYTHON_ENV
> tensorflow::tf_config()
TensorFlow v2.11.0 ()
Python v3.7 (~/anaconda3/envs/env_tf_gpu/bin/python)
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: AlmaLinux 8.5 (Arctic Sphynx)

Matrix products: default
BLAS/LAPACK: ~/anaconda3/envs/env_tf_gpu/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] compiler_4.1.1    magrittr_2.0.3    Matrix_1.5-3      whisker_0.4.1
 [5] base64enc_0.1-3   withr_2.5.0       Rcpp_1.0.10       reticulate_1.28
 [9] tensorflow_2.11.0 grid_4.1.1        jsonlite_1.8.4    tfruns_1.5.1
[13] png_0.1-8         lattice_0.20-45

t-kalinowski commented 1 year ago

Hi @mr-francois.

It looks like there is a deadlock when tf$distribute$MirroredStrategy is used with an R generator to generate the training data batches. Fixing this will require some investigation.

Note, however, that a user-defined generator will generally be the bottleneck in the training pipeline. If you can define your training pipeline using {tfdatasets}, you'll see much greater performance.
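As an illustration, here is a minimal sketch of that approach, building the pipeline from in-memory dummy arrays with the same shapes as the reproducer above (the number of samples and the shuffle/prefetch settings are illustrative choices, not part of the original report):

library(tfdatasets)

# Dummy in-memory arrays with the same shapes as the reproducer
n_samples <- 160
x <- array(0, dim = c(n_samples, in_shape))
y <- array(0, dim = c(n_samples, out_shape))

# tf.data pipeline: slice into examples, shuffle, batch, and prefetch
ds_train <- tensor_slices_dataset(list(x, y)) %>%
  dataset_shuffle(buffer_size = n_samples) %>%
  dataset_batch(batch_size) %>%
  dataset_prefetch(buffer_size = tf$data$AUTOTUNE)

ds_val <- tensor_slices_dataset(list(x, y)) %>%
  dataset_batch(batch_size)

history <- model %>% keras::fit(
  ds_train,
  validation_data = ds_val,
  epochs = 4)

Because the whole input pipeline then runs inside TensorFlow, data preparation can overlap with training and the batches can be distributed across replicas.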

t-kalinowski commented 11 months ago

Fixed on main now.
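If you want to try the fix before it reaches CRAN, installing the development version should pick it up (this assumes the fix landed on the main branch of this repository):

# Install the development version from GitHub (assumption: the fix is on main)
remotes::install_github("rstudio/tensorflow")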