Closed: fmannhardt closed this issue 6 years ago.
To help reproduce the issue, I tried running the 'mnist_mlp.R' example, which also uses a 'standard_gpu' instance. Although it does not fail, it shows the same error messages regarding the CUDA driver/library version mismatch:
I master-replica-0 2018-07-25 08:19:20.189041: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
I master-replica-0 2018-07-25 08:19:20.189094: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.46.0
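For reference, the example can be submitted to a 'standard_gpu' worker roughly as follows; this is a minimal sketch, and the exact configuration of my original run (script path, region, etc.) is an assumption:

```r
library(cloudml)

# Submit the training script to Cloud ML Engine on a single GPU worker.
# "mnist_mlp.R" and master_type = "standard_gpu" follow the description above.
cloudml_train("mnist_mlp.R", master_type = "standard_gpu")
```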
Here is more of the Cloud ML log around the relevant lines; maybe it helps to pinpoint the issue:
I master-replica-0 > x_train <- mnist$train$x
I master-replica-0 > y_train <- mnist$train$y
I master-replica-0 > x_test <- mnist$test$x
I master-replica-0 > y_test <- mnist$test$y
I master-replica-0 > x_train <- array_reshape(x_train, c(nrow(x_train),
I master-replica-0 + 784))
I master-replica-0 > x_test <- array_reshape(x_test, c(nrow(x_test), 784))
I master-replica-0 > x_train <- x_train/255
I master-replica-0 > x_test <- x_test/255
I master-replica-0 > y_train <- to_categorical(y_train, 10)
I master-replica-0 > y_test <- to_categorical(y_test, 10)
I master-replica-0 > model <- keras_model_sequential() %>% layer_dense(units = FLAGS$dense_units1,
I master-replica-0 + activation = "relu", input_shape = c(784)) %>% layer_dropout(ra .... [TRUNCATED]
I master-replica-0 > model %>% compile(loss = "categorical_crossentropy",
I master-replica-0 + optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
I master-replica-0 > model %>% fit(x_train, y_train, epochs = 20, batch_size = 128,
I master-replica-0 + validation_split = 0.2)
I master-replica-0 2018-07-25 08:19:20.188813: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I master-replica-0 2018-07-25 08:19:20.188938: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: cmle-training-master-50f73c6a29-0-6czsf
I master-replica-0 2018-07-25 08:19:20.188971: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: cmle-training-master-50f73c6a29-0-6czsf
I master-replica-0 2018-07-25 08:19:20.189041: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.111.0
I master-replica-0 2018-07-25 08:19:20.189094: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.46.0
I master-replica-0 Attempting refresh to obtain initial access_token
I master-replica-0 2018-07-25 08:19:20.189112: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 390.46.0Wed Jul 25 08:19:24 2018
I master-replica-0 +-----------------------------------------------------------------------------+
I master-replica-0 | NVIDIA-SMI 384.111 Driver Version: 390.46 |
I master-replica-0 |-------------------------------+----------------------+----------------------+
I master-replica-0 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
I master-replica-0 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
I master-replica-0 |===============================+======================+======================|
I master-replica-0 | 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
I master-replica-0 | N/A 32C P8 25W / 149W | 0MiB / 11441MiB | 0% Default |
I master-replica-0 +-------------------------------+----------------------+----------------------+
I master-replica-0
I master-replica-0 +-----------------------------------------------------------------------------+
I master-replica-0 | Processes: GPU Memory |
I master-replica-0 | GPU PID Type Process name Usage |
I master-replica-0 |=============================================================================|
I master-replica-0 | No running processes found |
I master-replica-0 +-----------------------------------------------------------------------------+
I master-replica-0 does not match DSO version 384.111.0 -- cannot find working devices in this configuration
I master-replica-0 Train on 48000 samples, validate on 12000 samples
I master-replica-0 Epoch 1/20
I master-replica-0 - 2s - loss: 0.6001 - acc: 0.8150 - val_loss: 0.2118 - val_acc: 0.9374
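For readability, the flagged model definition that appears truncated in the log above corresponds roughly to the sketch below. The flag name `dense_units1` is taken from the log, but the defaults, the dropout flag, and the dropout rate are assumptions for illustration only:

```r
library(keras)
library(tfruns)

# Hypothetical flag definitions; only "dense_units1" appears in the log,
# the defaults and the dropout flag are assumptions.
FLAGS <- flags(
  flag_integer("dense_units1", 128),
  flag_numeric("dropout1", 0.4)
)

# Rough reconstruction of the model definition truncated in the log.
model <- keras_model_sequential() %>%
  layer_dense(units = FLAGS$dense_units1, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = FLAGS$dropout1) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)
```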
These are lower-level CloudML GPU provisioning errors. I would take them to the normal CloudML support channels and see what the diagnosis is (please link any correspondence back here, though).
OK, thanks. I was not sure whether you are using some custom image that might cause the version mismatch between libcuda and the driver. I will report it to CloudML support.
No, there is no custom image for R; it uses the same base TF image as the rest of CloudML.
For reference, I reported it here: https://issuetracker.google.com/issues/111815849
The issue has been confirmed and resolved by Google. After updating my Google Cloud SDK, the GPU is now used properly.
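For anyone hitting the same mismatch: the fix reported above amounts to updating the Google Cloud SDK on the machine that submits the jobs. A minimal way to do that from R is sketched below; it assumes `gcloud` is already installed and on the PATH, and the optional check uses the TF 1.x API matching the TensorFlow version in the logs above:

```r
# Update the locally installed Google Cloud SDK (assumes `gcloud` is on the PATH).
system2("gcloud", c("components", "update"))

# Optional sanity check on a GPU machine: does TensorFlow see a GPU? (TF 1.x API)
library(tensorflow)
tf$test$is_gpu_available()
```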
I am trying to run TensorFlow via cloudml using the CuDNNLSTM layer, but the CUDA version does not seem to match on the image provided by R cloudml.
I get the following error messages:
Also afterwards:
At the beginning of the log, I see that a GPU is provisioned:
And
Is this a problem with the R cloudml package or with Google?