rstudio / keras3

R Interface to Keras
https://keras3.posit.co/

Paperspace and Keras for R: very low performance issue on GPU #272

Closed · madpower2000 closed this 6 years ago

madpower2000 commented 6 years ago

Hi! I have run into a very strange low-performance issue on paperspace.com.

  1. I followed the step-by-step instructions from https://tensorflow.rstudio.com/tools/cloud_desktop_gpu.html
  2. Uploaded my model and data to paperspace.com
  3. Ran a simple LSTM demo model like this:
model <- keras_model_sequential() %>% 
  layer_lstm(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>% 
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)
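
The snippet above omits the training call; judging from the 177 steps per epoch in the logs below, it was presumably driven by something like the following (a sketch only; train_gen, val_gen, and val_steps stand in for data generators that are not shown here):

# Sketch of the training call that would produce the per-epoch timings below.
# train_gen, val_gen and val_steps are placeholders for the data generators.
history <- model %>% fit_generator(
  train_gen,
  steps_per_epoch = 177,
  epochs = 20,
  validation_data = val_gen,
  validation_steps = val_steps
)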

But performance on the GPU is more than 2x slower than on my very old Mac desktop CPU!

Run on GPU paperspace.com instance :

2018-02-04 11:23:43.696397: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-02-04 11:23:43.843085: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-04 11:23:43.843394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Quadro P4000 major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:00:05.0
totalMemory: 7.92GiB freeMemory: 7.58GiB
2018-02-04 11:23:43.843431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P4000, pci bus id: 0000:00:05.0, compute capability: 6.1)
177/177 [==============================] - 223s 1s/step - loss: 0.6681 - val_loss: 0.4896
Epoch 2/20
177/177 [==============================] - 219s 1s/step - loss: 0.6699 - val_loss: 0.4885

Run on local old CPU Mac:

Epoch 1/20
177/177 [==============================] - 102s 578ms/step - loss: 0.6701 - val_loss: 0.4890
Epoch 2/20
177/177 [==============================] - 102s 576ms/step - loss: 0.6652 - val_loss: 0.4899

As I understand from the diagnostic messages, the GPU was successfully detected by TensorFlow and Keras is using it, so why is performance so low?
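
One way to double-check from R that TensorFlow really does see the GPU (a sketch against the TF 1.x API that the logs above indicate):

library(tensorflow)
library(keras)

# Ask TensorFlow (1.x API) whether a GPU device is visible
tf$test$is_gpu_available()    # should return TRUE
tf$test$gpu_device_name()     # e.g. "/device:GPU:0"

# Optionally log every op's device placement to confirm the model runs on the GPU
sess <- tf$Session(config = tf$ConfigProto(log_device_placement = TRUE))
k_set_session(sess)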

mg64ve commented 6 years ago

Hi,

I have got almost the same results: absolutely poor performance on a P5000, slower than my Intel i5 without a GPU. I wrote an email to paperspace.com support and have not received a reply yet. Kindly let me know if you have any updates. Thanks.

jjallaire commented 6 years ago

I am not entirely certain whether an LSTM model would see a big speedup from a GPU (it certainly seems like it should but I don't have enough experience to say for sure). The script that I use to test GPU performance is the MNIST CNN sample script:

library(keras)

# Data Preparation -----------------------------------------------------

batch_size <- 128
num_classes <- 10
epochs <- 12

# Input image dimensions
img_rows <- 28
img_cols <- 28

# The data, shuffled and split between train and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

# Redefine dimensions of train/test inputs
x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)

# Scale pixel values into the [0,1] range
x_train <- x_train / 255
x_test <- x_test / 255

cat('x_train_shape:', dim(x_train), '\n')
cat(nrow(x_train), 'train samples\n')
cat(nrow(x_test), 'test samples\n')

# Convert class vectors to binary class matrices
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

# Define Model -----------------------------------------------------------

# Define model
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = num_classes, activation = 'softmax')

# Compile model
model %>% compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

# Train model
model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2
)

scores <- model %>% evaluate(
  x_test, y_test, verbose = 0
)

# Output metrics
cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')

This trains in 75 seconds per epoch on my relatively new MacBook Pro and in 12 seconds per epoch on my Paperspace VM. What kind of performance do you see for this script on your Paperspace VM?

For reference, here are details on the machine I am using:

(screenshot: Paperspace VM specs)

mg64ve commented 6 years ago

I am keen to know what the performance would be with an NVIDIA GTX 1080 installed locally. Do you know? In my case, if paperspace.com with a P5000 is 100% or 200% slower than my laptop, why should I use it?

madpower2000 commented 6 years ago

@jjallaire, your CNN demo runs blazingly fast on my Paperspace instance – just 6 seconds per epoch.

madpower2000 commented 6 years ago

@jjallaire, but again, the LSTM runs awfully slowly on the Paperspace instance!

I ran the demo from https://keras.rstudio.com/articles/examples/stateful_lstm.html
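
For reference, the core of that stateful LSTM example looks roughly like this (a sketch, not the verbatim article; tsteps, batch_size, the number of units, and x_train/y_train follow the linked example):

library(keras)

tsteps <- 1        # input window length used in the example
batch_size <- 25
epochs <- 25

# Stateful LSTM: hidden state carries over between batches,
# so the batch shape is fixed and shuffling must be disabled
model <- keras_model_sequential() %>%
  layer_lstm(units = 50, batch_input_shape = c(batch_size, tsteps, 1),
             return_sequences = TRUE, stateful = TRUE) %>%
  layer_lstm(units = 50, return_sequences = FALSE, stateful = TRUE) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "rmsprop")

# Train one epoch at a time so the internal state can be reset in between
for (i in 1:epochs) {
  model %>% fit(x_train, y_train, batch_size = batch_size,
                epochs = 1, shuffle = FALSE)
  model %>% reset_states()
}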

jjallaire commented 6 years ago

Interesting, I am seeing roughly the same thing on my instance (18 seconds per epoch on Paperspace and 5 seconds per epoch locally).

I wonder if this computation is more CPU-bound and is perhaps seeing degradation on Paperspace due to CPU resource sharing across VMs?

At least we are both seeing the same thing here and convolutional models are faster (as they definitely should be) on Paperspace.
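
One way to check that theory (just a suggestion, not something tried in this thread) is to watch GPU utilization while the model trains; if it stays low, the run is CPU-bound:

# From a second terminal or R session on the Paperspace VM:
# poll GPU utilization once per second while training runs (Ctrl+C to stop)
system("nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -l 1")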

madpower2000 commented 6 years ago

After a little googling, it looks like there is no speed advantage of GPU over CPU for LSTMs in Keras/TensorFlow:

https://stackoverflow.com/questions/41948406/why-is-my-gpu-slower-than-cpu-when-training-lstm-rnn-models

I'm curious whether this is an architectural bottleneck or a poor implementation; it would be interesting to compare with another backend.

But either way, an LSTM run on a GPU shouldn't be several times slower than on a CPU!

mg64ve commented 6 years ago

Apparently LSTMs can't be optimized well for the GPU because of their sequential processing.

jjallaire commented 6 years ago

Okay, good to know!

mg64ve commented 6 years ago

There are two dedicated functions in Keras:

https://keras.rstudio.com/reference/layer_cudnn_lstm.html
https://keras.rstudio.com/reference/layer_cudnn_gru.html

They use the cuDNN libraries. We should test them on paperspace.com!
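
For the simple model from the top of this thread, the swap would look something like this (a sketch; `data` is assumed to be the same matrix as in the original snippet, and layer_cudnn_lstm() only runs on a GPU with cuDNN and does not support masking):

library(keras)

# Same architecture as the original example, but with the cuDNN-backed layer
model <- keras_model_sequential() %>%
  layer_cudnn_lstm(units = 32, input_shape = list(NULL, dim(data)[[-1]])) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = "mae"
)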

jjallaire commented 6 years ago

Using layer_cudnn_lstm() I am seeing 8 seconds per epoch on Paperspace (compared to 14 seconds per epoch with layer_lstm()).

madpower2000 commented 6 years ago

Yes, @jjallaire, I confirm your result: I get the same 8 seconds per epoch on Paperspace using layer_cudnn_lstm(). But one more strange thing happened! As I wrote above, with layer_lstm() my result was 20 seconds per epoch, but this time I got 11 seconds with the same demo code on the same Paperspace instance, a spontaneous ~2x performance gain. Still, on my old desktop CPU the same demo code takes 7 seconds per epoch.

So, the bottom line for me: there is no performance advantage to using a GPU for this recurrent architecture, though it likely depends on model size and the backend implementation, so I am closing this issue.

Also, I think layer_cudnn_lstm() is worth mentioning in the code examples for recurrent architectures on the keras.rstudio.com site, because its existence is not obvious.

P.S. @jjallaire, thanks for the awesome presentation at RStudio::conf, I really enjoyed it!