rstudio-conf-2020 / dl-keras-tf

rstudio::conf(2020) deep learning workshop
Creative Commons Attribution Share Alike 4.0 International

Error running the cnn-train code chunk in 02-cats-vs-dogs.Rmd #10

Open · dpastling opened this issue 4 years ago

dpastling commented 4 years ago

When running the code chunk below from a fresh session, I get the following error:

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50,
  callbacks = callback_early_stopping(patience = 5)
)
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
2020-01-28 00:41:21.650047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-28 00:41:21.887585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-28 00:41:22.620623: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Error in py_call_impl(callable, dots$args, dots$keywords) :
  ResourceExhaustedError: OOM when allocating tensor with shape[6272,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node MatMul_3 (defined at /util/deprecation.py:324) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_1290]
Function call stack:
distributed_function

OmaymaS commented 4 years ago

Just adding a note that this issue is related to running the code on RStudio Server (including GPU).

dougmet commented 4 years ago

I can reproduce this problem. Investigating.

I think I'm running out of GPU memory, which wasn't a problem before. I do have two sessions running, but I'm not sure if that's relevant.
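By default, TensorFlow pre-allocates most of the GPU's memory per process, so two concurrent sessions can starve each other. A minimal sketch of one possible workaround, assuming the R tensorflow package and the standard TF 2.x memory-growth API (this is not from the workshop materials), to run before the model is built:

library(tensorflow)

# Allocate GPU memory on demand instead of grabbing nearly all of it
# up front; this must run before any operation touches the GPU.
gpus <- tf$config$experimental$list_physical_devices("GPU")
for (gpu in gpus) {
  tf$config$experimental$set_memory_growth(gpu, TRUE)
}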

dougmet commented 4 years ago

Dropping the batch size to 5 has got it moving again.
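For reference, the batch size lives where the generators are created. A minimal sketch, assuming the flow_images_from_directory() setup used in 02-cats-vs-dogs.Rmd; train_dir and train_datagen are placeholders standing in for the notebook's image directory and image_data_generator() object:

library(keras)

# train_dir and train_datagen are hypothetical placeholders for the
# notebook's directory path and data-augmentation generator.
train_generator <- flow_images_from_directory(
  train_dir,
  train_datagen,
  target_size = c(150, 150),
  batch_size = 5,        # reduced so activations fit in GPU memory
  class_mode = "binary"
)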

dpastling commented 4 years ago

I've tried dropping the batch size to 5, but I'm still getting errors. The code progresses through all 20 epochs, whereas it was stopping at the first epoch with the larger batch size.

> history <- 
+   model %>% 
+   fit_generator(
+     train_generator,
+     steps_per_epoch = 100,
+     epochs = 30,
+     validation_data = validation_generator,
+     validation_steps = 50,
+     callbacks = callback_early_stopping(patience = 5)
+   )
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
2020-01-29 00:01:20.648842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-29 00:01:20.827311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-29 00:01:21.513376: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 98.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

... snip ...

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Train for 100 steps, validate for 50 steps
Epoch 1/30
100/100 [==============================] - 7s 73ms/step - loss: 0.6969 - accuracy: 0.5080 - val_loss: 0.6818 - val_accuracy: 0.5480
2020-01-29 00:01:27.273011: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Epoch 2/30
100/100 [==============================] - 3s 32ms/step - loss: 0.6927 - accuracy: 0.5180 - val_loss: 0.6750 - val_accuracy: 0.5480
... W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-01-29 00:01:30.518277: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
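For what it's worth, the remaining "Error occurred when finalizing GeneratorDataset iterator: Cancelled" lines are logged at warning level (note the W prefix), and the log above shows training continuing past them. If the noise is unwanted, one option is the standard TF_CPP_MIN_LOG_LEVEL environment variable (a general TensorFlow setting, not specific to this repo), set before keras/tensorflow loads:

# "0" shows everything, "1" filters INFO, "2" filters INFO + WARNING,
# "3" filters INFO + WARNING + ERROR. Must be set before library(keras).
Sys.setenv(TF_CPP_MIN_LOG_LEVEL = "2")
library(keras)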