mvoelk / ssd_detectors

SSD-based object and text detection with Keras, SSD, DSOD, TextBoxes, SegLink, TextBoxes++, CRNN
MIT License

having problems with OOM #25

Open mustardlove opened 5 years ago

mustardlove commented 5 years ago

Hello, Mr. Volk. Thank you very much for your nice code! I have one question for you.

I'm new to deep learning, have only a basic understanding of Keras code, and am currently trying to run your DSOD_train.py. The problem is that I keep getting OOM errors while executing the "Train" section of the code (error message below).

I tried using only one of my two GPUs and enabling the 'allow_growth' option in TensorFlow, but neither worked. I believe I need to reduce the minibatch size (I guess your code uses a batch size of 128, am I right?), but I have no idea where to find the code to make this change. Just lowering batch_size = 26 to a smaller number didn't solve the problem, and searching your .py files left me with no clue.
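For reference, this is roughly what I tried for the 'allow_growth' part (a minimal sketch, assuming TensorFlow 1.x with standalone Keras; the GPU id '0' is just the one I wanted to use):

```python
# Sketch: restrict training to one GPU and let TensorFlow grow its memory
# allocation on demand instead of grabbing everything up front.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # expose only the first GPU

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory as needed
K.set_session(tf.Session(config=config))
```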
I'd really appreciate your help on my problem

By the way, I'm using Ubuntu 16.04 and the latest TensorFlow and Keras.

------------------------------------------------error message

```
ResourceExhaustedError                    Traceback (most recent call last)
in
     49     workers=1,
     50     #use_multiprocessing=False,
---> 51     initial_epoch=initial_epoch)

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89             warnings.warn('Update your `' + object_name + '` call to the ' +
     90                           'Keras 2 API: ' + signature, stacklevel=2)
---> 91         return func(*args, **kwargs)
     92     wrapper._original_function = func
     93     return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419
   1420     @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215             outs = model.train_on_batch(x, y,
    216                                         sample_weight=sample_weight,
--> 217                                         class_weight=class_weight)
    218
    219             outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713             return self._legacy_call(inputs)
   2714
-> 2715         return self._call(inputs)
   2716     else:
   2717         if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460             proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[loss_5/mul/_21899]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
```
mvoelk commented 5 years ago

In DSOD_train.ipynb, the batch size is actually 6 and the gradients get accumulated with AdamAccumulate for 128//6 batches before a gradient update is performed. This results in a virtual batch size of 126, but the log is updated after each batch.

Setting the batch size to 4 or even 2 should solve the issue. How large is your GPU memory?
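To illustrate the relationship between the per-step batch size and the virtual batch size, here is a minimal sketch; the AdamAccumulate import path and its accum_iters argument are assumptions, so check the notebook and the repo's training utilities for the exact names:

```python
# Sketch of the batch size / gradient accumulation setup described above.
# Import path and argument names are assumed, not taken from the notebook.
from utils.training import AdamAccumulate

batch_size = 4                    # lower this further (e.g. 2) if OOM errors persist
accum_iters = 128 // batch_size   # accumulate gradients to a virtual batch size of ~128

optimizer = AdamAccumulate(lr=1e-3, accum_iters=accum_iters)
# model.compile(optimizer=optimizer, loss=..., metrics=...)
```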

mustardlove commented 5 years ago

Thank you so much for your kind help! I changed the batch size of the 512 model to 4 and the training code is running!

I'm using two Titan Xp GPUs; the memory spec is as follows: memory speed 11.4 Gbps, standard memory config 12 GB GDDR5X, memory interface width 384-bit, memory bandwidth 547.7 GB/s.

Currently the execution is using only one GPU... I don't know why.

I have one more question!

In your data_coco.py, there is a convert_to_voc function. I'm only using the COCO dataset, so in DSOD_train I commented out the code related to the VOC dataset and did `gt_util_train = gt_util_coco.convert_to_voc()` and `gt_util_val = gt_util_coco_val.convert_to_voc()`. Does this make DSOD_train train on only 21 categories? I figured you only have 21 initial weights.

mvoelk commented 5 years ago

I've always used 1 GPU for training a model, but it should work with multiple GPUs as well. The documentation of Model.fit_generator() explains how to do this.

convert_to_voc in the COCO case returns a new GTUtility with COCO data, but with the 20 (21 including background) VOC classes, leading to a model with 21 categories.
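In code, the COCO-only setup from the question boils down to something like the sketch below; the GTUtility constructor arguments and the dataset path are assumptions and may differ from what data_coco.py actually expects:

```python
# Sketch: train on COCO data remapped to the VOC label set.
# Constructor arguments and paths are assumptions; see data_coco.py for the real API.
from data_coco import GTUtility

gt_util_coco = GTUtility('data/COCO/')
gt_util_coco_val = GTUtility('data/COCO/', validation=True)

# Remap the COCO annotations to the 20 VOC classes (+ background),
# so the resulting model has 21 output categories.
gt_util_train = gt_util_coco.convert_to_voc()
gt_util_val = gt_util_coco_val.convert_to_voc()
```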

The weights you mentioned are not trainable parameters... See #14 for more details.

mustardlove commented 5 years ago

Thank you for the reply!

I played with some parameters in fit_generator() (use_multiprocessing=True, workers=2), but still only one GPU was used.

I also tried using multi_gpu_model from keras.utils, but it failed with "_TfDeviceCaptureOp does not have method _set_device_from_string". I found that the class _TfDeviceCaptureOp in tensorflow/python/keras/backend.py does have _set_device_from_string, but the one in keras/backend/tensorflow_backend.py does not.

If anyone has solved this issue, please share your knowledge. Thank you!

mvoelk commented 5 years ago

Search for keras.utils.multi_gpu_model. use_multiprocessing=True and workers=2 refer to data loading, not to distributing the model across GPUs.
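A minimal sketch of the multi_gpu_model pattern is shown below; the tiny Sequential model is just a placeholder for the DSOD model built in the notebook, and whether it runs also depends on the Keras/TensorFlow version mismatch mentioned above:

```python
# Sketch of data-parallel training with keras.utils.multi_gpu_model (Keras 2.x).
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Placeholder model to show the pattern; in DSOD_train the model comes from the notebook.
model = Sequential([Dense(10, input_shape=(100,), activation='softmax')])

parallel_model = multi_gpu_model(model, gpus=2)  # replicate the model on 2 GPUs
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')

# workers / use_multiprocessing only parallelize data loading, not the model:
# parallel_model.fit_generator(gen_train, steps_per_epoch=..., epochs=...,
#                              workers=2, use_multiprocessing=True)
```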