srihari-humbarwadi / DeepLabV3_Plus-Tensorflow2.0

DeepLabV3+ implemented in TensorFlow2.0
https://arxiv.org/pdf/1802.02611.pdf

Out-of-Memory Error when training #8

Closed: ellick53 closed this issue 5 years ago

ellick53 commented 5 years ago

Hello,

Thanks for sharing your code! I have a dataset of 640x480 RGB PNG images with 640x480 masks containing only two classes: pixel value 0 for the background and pixel value 1 for the class I want to segment. Apart from that, there are only tiny changes, like setting H=480 and W=640 in get_image().

I commented out the random crop and set this at the top:

batch_size = 32
H, W = 480, 640
num_classes = 2
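
For reference, the loader change amounts to something roughly like this (just a sketch; the decode and normalization steps inside get_image() are assumptions based on the original function, not an exact copy):

import tensorflow as tf

def get_image(image_path, img_height=480, img_width=640, mask=False):
    # Read a 640x480 PNG; masks are single-channel with values 0 (background) and 1 (foreground).
    img = tf.io.read_file(image_path)
    if not mask:
        img = tf.image.decode_png(img, channels=3)
        img = tf.cast(img, tf.float32) / 127.5 - 1.0   # assumed normalization
        img = tf.image.resize(img, [img_height, img_width])
    else:
        img = tf.image.decode_png(img, channels=1)
        # Nearest-neighbour resizing keeps the mask values at exactly 0 or 1.
        img = tf.image.resize(img, [img_height, img_width], method='nearest')
        img = tf.cast(img, tf.uint8)
    return img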

For some reason I get OOM errors, even if I set batch_size to one:

TensorFlow 2.0.0-rc0
Found 5256 training images
Found 1276 validation images
2019-09-09 15:57:37.887994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-09-09 15:57:37.907360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:41:00.0
2019-09-09 15:57:37.907619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-09 15:57:37.908528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-09 15:57:37.909426: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-09 15:57:37.909652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-09 15:57:37.910978: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-09 15:57:37.912177: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-09 15:57:37.915010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-09 15:57:37.916366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-09-09 15:57:37.917002: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-09 15:57:37.943297: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399820000 Hz
2019-09-09 15:57:37.945404: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d61de838c0 executing computations on platform Host. Devices:
2019-09-09 15:57:37.945446: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-09-09 15:57:38.038525: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d61dee5cb0 executing computations on platform CUDA. Devices:
2019-09-09 15:57:38.038561: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-09-09 15:57:38.039321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:41:00.0
2019-09-09 15:57:38.039473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-09 15:57:38.039501: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-09 15:57:38.039528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-09 15:57:38.039553: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-09 15:57:38.039577: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-09 15:57:38.039601: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-09 15:57:38.039625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-09 15:57:38.041104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-09-09 15:57:38.041163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-09 15:57:38.042289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-09 15:57:38.042307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-09-09 15:57:38.042315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-09-09 15:57:38.043814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10003 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
<PrefetchDataset shapes: ((32, 480, 640, 3), (32, 480, 640, 1)), types: (tf.float32, tf.uint8)>
<PrefetchDataset shapes: ((32, 480, 640, 3), (32, 480, 640, 1)), types: (tf.float32, tf.uint8)>
2019-09-09 15:57:40.087335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:41:00.0
2019-09-09 15:57:40.087490: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-09 15:57:40.087520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-09 15:57:40.087544: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-09 15:57:40.087569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-09 15:57:40.087596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-09 15:57:40.087621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-09 15:57:40.087647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-09 15:57:40.089121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-09-09 15:57:40.089153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-09 15:57:40.089170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-09-09 15:57:40.089178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-09-09 15:57:40.090438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 10003 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
*** Building DeepLabv3Plus Network ***
/.../resnet/resnet50.py:265: UserWarning: The output shape of `ResNet50(include_top=False)` has been changed since Keras 2.2.0.
  warnings.warn('The output shape of `ResNet50(include_top=False)` '
*** Output_Shape => (None, 480, 640, 2) ***
Train for 164 steps, validate for 39 steps
Epoch 1/300
WARNING:tensorflow:From /environments/evo/lib/python3.7/site-packages/tensorflow_core/python/keras/layers/normalization.py:477: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2019-09-09 15:58:09.122851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-09 15:58:11.155451: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 691.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-09-09 15:58:11.155512: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.55GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

[...]

2019-09-09 15:56:00.522795: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2 Chunks of size 314572800 totalling 600.00MiB
2019-09-09 15:56:00.522803: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 629145600 totalling 5.27GiB
2019-09-09 15:56:00.522810: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 638353408 totalling 608.78MiB
2019-09-09 15:56:00.522818: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 9.56GiB
2019-09-09 15:56:00.522826: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 10485632000 memory_limit_: 10485632205 available bytes: 205 curr_region_allocation_bytes_: 20971264512
2019-09-09 15:56:00.522836: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 10485632205
InUse:                 10263396352
MaxInUse:              10485617408
NumAllocs:                    3477
MaxAllocSize:           3264741376

2019-09-09 15:56:00.522866: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************__
2019-09-09 15:56:00.522903: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at conv_ops.cc:501 : Resource exhausted: OOM when allocating tensor with shape[32,512,60,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2019-09-09 15:56:00.522954: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[32,512,60,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node DeepLabV3_Plus/res3a_branch2c/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[metrics/accuracy/div_no_nan/ReadVariableOp_1/_4]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

2019-09-09 15:56:00.523043: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[32,512,60,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node DeepLabV3_Plus/res3a_branch2c/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

2019-09-09 15:56:00.549836: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-09-09 15:56:00.550897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.0
  1/164 [..............................] - ETA: 1:24:55WARNING:tensorflow:Can save best model only with val_loss available, skipping.
2019-09-09 15:56:00.711411: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 0 memcpy records.
Traceback (most recent call last):
  File "/.vscode/extensions/ms-python.python-2019.9.34911/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/.vscode/extensions/ms-python.python-2019.9.34911/pythonFiles/lib/python/ptvsd/__main__.py", line 432, in main
    run()
  File "/.vscode/extensions/ms-python.python-2019.9.34911/pythonFiles/lib/python/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/usr/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/.../train.py", line 140, in <module>
    callbacks=callbacks)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 734, in fit
    use_multiprocessing=use_multiprocessing)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in __call__
    return self._stateless_fn(*args, **kwds)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1822, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/environments/evo/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[32,512,60,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node DeepLabV3_Plus/res3a_branch2c/Conv2D (defined at /environments/evo/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[metrics/accuracy/div_no_nan/ReadVariableOp_1/_4]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[32,512,60,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node DeepLabV3_Plus/res3a_branch2c/Conv2D (defined at /environments/evo/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_21997]

Could you help me? Thanks!

srihari-humbarwadi commented 5 years ago

Strange that it throws OOM with a batch size of 1. Can you try a lower resolution and make sure your GPU isn't being used by any other application? I trained the model on 1080 Ti cards as well, but on 3 of them, with 8 images per GPU at a resolution of 512x512. Also, if you are not distributing the training loop across multiple GPUs, remove the distribute strategy scope when you build the model. And if you are planning on binary segmentation, here is a better repository: https://github.com/srihari-humbarwadi/person_segmentation_tf2.0
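
For a single GPU, the model-building part would look roughly like this (just a sketch; I'm assuming the DeepLabV3Plus builder from this repo's deeplab.py and the compile settings from train.py, so adjust it to your actual script):

import tensorflow as tf
from deeplab import DeepLabV3Plus  # assumed: the model builder from this repo

H, W, num_classes = 512, 512, 2    # lower the resolution to shrink activation memory

# Optional: allocate GPU memory on demand instead of grabbing it all upfront,
# which makes it easier to see how much the model really needs.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Single GPU: build and compile directly, without wrapping this
# in "with strategy.scope():".
model = DeepLabV3Plus(H, W, num_classes)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])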

ellick53 commented 5 years ago

Indeed, it works at 480x320 with batch_size = 4, thanks. As I'm trying to do binary segmentation, I'm quite interested in the other repo. What changed, exactly? If I'm reading correctly, the only changes are to the learning rate and the data augmentation, right?

srihari-humbarwadi commented 5 years ago

@ellick53 the model is the same, but that repo has better data augmentation. The LR is something you may want to tune for your dataset. One more thing I'd like to add: please try running the code directly from the terminal (something like python train.py) rather than from within VS Code, and let me know if it still triggers the OOM.

ellick53 commented 5 years ago

Thanks, I'll let you know once I've done some tests. Really nice job, btw. Are you using a dice loss to deal with class imbalance?

srihari-humbarwadi commented 5 years ago

That's only a metric; it isn't included in the optimization.
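
To make the distinction concrete, a dice coefficient along these lines is only reported during training; turning it into a loss would be a separate change (a sketch, not the exact implementation in that repo):

import tensorflow as tf
from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1e-6):
    # Reported as a metric only; it does not drive the gradients unless it is
    # also wired in as (part of) the loss.
    y_true_f = K.flatten(K.cast(y_true, 'float32'))
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # If dice were meant to fight class imbalance during optimization,
    # it would have to appear here (or be mixed into the cross-entropy).
    return 1.0 - dice_coef(y_true, y_pred)

# Monitoring only, with binary crossentropy as the actual training loss:
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[dice_coef])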

ellick53 commented 5 years ago

I'm a bit confused about the changes regarding classes: do you consider the background as the 'ignore' class or as a class by itself? And why did you change the softmax into a sigmoid?

srihari-humbarwadi commented 5 years ago

@ellick53 Since there are only two classes, background and foreground, the binary segmentation model has a single output channel with a sigmoid activation, which keeps every pixel's value in [0, 1]. You can treat this value as the probability of the pixel belonging to the foreground class. Hope this helps!
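
In code terms the difference is just the output head, something like this (a sketch with a 1x1 convolution; the exact layer names and decoder wiring in either repo may differ):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Activation

def multiclass_head(x, num_classes):
    # Multi-class setup (this repo): one channel per class, softmax across channels.
    x = Conv2D(num_classes, (1, 1))(x)
    return Activation('softmax')(x)   # shape (batch, H, W, num_classes)

def binary_head(x):
    # Binary setup (person_segmentation_tf2.0 style): a single channel with sigmoid,
    # read as the probability that a pixel belongs to the foreground class.
    x = Conv2D(1, (1, 1))(x)
    return Activation('sigmoid')(x)   # shape (batch, H, W, 1), values in [0, 1]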