isl-org / Open3D-ML

An extension of Open3D to address 3D Machine Learning tasks

Difference in batch size available for PyTorch and Tensorflow on the same GPU #326

Open · numahha opened this issue 3 years ago

numahha commented 3 years ago

I tried both the PyTorch and TensorFlow versions of RandLANet on SemanticKITTI while varying the batch size. With PyTorch, training starts with batch size 5 but fails with batch size 6 due to a CUDA out-of-memory error. With TensorFlow, training starts with batch size 2 but fails with batch size 3 with "ResourceExhaustedError: OOM when allocating ...". So on the same GPU I can only use about half the batch size with TensorFlow. The code I use is:

import os
import open3d.ml as _ml3d
#import open3d.ml.torch as ml3d   # PyTorch version (swap with the line below)
import open3d.ml.tf as ml3d       # TensorFlow version
import pprint

# RandLA-Net / SemanticKITTI config shipped with Open3D-ML.
cfg_file = "ml3d/configs/randlanet_semantickitti.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)

model = ml3d.models.RandLANet(**cfg.model)

# Point the dataset config at the local SemanticKITTI root.
cfg.dataset['dataset_path'] = "./"
dataset = ml3d.datasets.SemanticKITTI(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="gpu", **cfg.pipeline)

# TensorBoard metadata expected by the pipeline.
pipeline.cfg_tb = {
    "readme": "readme",
    "cmd_line": "cmd_line",
    "dataset": pprint.pformat(cfg.dataset, indent=2),
    "model": pprint.pformat(cfg.model, indent=2),
    "pipeline": pprint.pformat(cfg.pipeline, indent=2),
}

pipeline.run_train()

I'd like to know where the problem lies and how to solve it.

Thanks.

sanskar107 commented 3 years ago

@numahha Thanks for reporting this. I am looking into this.

sanskar107 commented 3 years ago

@numahha A possible reason is that TensorFlow prefetches the data for the next iteration. It happens here in the code: https://github.com/isl-org/Open3D-ML/blob/master/ml3d/tf/dataloaders/tf_dataloader.py#L158. Could you try again after disabling it?
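
For context, a minimal tf.data sketch (a toy pipeline, not the Open3D-ML loader itself) of what prefetching does: with a prefetch() at the end of the pipeline, a background thread keeps up to buffer_size extra batches materialized while the current one is being consumed, so peak memory grows by roughly that many additional batches.

import tensorflow as tf

# Toy dataset standing in for the real point-cloud loader.
ds = tf.data.Dataset.range(1_000_000).batch(4)

# Without prefetch: only the batch currently requested is produced.
for _ in ds.take(2):
    pass

# With prefetch: up to `buffer_size` further batches are prepared ahead of
# the consumer, which overlaps loading and compute at the cost of extra memory.
ds_prefetched = ds.prefetch(buffer_size=1)
for _ in ds_prefetched.take(2):
    pass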

numahha commented 2 years ago

Thank you for the suggestion. I tried disabling it but got another error. This happened regardless of the batch size.

How I disabled it:

        if (self.model is None or 'batcher' not in self.model_cfg.keys() 
            #or self.model_cfg.batcher == 'DefaultBatcher'
            ):
            loader = loader.batch(batch_size)
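
(Note: the commented-out clause above also makes the loader skip loader.batch(batch_size) for models that use the default batcher, which matches the missing batch dimension in the ValueError further below. A hypothetical change that keeps batching and drops only the prefetch step, assuming the pipeline ends with a separate prefetch(...) call around line 158 as suggested above, would look roughly like:)

        if (self.model is None or 'batcher' not in self.model_cfg.keys()
                or self.model_cfg.batcher == 'DefaultBatcher'):
            loader = loader.batch(batch_size)   # keep batching enabled

        # loader = loader.prefetch(...)         # disable only the prefetching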

Error message when disabling it:

(o3dmltf) ub18@ub18-desktop:~/Open3D-ML$ python test_tf.py 
2021-12-03 10:20:26.671009: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:20:27.430333: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:20:27.967629: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-12-03 10:20:27.967769: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.968141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-12-03 10:20:27.968186: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:20:27.970059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-12-03 10:20:27.970112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-12-03 10:20:27.970766: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-12-03 10:20:27.970960: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-12-03 10:20:27.971430: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-12-03 10:20:27.972098: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-12-03 10:20:27.972220: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-12-03 10:20:27.972290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.972624: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.972929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-12-03 10:20:27.973160: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-03 10:20:27.973464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.973793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-12-03 10:20:27.973840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.974138: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:27.974429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-12-03 10:20:28.369102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-12-03 10:20:28.369144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-12-03 10:20:28.369166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-12-03 10:20:28.369281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:28.369645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:28.369947: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:20:28.370241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6222 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO - 2021-12-03 10:20:28,530 - semantic_segmentation - <open3d._ml3d.tf.models.randlanet.RandLANet object at 0x7fde282a6d90>
INFO - 2021-12-03 10:20:28,530 - semantic_segmentation - Logging in file : ./logs/RandLANet_SemanticKITTI_tf/log_train_2021-12-03_10:20:28.txt
INFO - 2021-12-03 10:20:28,559 - semantickitti - Found 19130 pointclouds for training
INFO - 2021-12-03 10:20:31,531 - semantickitti - Found 4071 pointclouds for validation
INFO - 2021-12-03 10:20:32,204 - semantic_segmentation - Writing summary in train_log/00015_RandLANet_SemanticKITTI_tf.
INFO - 2021-12-03 10:20:32,205 - semantic_segmentation - Initializing from scratch.
INFO - 2021-12-03 10:20:32,205 - semantic_segmentation - === EPOCH 0/100 ===
training:   0%|                                                                                                                                        | 0/19130 [00:00<?, ?it/s]2021-12-03 10:20:32.223916: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-03 10:20:32.242234: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3600000000 Hz
2021-12-03 10:20:32.368198: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-12-03 10:20:32.752362: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
training:   0%|                                                                                                                                        | 0/19130 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ub18/Open3D-ML/test_tf.py", line 23, in <module>
    pipeline.run_train()
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/pipelines/semantic_segmentation.py", line 250, in run_train
    results = model(inputs, training=True)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 259, in call
    f_encoder_i = self.forward_dilated_res_block(
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 223, in forward_dilated_res_block
    f_pc = m_conv2d(feature, training=self.training)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/utils/helper_tf.py", line 50, in call
    x = self.conv(x)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1013, in __call__
    input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/input_spec.py", line 230, in assert_input_compatibility
    raise ValueError('Input ' + str(input_index) + ' of layer ' +
ValueError: Input 0 of layer conv2d_1 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: (45056, 8, 1)

Error message when not disabling it:

(o3dmltf) ub18@ub18-desktop:~/Open3D-ML$ python test_tf.py 
2021-12-03 10:23:23.552434: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:23:24.305214: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:23:24.842576: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-12-03 10:23:24.842695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.843063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-12-03 10:23:24.843114: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-12-03 10:23:24.844961: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-12-03 10:23:24.845016: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-12-03 10:23:24.845716: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-12-03 10:23:24.845909: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-12-03 10:23:24.846345: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-12-03 10:23:24.846964: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-12-03 10:23:24.847090: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-12-03 10:23:24.847163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.847495: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.847794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-12-03 10:23:24.848037: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-03 10:23:24.848393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.848715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-12-03 10:23:24.848762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.849049: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:24.849317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-12-03 10:23:25.238891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-12-03 10:23:25.238923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-12-03 10:23:25.238943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-12-03 10:23:25.239057: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:25.239391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:25.239691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-03 10:23:25.239985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6228 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO - 2021-12-03 10:23:25,397 - semantic_segmentation - <open3d._ml3d.tf.models.randlanet.RandLANet object at 0x7f6061d11d90>
INFO - 2021-12-03 10:23:25,397 - semantic_segmentation - Logging in file : ./logs/RandLANet_SemanticKITTI_tf/log_train_2021-12-03_10:23:25.txt
INFO - 2021-12-03 10:23:25,426 - semantickitti - Found 19130 pointclouds for training
INFO - 2021-12-03 10:23:28,380 - semantickitti - Found 4071 pointclouds for validation
INFO - 2021-12-03 10:23:29,046 - semantic_segmentation - Writing summary in train_log/00016_RandLANet_SemanticKITTI_tf.
INFO - 2021-12-03 10:23:29,048 - semantic_segmentation - Initializing from scratch.
INFO - 2021-12-03 10:23:29,048 - semantic_segmentation - === EPOCH 0/100 ===
training:   0%|                                                                                                                                         | 0/4783 [00:00<?, ?it/s]2021-12-03 10:23:29.066488: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-03 10:23:29.085860: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3600000000 Hz
2021-12-03 10:23:29.460843: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-12-03 10:23:29.849283: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-12-03 10:23:29.948460: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-12-03 10:23:30.323331: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8201
2021-12-03 10:23:40.506214: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 88.00MiB (rounded to 92274688)requested by op BiasAdd
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2021-12-03 10:23:40.506317: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for GPU_0_bfc

...

2021-12-03 10:23:40.523610: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats: 
Limit:                      6531317760
InUse:                      6508513280
MaxInUse:                   6508513280
NumAllocs:                         487
MaxAllocSize:                184549376
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-12-03 10:23:40.523648: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-12-03 10:23:40.523770: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at bias_op.cc:331 : Resource exhausted: OOM when allocating tensor with shape[11264,16,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
training:   0%|                                                                                                                                         | 0/4783 [00:11<?, ?it/s]
Traceback (most recent call last):
  File "/home/ub18/Open3D-ML/test_tf.py", line 23, in <module>
    pipeline.run_train()
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/pipelines/semantic_segmentation.py", line 250, in run_train
    results = model(inputs, training=True)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 259, in call
    f_encoder_i = self.forward_dilated_res_block(
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 225, in forward_dilated_res_block
    f_pc = self.forward_building_block(xyz, f_pc, neigh_idx, name + 'LFA')
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 208, in forward_building_block
    f_pc_agg = self.forward_att_pooling(f_concat, name + 'att_pooling_1')
  File "/home/ub18/.local/lib/python3.9/site-packages/open3d/_ml3d/tf/models/randlanet.py", line 173, in forward_att_pooling
    att_activation = m_dense(f_reshaped)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/keras/layers/core.py", line 1253, in call
    outputs = nn_ops.bias_add(outputs, self.bias)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/ops/nn_ops.py", line 3377, in bias_add
    return gen_nn_ops.bias_add(
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 678, in bias_add
    _ops.raise_from_not_ok_status(e, name)
  File "/home/ub18/anaconda3/envs/o3dmltf/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[11264,16,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]
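
For completeness, a sketch of things that sometimes reduce TensorFlow GPU memory pressure (the TF_GPU_ALLOCATOR variable is the one suggested by the allocator warning above, and memory growth is the standard tf.config setting; neither is confirmed to close the batch-size gap with PyTorch here):

import os

# Must be set before TensorFlow initializes the GPU.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving most of it up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# And/or lower the pipeline batch size in the training script above,
# e.g. cfg.pipeline['batch_size'] = 2 before constructing the pipeline.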