Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

GorkemP commented 4 years ago

System information

What is the top-level directory of the model you are using: models/research
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
TensorFlow installed from (source or binary): binary (using pip)
TensorFlow version (use command below): 1.14
Bazel version (if compiling from source):
CUDA/cuDNN version: 10.1 / 7.4.2
GPU model and memory: 2 RTX 2080 Super
Exact command to reproduce:

Describe the problem

I am running the TensorFlow Object Detection API on a custom dataset. The API is installed correctly and its test passes. The dataset has been converted to TFRecord format, and the pipeline configuration is OK.

Creating the cuDNN handle fails, which in turn produces the "Failed to get convolution algorithm" error.

From my search, there may be several causes:

1. session_config.gpu_options.allow_growth = True must be set on RTX cards. If this is the case, what is the proper way to apply this fix, and in which file (I guess trainer.py) should it be placed?
2. Using 2 GPUs may cause problems.
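
Both hypotheses above can be tried without editing any Object Detection API file by setting environment variables for the training process. The following is only a sketch under stated assumptions, not a confirmed fix: `gpu_env` is a hypothetical helper; `TF_FORCE_GPU_ALLOW_GROWTH` is an environment variable that, to my knowledge, recent TensorFlow 1.x releases treat as equivalent to `gpu_options.allow_growth = True` (worth verifying for the installed version); and `CUDA_VISIBLE_DEVICES` is the standard CUDA variable for restricting which GPUs a process can see.

```python
import os


def gpu_env(base_env=None, allow_growth=True, visible_devices=None):
    """Return a copy of the environment with GPU-related variables set.

    gpu_env is a hypothetical helper written for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    if allow_growth:
        # Assumption: the installed TensorFlow honours this variable and
        # grows GPU memory on demand instead of reserving it all up front.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    if visible_devices is not None:
        # Expose only the listed GPU indices to the process, e.g. [0]
        # to test whether the failure is related to using two GPUs.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in visible_devices)
    return env


# Example: memory growth enabled, only the first GPU visible.
single_gpu_env = gpu_env(base_env={}, visible_devices=[0])
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command starts training. If a code change is preferred, the in-code counterpart would be a `tf.ConfigProto` with `gpu_options.allow_growth = True` supplied as `session_config` to the `tf.estimator.RunConfig` that the training script builds, assuming the script is Estimator-based as the "Using config" line in the log suggests.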

Source code / logs

/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
np_resource = np.dtype([("resource", np.ubyte, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
np_resource = np.dtype([("resource", np.ubyte, 1)]) WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
https://github.com/tensorflow/io (for I/O related ops) If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/slim-0.1-py3.6.egg/nets/inception_resnet_v2.py:374: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/slim-0.1-py3.6.egg/nets/mobilenet/mobilenet.py:397: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From object_detection/model_main.py:109: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/utils/config_util.py:102: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W1224 10:32:11.309172 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/utils/config_util.py:102: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:628: The name tf.logging.warning is deprecated. Please use tf.compat.v1.logging.warning instead.

W1224 10:32:11.311061 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:628: The name tf.logging.warning is deprecated. Please use tf.compat.v1.logging.warning instead.

WARNING:tensorflow:Forced number of epochs for all eval validations to be 1. W1224 10:32:11.311156 140370471044928 model_lib.py:629] Forced number of epochs for all eval validations to be 1. WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/utils/config_util.py:488: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W1224 10:32:11.311218 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/utils/config_util.py:488: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Maybe overwriting train_steps: 50000 I1224 10:32:11.311259 140370471044928 config_util.py:488] Maybe overwriting train_steps: 50000 INFO:tensorflow:Maybe overwriting use_bfloat16: False I1224 10:32:11.311299 140370471044928 config_util.py:488] Maybe overwriting use_bfloat16: False INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: 1 I1224 10:32:11.311335 140370471044928 config_util.py:488] Maybe overwriting sample_1_of_n_eval_examples: 1 INFO:tensorflow:Maybe overwriting eval_num_epochs: 1 I1224 10:32:11.311372 140370471044928 config_util.py:488] Maybe overwriting eval_num_epochs: 1 INFO:tensorflow:Maybe overwriting load_pretrained: True I1224 10:32:11.311407 140370471044928 config_util.py:488] Maybe overwriting load_pretrained: True INFO:tensorflow:Ignoring config override key: load_pretrained I1224 10:32:11.311442 140370471044928 config_util.py:498] Ignoring config override key: load_pretrained WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1. W1224 10:32:11.311750 140370471044928 model_lib.py:645] Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1. 
INFO:tensorflow:create_estimator_and_inputs: use_tpu False, export_to_tpu False I1224 10:32:11.311802 140370471044928 model_lib.py:680] create_estimator_and_inputs: use_tpu False, export_to_tpu False INFO:tensorflow:Using config: {'_model_dir': 'ead/models/model_inception_resnet_v2_atrous', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7faa180b0048>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1} I1224 10:32:11.312070 140370471044928 estimator.py:209] Using config: {'_model_dir': 'ead/models/model_inception_resnet_v2_atrous', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7faa180b0048>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, 
'_num_worker_replicas': 1} WARNING:tensorflow:Estimator's model_fn (<function create_model_fn..model_fn at 0x7faa1809b0d0>) includes params argument, but params are not passed to Estimator. W1224 10:32:11.312211 140370471044928 model_fn.py:630] Estimator's model_fn (<function create_model_fn..model_fn at 0x7faa1809b0d0>) includes params argument, but params are not passed to Estimator. INFO:tensorflow:Not using Distribute Coordinator. I1224 10:32:11.312549 140370471044928 estimator_training.py:186] Not using Distribute Coordinator. INFO:tensorflow:Running training and evaluation locally (non-distributed). I1224 10:32:11.312638 140370471044928 training.py:612] Running training and evaluation locally (non-distributed). INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600. I1224 10:32:11.312759 140370471044928 training.py:700] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600. WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W1224 10:32:11.328015 140370471044928 deprecation.py:323] From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. 
Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/data_decoders/tf_example_decoder.py:182: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W1224 10:32:11.342888 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/data_decoders/tf_example_decoder.py:182: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/data_decoders/tf_example_decoder.py:197: The name tf.VarLenFeature is deprecated. Please use tf.io.VarLenFeature instead.

W1224 10:32:11.343116 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/data_decoders/tf_example_decoder.py:197: The name tf.VarLenFeature is deprecated. Please use tf.io.VarLenFeature instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:64: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W1224 10:32:11.356371 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:64: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards. W1224 10:32:11.357255 140370471044928 dataset_builder.py:72] num_readers has been reduced to 1 to match input file shards. WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.experimental.parallel_interleave(...). W1224 10:32:11.362099 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.experimental.parallel_interleave(...). WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_determinstic. W1224 10:32:11.362233 140370471044928 deprecation.py:323] From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_determinstic. 
WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:155: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map() W1224 10:32:11.377858 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:155: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.data.Dataset.map() WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/utils/ops.py:491: The name tf.is_nan is deprecated. Please use tf.math.is_nan instead.

W1224 10:32:11.489631 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/utils/ops.py:491: The name tf.is_nan is deprecated. Please use tf.math.is_nan instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/utils/ops.py:493: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W1224 10:32:11.492161 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/utils/ops.py:493: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/core/preprocessor.py:627: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W1224 10:32:11.520374 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/core/preprocessor.py:627: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/core/preprocessor.py:2689: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

W1224 10:32:11.550292 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/core/preprocessor.py:2689: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/inputs.py:168: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. W1224 10:32:11.640248 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/inputs.py:168: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:158: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.batch(..., drop_remainder=True). W1224 10:32:11.874983 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/builders/dataset_builder.py:158: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.batch(..., drop_remainder=True). INFO:tensorflow:Calling model_fn. I1224 10:32:11.884140 140370471044928 estimator.py:1145] Calling model_fn. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:11.898115 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:11.898225 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:18.416866 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:18.454785 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. 
INFO:tensorflow:depth of additional conv before box predictor: 0 I1224 10:32:18.455076 140370471044928 convolutional_box_predictor.py:151] depth of additional conv before box predictor: 0 WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/utils/spatial_transform_ops.py:419: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version. Instructions for updating: box_ind is deprecated, use box_indices instead W1224 10:32:19.056011 140370471044928 deprecation.py:506] From /home/ws2080/Desktop/codes/models/research/object_detection/utils/spatial_transform_ops.py:419: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version. Instructions for updating: box_ind is deprecated, use box_indices instead INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:19.066863 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:19.066964 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1634: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.flatten instead. W1224 10:32:20.122197 140370471044928 deprecation.py:323] From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1634: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.flatten instead. INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:20.234309 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. 
INFO:tensorflow:Scale of 0 disables regularizer. I1224 10:32:20.382894 140370471044928 regularizers.py:98] Scale of 0 disables regularizer. WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:177: The name tf.losses.huber_loss is deprecated. Please use tf.compat.v1.losses.huber_loss instead.

W1224 10:32:20.556293 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:177: The name tf.losses.huber_loss is deprecated. Please use tf.compat.v1.losses.huber_loss instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:183: The name tf.losses.Reduction is deprecated. Please use tf.compat.v1.losses.Reduction instead.

W1224 10:32:20.557175 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:183: The name tf.losses.Reduction is deprecated. Please use tf.compat.v1.losses.Reduction instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:350: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version. Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

W1224 10:32:20.588500 140370471044928 deprecation.py:323] From /home/ws2080/Desktop/codes/models/research/object_detection/core/losses.py:350: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version. Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:380: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

W1224 10:32:20.751060 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:380: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/builders/optimizer_builder.py:58: The name tf.train.MomentumOptimizer is deprecated. Please use tf.compat.v1.train.MomentumOptimizer instead.

W1224 10:32:20.756265 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/builders/optimizer_builder.py:58: The name tf.train.MomentumOptimizer is deprecated. Please use tf.compat.v1.train.MomentumOptimizer instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:408: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W1224 10:32:20.756437 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:408: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:515: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W1224 10:32:26.421885 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:515: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:520: The name tf.train.Scaffold is deprecated. Please use tf.compat.v1.train.Scaffold instead.

W1224 10:32:27.109842 140370471044928 deprecation_wrapper.py:119] From /home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py:520: The name tf.train.Scaffold is deprecated. Please use tf.compat.v1.train.Scaffold instead.

INFO:tensorflow:Done calling model_fn. I1224 10:32:27.110154 140370471044928 estimator.py:1147] Done calling model_fn. INFO:tensorflow:Create CheckpointSaverHook. I1224 10:32:27.111032 140370471044928 basic_session_run_hooks.py:541] Create CheckpointSaverHook. INFO:tensorflow:Graph was finalized. I1224 10:32:30.541843 140370471044928 monitored_session.py:240] Graph was finalized. 2019-12-24 10:32:30.542088: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2019-12-24 10:32:30.546518: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1 2019-12-24 10:32:30.891659: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x59eb5d0 executing computations on platform CUDA. Devices: 2019-12-24 10:32:30.891713: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5 2019-12-24 10:32:30.891736: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce RTX 2080, Compute Capability 7.5 2019-12-24 10:32:30.913691: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3700000000 Hz 2019-12-24 10:32:30.914949: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x97c0e80 executing computations on platform Host. 
Devices: 2019-12-24 10:32:30.914990: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): , 2019-12-24 10:32:30.916453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.86 pciBusID: 0000:17:00.0 2019-12-24 10:32:30.917590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.86 pciBusID: 0000:b3:00.0 2019-12-24 10:32:30.918012: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0 2019-12-24 10:32:30.920033: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0 2019-12-24 10:32:30.921862: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0 2019-12-24 10:32:30.922387: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0 2019-12-24 10:32:30.924782: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0 2019-12-24 10:32:30.926631: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0 2019-12-24 10:32:30.931489: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7 2019-12-24 10:32:30.934425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1 2019-12-24 10:32:30.934479: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0 2019-12-24 10:32:30.936606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-12-24 10:32:30.936626: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2019-12-24 10:32:30.936636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N 2019-12-24 10:32:30.936644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N 2019-12-24 10:32:30.938963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7468 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:17:00.0, compute capability: 7.5) 2019-12-24 10:32:30.940819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7174 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080, pci bus id: 0000:b3:00.0, compute capability: 7.5) WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. W1224 10:32:30.943990 140370471044928 deprecation.py:323] From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. 
INFO:tensorflow:Restoring parameters from ead/models/model_inception_resnet_v2_atrous/model.ckpt-0 I1224 10:32:30.945783 140370471044928 saver.py:1280] Restoring parameters from ead/models/model_inception_resnet_v2_atrous/model.ckpt-0 WARNING:tensorflow:From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. W1224 10:32:34.062770 140370471044928 deprecation.py:323] From /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. 2019-12-24 10:32:35.591353: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile. INFO:tensorflow:Running local_init_op. I1224 10:32:35.750933 140370471044928 session_manager.py:500] Running local_init_op. INFO:tensorflow:Done running local_init_op. I1224 10:32:36.082228 140370471044928 session_manager.py:502] Done running local_init_op. INFO:tensorflow:Saving checkpoints for 0 into ead/models/model_inception_resnet_v2_atrous/model.ckpt. I1224 10:32:46.263195 140370471044928 basic_session_run_hooks.py:606] Saving checkpoints for 0 into ead/models/model_inception_resnet_v2_atrous/model.ckpt. 
2019-12-24 10:33:02.056733: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0 2019-12-24 10:33:02.420746: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7 2019-12-24 10:33:03.245723: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 2019-12-24 10:33:03.260952: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR Traceback (most recent call last): File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call return fn(*args) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found. (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[{{node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D}}]] [[BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/Greater/_11873]] (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[{{node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D}}]] 0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "object_detection/model_main.py", line 109, in tf.app.run() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "object_detection/model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate return executor.run() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run return self.run_local() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local saving_listeners=saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in 
_train_with_estimatorspec , loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run run_metadata=run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run run_metadata=run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run raise six.reraise(original_exc_info) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/six.py", line 696, in reraise raise value File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run return self._sess.run(args, *kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1411, in run run_metadata=run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1169, in run return self._sess.run(args, **kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run run_metadata_ptr) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run feed_dict_tensor, options, run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run run_metadata) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found. 
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D (defined at /tmp/tmp_uy87mo1.py:12) ]] [[BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/Greater/_11873]] (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D (defined at /tmp/tmp_uy87mo1.py:12) ]] 0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation. Input Source operations connected to node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D: IteratorGetNext (defined at object_detection/model_main.py:105)
FirstStageFeatureExtractor/InceptionResnetV2/Conv2d_1a_3x3/weights/read (defined at /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/slim-0.1-py3.6.egg/nets/inception_resnet_v2.py:163)

Input Source operations connected to node FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D: IteratorGetNext (defined at object_detection/model_main.py:105)
FirstStageFeatureExtractor/InceptionResnetV2/Conv2d_1a_3x3/weights/read (defined at /home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/slim-0.1-py3.6.egg/nets/inception_resnet_v2.py:163)

Original stack trace for 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Conv2d_1a_3x3/Conv2D': File "object_detection/model_main.py", line 109, in tf.app.run() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "object_detection/model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate return executor.run() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run return self.run_local() File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local saving_listeners=saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default features, labels, ModeKeys.TRAIN, self.config) File 
"/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn model_fn_results = self._model_fn(features=features, kwargs) File "/home/ws2080/Desktop/codes/models/research/object_detection/model_lib.py", line 308, in model_fn features[fields.InputDataFields.true_image_shape]) File "/home/ws2080/Desktop/codes/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 836, in predict prediction_dict = self._predict_first_stage(preprocessed_inputs) File "/home/ws2080/Desktop/codes/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 890, in _predict_first_stage image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs) File "/home/ws2080/Desktop/codes/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1330, in _extract_rpn_feature_maps preprocessed_inputs) File "/home/ws2080/Desktop/codes/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1355, in _extract_proposal_features scope=self.first_stage_feature_extractor_scope)) File "/home/ws2080/Desktop/codes/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 169, in extract_proposal_features return self._extract_proposal_features(preprocessed_inputs, scope) File "/home/ws2080/Desktop/codes/models/research/object_detection/models/faster_rcnn_inception_resnet_v2_feature_extractor.py", line 113, in _extract_proposal_features align_feature_maps=True) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/slim-0.1-py3.6.egg/nets/inception_resnet_v2.py", line 163, in inception_resnet_v2_base scope='Conv2d_1a_3x3') File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args return func(*args, *current_args) File 
"/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1159, in convolution2d conv_dims=2) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args return func(args, current_args) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution outputs = layer.apply(inputs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply return self.call(inputs, *args, kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 537, in call outputs = super(Layer, self).call(inputs, *args, *kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in call outputs = call_fn(inputs, args, kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper ), args, kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 450, in converted_call result = converted_f(*effective_args, kwargs) File "/tmp/tmp_uy87mo1.py", line 12, in tfcall outputs = ag.converted_call('_convolution_op', self, ag.ConversionOptions(recursive=True, force_conversion=False, optional_features=(), internal_convert_user_code=True), (inputs, self.kernel), None) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 356, in converted_call return _call_unconverted(f, args, kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 255, 
in _call_unconverted return f(*args) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1079, in call return self.conv_op(inp, filter) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 635, in call return self.call(inp, filter) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 234, in call__ name=self.name) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d name=name) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d data_format=data_format, dilations=dilations, name=name) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, kwargs) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op op_def=op_def) File "/home/ws2080/environments/tensorflow_1_15/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in init self._traceback = tf_stack.extract_stack()

GorkemP commented 4 years ago

I solved the training issue by doing the following:

1) After

`session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)`

I added

 `session_config.gpu_options.allow_growth = True`

2) Instead of running object_detection/model_main.py as stated in the documentation, I ran object_detection/legacy/train.py directly.

After these modifications, training runs successfully. Now I want to evaluate on the evaluation dataset by running legacy/eval.py directly. Yet I again get the CUDNN_STATUS_INTERNAL_ERROR.

I think legacy/evaluator.py should be changed in the same way as above, but I cannot find where the session configuration is set.

GorkemP commented 4 years ago

I guess the session's GPU configuration from training is not written to the saved model/graph: training works when allow_growth=True is set (I think this is needed because of the RTX cards), yet when the saved model is loaded for evaluation, it fails as if allow_growth=True were not set. Is there a way to save this option in the saved model?
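
On the question above: session options such as allow_growth are passed when a session is created, and as far as I know they are not stored in checkpoints or exported graphs, so the evaluation process has to set them again. One way to do that without editing legacy/evaluator.py is the TF_FORCE_GPU_ALLOW_GROWTH environment variable. A minimal sketch, assuming the installed TensorFlow build honors that variable (worth verifying on 1.14); the helper name enable_gpu_memory_growth is made up for illustration.

```python
import os


def enable_gpu_memory_growth(env=None):
    """Request on-demand GPU memory allocation through an environment variable.

    Assumption: the installed TensorFlow build reads TF_FORCE_GPU_ALLOW_GROWTH.
    The variable must be set before TensorFlow initializes the GPU, so call
    this before importing tensorflow.
    """
    env = os.environ if env is None else env
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


enable_gpu_memory_growth()
# import tensorflow as tf  # import TensorFlow only after the variable is set
```

The same effect should be achievable from the shell by prefixing the evaluation command with TF_FORCE_GPU_ALLOW_GROWTH=true.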

Praveenk8051 commented 4 years ago

@GorkemP In which file is the session_config = tf.ConfigProto(...) line?

GorkemP commented 4 years ago

It is in research/object_detection/legacy/trainer.py

Praveenk8051 commented 4 years ago

Yes, I got that. Thank you. But I am still encountering the same issue. I tried running object_detection/legacy/train.py, and it shows the error below:

OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[3,3,64,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

May I know what your image dimensions were? Can you please tell me what else solved your issue? That would be very helpful. I am using an RTX 2070 Super with TF 1.14.0.

GorkemP commented 4 years ago

The image dimensions are 512x512. Maybe you can try decreasing the batch size; it looks like a memory issue.

Praveenk8051 commented 4 years ago

My images are 4K, 4032x3024. I guess I should reduce this.
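
To get a feel for the gap between the two image sizes, the sketch below computes the size of a single float32 input batch. It assumes RGB images fed at native resolution with 4 bytes per value; the image resizer in the pipeline config may well change the actual size, and the activations inside the network need far more memory than the input itself. The function name input_batch_mib is hypothetical.

```python
def input_batch_mib(width, height, channels=3, bytes_per_value=4, batch_size=1):
    """Return the size in MiB of one input batch stored as float32 values."""
    total_bytes = width * height * channels * bytes_per_value * batch_size
    return total_bytes / 2**20


small = input_batch_mib(512, 512)    # 3.0 MiB per image
large = input_batch_mib(4032, 3024)  # roughly 139.5 MiB per image

# Pixel count (and so, roughly, activation memory) grows by this factor.
print(round(large / small, 1))  # 46.5
```

Since every convolutional feature map scales with the pixel count, a factor of about 46 over the 512x512 case is consistent with the OOM error, even at batch size 1.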

Praveenk8051 commented 4 years ago

Did you run eval.py? I am getting the same error with it.

charming16 commented 4 years ago

@Praveenk8051 Same problem! Have you found a solution? :)

Praveenk8051 commented 4 years ago

@charming16 Happy to help you. There are two ways you can train the model: one using research/object_detection/legacy/trainer.py and the other using model_main.py.

I first executed model_main.py and got the error in the title. Then I executed trainer.py on a machine with a GPU, and it worked like a charm.

Then I executed eval.py on a machine with no GPU, and it worked. So basically I used a system with a GPU to train and a system without a GPU to evaluate.
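
Evaluating on a separate machine without a GPU can be approximated on the training machine by hiding the GPUs from CUDA before TensorFlow starts. A minimal sketch using the CUDA_VISIBLE_DEVICES environment variable; the helper name select_gpus is hypothetical, and "-1" is used as an invalid index so that no device stays visible.

```python
import os


def select_gpus(device_ids, env=None):
    """Restrict CUDA to the given GPU indices; an empty list hides all GPUs.

    Must run before TensorFlow (or any other CUDA library) initializes.
    """
    env = os.environ if env is None else env
    value = ",".join(str(i) for i in device_ids) if device_ids else "-1"
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([])  # CPU-only evaluation on a machine that has GPUs
# import tensorflow as tf  # import TensorFlow only after the variable is set
```

Calling select_gpus([0]) instead would expose only the first card, which is one way to check whether having two GPUs plays any role in the original error.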

Pavless commented 4 years ago

What fixed the problem for me was adding a few lines of code directly to model_main.py. I replaced this line

`config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)`

with

`session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)`
`session_config.gpu_options.allow_growth = True`
`session_config.gpu_options.per_process_gpu_memory_fraction = .8`
`config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)`

and everything works just fine without the legacy scripts, for both training and evaluation. My GPU is an RTX 2070 Super.

Note: In my case, the line session_config.gpu_options.per_process_gpu_memory_fraction = .8 was necessary for evaluation to run without an OUT_OF_MEMORY error.
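
The replacement described above can also be wrapped in a small helper. This is a sketch rather than the code from model_main.py: make_run_config and its parameters are names invented here, and tf.compat.v1.ConfigProto stands in for tf.ConfigProto on the assumption that the installed TensorFlow provides the compat.v1 alias and still ships tf.estimator. The 0.8 default is simply the value reported above.

```python
def make_run_config(model_dir, allow_growth=True, memory_fraction=0.8):
    """Build an Estimator RunConfig whose sessions use the given GPU options."""
    # Imported inside the function so the helper can be defined on machines
    # where TensorFlow is not installed.
    import tensorflow as tf

    session_config = tf.compat.v1.ConfigProto(
        allow_soft_placement=True, log_device_placement=False)
    # Allocate GPU memory on demand instead of reserving it all at start-up.
    session_config.gpu_options.allow_growth = allow_growth
    # Cap the share of each GPU's memory that this process may use.
    session_config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return tf.estimator.RunConfig(
        model_dir=model_dir, session_config=session_config)
```

In model_main.py the call would then look like config = make_run_config(FLAGS.model_dir). Because the Estimator creates its training and evaluation sessions from this RunConfig, both pick up the same GPU options.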

charming16 commented 4 years ago

I truly appreciate your timely help! I've trained my ssd-mobilenet model and it works perfectly after using legacy/train.py. @Praveenk8051

charming16 commented 4 years ago

Thank you so much @Pavless. The OUT_OF_MEMORY error has also been troubling me a lot. I'll give it a try later :)

kyscg commented 4 years ago

Closing this as the issue seems to be resolved! Feel free to reopen if anything else comes up.

tensorflow / models