Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[x ] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[ x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

environment:

following the instructions in the objection detection API Tutorial - some specifics, Ubuntu 20.04, CUDA 11.2 cdDNN 8.1.0, python 3.9, two (2) RTX 2080 TIs

reset50 640x640 trains perfectly

I trained ssd_resnet50_v1_fpn_640x640_coco17_tpu-8 successfully.

resnet101 assertion error

However, when I go to train any of the following: ssd_resnet101_v1_fpn_640x640_coco17_tpu-8 centernet_resnet50_v1_fpn_512x512_coco17_tpu-8

I get as assertion error: 2021-06-20 15:29:25.369972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 2021-06-20 15:29:26.405916: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1 2021-06-20 15:29:26.459184: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.459648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s 2021-06-20 15:29:26.459692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.460325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s 2021-06-20 15:29:26.460338: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 2021-06-20 15:29:26.461750: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11 2021-06-20 15:29:26.461776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11 2021-06-20 15:29:26.462242: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10 2021-06-20 15:29:26.462349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10 2021-06-20 15:29:26.462741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11 2021-06-20 15:29:26.463069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11 2021-06-20 15:29:26.463146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8 2021-06-20 15:29:26.463191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.463657: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.464113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.464574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.464999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1 2021-06-20 15:29:26.465217: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-06-20 15:29:26.627086: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.627495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s 2021-06-20 15:29:26.627553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.627937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s 2021-06-20 15:29:26.627980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.628391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.628796: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.629191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:26.629567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1 2021-06-20 15:29:26.629591: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 2021-06-20 15:29:27.070201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-06-20 15:29:27.070228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1 2021-06-20 15:29:27.070233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N N 2021-06-20 15:29:27.070236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: N N 2021-06-20 15:29:27.070425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.070913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.071350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.071770: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.072196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.072590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9101 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0b:00.0, compute capability: 7.5) 2021-06-20 15:29:27.072829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-06-20 15:29:27.073228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9648 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5) WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled. W0620 15:29:27.075164 140570248446784 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1') I0620 15:29:27.184911 140570248446784 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1') INFO:tensorflow:Maybe overwriting train_steps: None I0620 15:29:27.187189 140570248446784 config_util.py:552] Maybe overwriting train_steps: None INFO:tensorflow:Maybe overwriting use_bfloat16: False I0620 15:29:27.187254 140570248446784 config_util.py:552] Maybe overwriting use_bfloat16: False WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function W0620 15:29:27.198297 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function INFO:tensorflow:Reading unweighted datasets: ['/hsdata/tfrecord/train.record'] I0620 15:29:27.206228 140570248446784 dataset_builder.py:163] Reading unweighted datasets: ['/hsdata/tfrecord/train.record'] INFO:tensorflow:Reading record datasets for input file: ['/hsdata/tfrecord/train.record'] I0620 15:29:27.208479 140570248446784 dataset_builder.py:80] Reading record datasets for input file: ['/hsdata/tfrecord/train.record'] INFO:tensorflow:Number of filenames to read: 32 I0620 15:29:27.208534 140570248446784 dataset_builder.py:81] Number of filenames to read: 32 WARNING:tensorflow:num_readers has been reduced to 32 to match input file shards. W0620 15:29:27.208574 140570248446784 dataset_builder.py:87] num_readers has been reduced to 32 to match input file shards. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. W0620 15:29:27.209816 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map() W0620 15:29:27.221279 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.data.Dataset.map() WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. W0620 15:29:30.622751 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead. W0620 15:29:32.084338 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. W0620 15:29:32.914312 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 2021-06-20 15:29:34.039495: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2) 2021-06-20 15:29:34.060247: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3399645000 Hz /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/keras/backend.py:435: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model. warnings.warn('tf.keras.backend.set_learning_phase is deprecated and ' 2021-06-20 15:29:52.184731: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8 2021-06-20 15:29:52.452035: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100 2021-06-20 15:29:52.786849: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100 2021-06-20 15:29:52.940519: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11 2021-06-20 15:29:53.253534: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11 Traceback (most recent call last): File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 115, in tf.compat.v1.app.run() File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 303, in run _run_main(main, args) File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main sys.exit(main(argv)) File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 106, in main model_lib_v2.train_loop( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 599, in train_loop load_fine_tune_checkpoint( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 400, in load_fine_tune_checkpoint ckpt.restore( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/training/tracking/util.py", line 807, in assert_existing_objects_matched raise AssertionError( AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [MirroredVariable:{ 0: <tf.Variable 'conv4_block1_3_bn/beta:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>, 1: <tf.Variable 'conv4_block1_3_bn/beta/replica_1:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)> }, MirroredVariable:{ 0: <tf.Variable 'conv3_block2_1_conv/kernel:0' shape=(1, 1, 512, 128) dtype=float32, numpy= array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827, 0.0303973 , -0.00643146], [ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066, -0.0148216 , 0.05451383], [ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151, 0.007246 , 0.0079805 ], ..., [-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192, 0.00785731, -0.01917633], [-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 , -0.01637888, -0.02966581], [ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735, -0.00485154, 0.00121295]]]], dtype=float32)>, 1: <tf.Variable 'conv3_block2_1_conv/kernel/replica_1:0' shape=(1, 1, 512, 128) dtype=float32, numpy= array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827, 0.0303973 , -0.00643146], [ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066, -0.0148216 , 0.05451383], [ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151, 0.007246 , 0.0079805 ], ..., [-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192, 0.00785731, -0.01917633], [-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 , -0.01637888, -0.02966581], [ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735, -0.00485154, 0.00121295]]]], dtype=float32)> }, MirroredVariable:{ 0: <tf.Variable 'conv4_block19_2_bn/gamma:0' shape=(256,) dtype=float32, numpy= array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)>, 1: <tf.Variable 'conv4_block19_2_bn/gamma/replica_1:0' shape=(256,) dtype=float32, numpy= array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.

3. Steps to reproduce

Steps to reproduce the behavior. installed Object Detection API per tutorial

resnet50

config file

edited num_classes, pbtxt, path to data
trained, exported,
works perfectly

resnet101

config file

same edits
error on training

4. Expected behavior

A clear and concise description of what you expected to happen. expected resnet101 to train same as resnet50

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
Mobile device name if the issue happens on a mobile device:
TensorFlow installed from (source or binary): pip install tensorflow-gpu==2.5.0
TensorFlow version (use command below):
2021-06-20 15:59:01.002826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
Python version: Python 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] :: Anaconda, Inc. on linux
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.2 / 8.1.0
GPU model and memory: TWO RTX 2080 TI PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') 10985 MB PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU') 11019 MB

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | 0 N/A N/A 1605 G /usr/lib/xorg/Xorg 102MiB | | 0 N/A N/A 2305 G /usr/lib/xorg/Xorg 339MiB | | 0 N/A N/A 2439 G /usr/bin/gnome-shell 69MiB | | 0 N/A N/A 3126 G ...AAAAAAAAA= --shared-files 85MiB | | 1 N/A N/A 1605 G /usr/lib/xorg/Xorg 4MiB | | 1 N/A N/A 2305 G /usr/lib/xorg/Xorg 4MiB | +-----------------------------------------------------------------------------+

tensorflow / models

training resnet101 on two GPUs - AssertionError: Some Python objects were not bound to checkpointed values, #10083