tensorflow / models

Models and examples built with TensorFlow

training resnet101 on two GPUs - AssertionError: Some Python objects were not bound to checkpointed values #10083

Open duffjay opened 3 years ago

duffjay commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

environment:

Following the instructions in the Object Detection API tutorial. Specifics: Ubuntu 20.04, CUDA 11.2, cuDNN 8.1.0, Python 3.9, two (2) RTX 2080 Ti cards.
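
Before training I confirm that TF 2.5 actually sees both cards and that MirroredStrategy will run with two replicas, with a quick check along these lines (my own snippet, not part of the tutorial):

```python
# Sanity check (my own snippet, not from the tutorial): confirm that TF 2.5
# sees both RTX 2080 Ti cards and that MirroredStrategy picks up two replicas.
import tensorflow as tf

print("physical GPUs:", tf.config.list_physical_devices("GPU"))  # expect 2 entries

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)        # expect 2
```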

resnet50 640x640 trains perfectly

I trained ssd_resnet50_v1_fpn_640x640_coco17_tpu-8 successfully.

resnet101 assertion error

However, when I go to train either of the following:
ssd_resnet101_v1_fpn_640x640_coco17_tpu-8
centernet_resnet50_v1_fpn_512x512_coco17_tpu-8

I get an assertion error:

2021-06-20 15:29:25.369972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:26.405916: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-20 15:29:26.459184: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.459648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.459692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.460325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.460338: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:26.461750: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-20 15:29:26.461776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-06-20 15:29:26.462242: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-20 15:29:26.462349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-20 15:29:26.462741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-06-20 15:29:26.463069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-06-20 15:29:26.463146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-20 15:29:26.463191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.463657: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-20 15:29:26.465217: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-20 15:29:26.627086: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.627495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.627553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.627937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.627980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.628391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.628796: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.629191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.629567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-20 15:29:26.629591: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:27.070201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-20 15:29:27.070228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2021-06-20 15:29:27.070233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N N
2021-06-20 15:29:27.070236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: N N
2021-06-20 15:29:27.070425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.070913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.071350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.071770: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.072196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.072590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9101 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0b:00.0, compute capability: 7.5)
2021-06-20 15:29:27.072829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.073228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9648 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
W0620 15:29:27.075164 140570248446784 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1') I0620 15:29:27.184911 140570248446784 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1') INFO:tensorflow:Maybe overwriting train_steps: None I0620 15:29:27.187189 140570248446784 config_util.py:552] Maybe overwriting train_steps: None INFO:tensorflow:Maybe overwriting use_bfloat16: False I0620 15:29:27.187254 140570248446784 config_util.py:552] Maybe overwriting use_bfloat16: False WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function W0620 15:29:27.198297 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function INFO:tensorflow:Reading unweighted datasets: ['/hsdata/tfrecord/train.record'] I0620 15:29:27.206228 140570248446784 dataset_builder.py:163] Reading unweighted datasets: ['/hsdata/tfrecord/train.record'] INFO:tensorflow:Reading record datasets for input file: ['/hsdata/tfrecord/train.record'] I0620 15:29:27.208479 140570248446784 dataset_builder.py:80] Reading record datasets for input file: ['/hsdata/tfrecord/train.record'] INFO:tensorflow:Number of filenames to read: 32 I0620 15:29:27.208534 140570248446784 dataset_builder.py:81] Number of filenames to read: 32 WARNING:tensorflow:num_readers has been reduced to 32 to match input file shards. W0620 15:29:27.208574 140570248446784 dataset_builder.py:87] num_readers has been reduced to 32 to match input file shards. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. W0620 15:29:27.209816 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use tf.data.Dataset.map() W0620 15:29:27.221279 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.data.Dataset.map() WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. W0620 15:29:30.622751 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead. W0620 15:29:32.084338 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead. WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. W0620 15:29:32.914312 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 2021-06-20 15:29:34.039495: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2) 2021-06-20 15:29:34.060247: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3399645000 Hz /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/keras/backend.py:435: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model. 
warnings.warn('tf.keras.backend.set_learning_phase is deprecated and ' 2021-06-20 15:29:52.184731: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8 2021-06-20 15:29:52.452035: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100 2021-06-20 15:29:52.786849: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100 2021-06-20 15:29:52.940519: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11 2021-06-20 15:29:53.253534: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11 Traceback (most recent call last): File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 115, in tf.compat.v1.app.run() File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 303, in run _run_main(main, args) File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main sys.exit(main(argv)) File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 106, in main model_lib_v2.train_loop( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 599, in train_loop load_fine_tune_checkpoint( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 400, in load_fine_tune_checkpoint ckpt.restore( File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/training/tracking/util.py", line 807, in assert_existing_objects_matched raise AssertionError( AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [MirroredVariable:{ 0: <tf.Variable 'conv4_block1_3_bn/beta:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>, 1: <tf.Variable 'conv4_block1_3_bn/beta/replica_1:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)> }, MirroredVariable:{ 0: <tf.Variable 'conv3_block2_1_conv/kernel:0' shape=(1, 1, 512, 128) dtype=float32, numpy= array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827, 0.0303973 , -0.00643146], [ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066, -0.0148216 , 0.05451383], [ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151, 0.007246 , 0.0079805 ], ..., [-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192, 0.00785731, -0.01917633], [-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 , -0.01637888, -0.02966581], [ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735, -0.00485154, 0.00121295]]]], dtype=float32)>, 1: <tf.Variable 'conv3_block2_1_conv/kernel/replica_1:0' shape=(1, 1, 512, 128) dtype=float32, numpy= array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827, 0.0303973 , -0.00643146], [ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066, -0.0148216 , 0.05451383], [ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151, 0.007246 , 0.0079805 ], ..., [-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192, 0.00785731, -0.01917633], [-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 , -0.01637888, -0.02966581], [ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735, -0.00485154, 0.00121295]]]], 
dtype=float32)> }, MirroredVariable:{ 0: <tf.Variable 'conv4_block19_2_bn/gamma:0' shape=(256,) dtype=float32, numpy= array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)>, 1: <tf.Variable 'conv4_block19_2_bn/gamma/replica_1:0' shape=(256,) dtype=float32, numpy= array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.
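
To poke at the failing check outside the training loop, I use a small probe that mirrors, as far as I can tell, what load_fine_tune_checkpoint in model_lib_v2.py does. The paths are placeholders for my local layout, and fine_tune_checkpoint_type should match whatever the pipeline config sets:

```python
# Rough probe of the failing restore, outside model_main_tf2.py.
# PIPELINE_CONFIG / CHECKPOINT_PATH are placeholders for my local paths.
import tensorflow as tf
from object_detection.builders import model_builder
from object_detection.utils import config_util

PIPELINE_CONFIG = ".../ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/pipeline.config"
CHECKPOINT_PATH = ".../ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"

configs = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG)
model = model_builder.build(model_config=configs["model"], is_training=True)

# Run a dummy batch through the model so all variables get created,
# roughly what _ensure_model_is_built does inside model_lib_v2.py.
images, shapes = model.preprocess(tf.zeros([1, 640, 640, 3]))
model.predict(images, shapes)

# Restore the way load_fine_tune_checkpoint appears to: build a Checkpoint from
# the model's restore objects, then assert that everything was matched.
restore_objects = model.restore_from_objects(
    fine_tune_checkpoint_type="detection")  # set to the value in my pipeline.config
ckpt = tf.train.Checkpoint(**restore_objects)
status = ckpt.restore(CHECKPOINT_PATH)
status.assert_existing_objects_matched()  # this is the call that raises above
```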

3. Steps to reproduce

Steps to reproduce the behavior: installed the Object Detection API per the tutorial, then launched training with the following configs (the launch is sketched after the config references).

resnet50

config file

resnet101

config file
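
Training is launched with model_main_tf2.py; the call below is a rough Python equivalent of that invocation (same as passing --pipeline_config_path and --model_dir on the command line; paths are placeholders for my local layout):

```python
# Rough equivalent of my training invocation of model_main_tf2.py
# (python model_main_tf2.py --pipeline_config_path=... --model_dir=...).
# Paths are placeholders for my local directory layout.
import tensorflow as tf
from object_detection import model_lib_v2

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model_lib_v2.train_loop(
        pipeline_config_path="models/ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/pipeline.config",
        model_dir="models/my_ssd_resnet101",
        train_steps=None,
        use_tpu=False)
```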

4. Expected behavior

Expected resnet101 to train the same way resnet50 does.

5. Additional context

The full startup log and traceback are included under "2. Describe the bug" above.

6. System information

$ nvidia-smi
Sun Jun 20 16:01:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0B:00.0  On |                  N/A |
|  0%   30C    P8    26W / 250W |    610MiB / 10985MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:0C:00.0 Off |                  N/A |
|  0%   30C    P8     1W / 260W |     10MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1605      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      2305      G   /usr/lib/xorg/Xorg                339MiB |
|    0   N/A  N/A      2439      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A      3126      G   ...AAAAAAAAA= --shared-files       85MiB |
|    1   N/A  N/A      1605      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2305      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

duffjay commented 3 years ago

I re-created all of this on another machine with just one GPU (GTX 1050 Ti). Same results:

So evidently, this isn't related to two GPUs. Same results on:
2 x RTX 2080 Ti
1 x GTX 1050 Ti

Both setups run the same software stack: TF 2.5.0, CUDA 11.2, cuDNN 8.1, etc.
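
This is how I compared the stacks on the two machines (the build-info keys are read defensively in case they differ between TF builds):

```python
# Check that both machines report the same TF / CUDA / cuDNN versions.
# tf.sysconfig.get_build_info() keys are read with .get() just in case.
import tensorflow as tf

info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)              # 2.5.0 on both machines
print("CUDA:", info.get("cuda_version"))  # 11.2
print("cuDNN:", info.get("cudnn_version"))  # 8.1
```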

duffjay commented 3 years ago

Is there a workaround for this problem? For example, can I train from scratch to avoid it? (If so, how do I train from scratch as opposed to starting from the checkpoint? See the sketch below for what I mean.)
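
To make the question concrete: as far as I can tell from model_lib_v2.py, load_fine_tune_checkpoint only runs when train_config.fine_tune_checkpoint is set, so writing out a copy of the pipeline config with that field cleared (paths below are placeholders) should skip the failing restore entirely. Is that the right way to "train from scratch"?

```python
# Sketch of what I mean by "from scratch" (paths are placeholders):
# save a copy of the pipeline config with fine_tune_checkpoint cleared,
# so train_loop should never reach load_fine_tune_checkpoint.
from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file(
    "models/ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/pipeline.config")
configs["train_config"].fine_tune_checkpoint = ""

pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "models/my_ssd_resnet101_scratch")
```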