duffjay opened this issue 3 years ago
I re-created all of this on another machine with just one GPU (GTX 1050 Ti). Same results.
So evidently this isn't related to having two GPUs. Same results on: 2 x RTX 2080 Ti and 1 x GTX 1050 Ti.
Both setups run the same software: TF 2.5.0, CUDA 11.2, cuDNN 8.1, etc.
Is there a workaround for this problem? For example, can I train from scratch to avoid it? (If so, how do I train from scratch as opposed to from the checkpoint?)
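For reference, training from scratch amounts to removing (or blanking) the fine_tune_checkpoint entry in the pipeline config so that no restore is attempted. A minimal sketch of doing that programmatically, assuming the standard object_detection.utils.config_util helpers and placeholder paths:

from object_detection.utils import config_util

# Placeholder paths; point these at your own pipeline.config and output directory.
PIPELINE_CONFIG = 'models/ssd_resnet101_v1_fpn_640x640/pipeline.config'
OUTPUT_DIR = 'models/ssd_resnet101_v1_fpn_640x640_from_scratch'

# Load the config and blank the fine-tune checkpoint so training starts from
# randomly initialized weights instead of restoring the downloaded ckpt-0.
configs = config_util.get_configs_from_pipeline_file(PIPELINE_CONFIG)
configs['train_config'].fine_tune_checkpoint = ''

# Write the modified pipeline.config back out for model_main_tf2.py to pick up.
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, OUTPUT_DIR)

Simply deleting the fine_tune_checkpoint line from the config file by hand should have the same effect.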
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
2. Describe the bug
environment:
following the instructions in the Object Detection API tutorial - some specifics: Ubuntu 20.04, CUDA 11.2, cuDNN 8.1.0, Python 3.9, two (2) RTX 2080 Ti
resnet50 640x640 trains perfectly
I trained ssd_resnet50_v1_fpn_640x640_coco17_tpu-8 successfully.
resnet101 assertion error
However, the following models fail to train: ssd_resnet101_v1_fpn_640x640_coco17_tpu-8 and centernet_resnet50_v1_fpn_512x512_coco17_tpu-8.
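For context, the failure happens while model_main_tf2.py is restoring the fine-tune checkpoint. The sketch below is a rough simplification of what object_detection's load_fine_tune_checkpoint does (the library actually wires the checkpoint objects up via model.restore_from_objects, and the paths here are placeholders); it is only meant to show which step raises the assertion:

import tensorflow as tf
from object_detection.builders import model_builder
from object_detection.utils import config_util

# Placeholder paths for the resnet101 pipeline config and the downloaded checkpoint.
configs = config_util.get_configs_from_pipeline_file('pipeline.config')
model = model_builder.build(configs['model'], is_training=True)

# Run a dummy batch through the model so all variables exist before the restore,
# mirroring what the training loop does with a fake input.
images, shapes = model.preprocess(tf.zeros([1, 640, 640, 3]))
model.predict(images, shapes)

# The fine-tune restore then asserts that every model object was matched by a
# checkpointed value; a mismatch raises the AssertionError shown below.
ckpt = tf.train.Checkpoint(model=model)
status = ckpt.restore('ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0')
status.assert_existing_objects_matched()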
I get an assertion error:
2021-06-20 15:29:25.369972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:26.405916: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-20 15:29:26.459184: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.459648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.459692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.460325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.460338: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:26.461750: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-20 15:29:26.461776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-06-20 15:29:26.462242: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-20 15:29:26.462349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-20 15:29:26.462741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-06-20 15:29:26.463069: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-06-20 15:29:26.463146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-20 15:29:26.463191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.463657: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.464999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-20 15:29:26.465217: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-20 15:29:26.627086: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.627495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:0b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.73GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.627553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.627937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:0c:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.635GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-06-20 15:29:26.627980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.628391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.628796: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.629191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:26.629567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-20 15:29:26.629591: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-20 15:29:27.070201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-20 15:29:27.070228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
2021-06-20 15:29:27.070233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N N
2021-06-20 15:29:27.070236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: N N
2021-06-20 15:29:27.070425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.070913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.071350: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.071770: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.072196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.072590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9101 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0b:00.0, compute capability: 7.5)
2021-06-20 15:29:27.072829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-20 15:29:27.073228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 9648 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
W0620 15:29:27.075164 140570248446784 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
I0620 15:29:27.184911 140570248446784 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Maybe overwriting train_steps: None
I0620 15:29:27.187189 140570248446784 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0620 15:29:27.187254 140570248446784 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function
W0620 15:29:27.198297 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['/hsdata/tfrecord/train.record']
I0620 15:29:27.206228 140570248446784 dataset_builder.py:163] Reading unweighted datasets: ['/hsdata/tfrecord/train.record']
INFO:tensorflow:Reading record datasets for input file: ['/hsdata/tfrecord/train.record']
I0620 15:29:27.208479 140570248446784 dataset_builder.py:80] Reading record datasets for input file: ['/hsdata/tfrecord/train.record']
INFO:tensorflow:Number of filenames to read: 32
I0620 15:29:27.208534 140570248446784 dataset_builder.py:81] Number of filenames to read: 32
WARNING:tensorflow:num_readers has been reduced to 32 to match input file shards.
W0620 15:29:27.208574 140570248446784 dataset_builder.py:87] num_readers has been reduced to 32 to match input file shards.
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use
tf.compat.v1.app.run()
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 106, in main
model_lib_v2.train_loop(
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 599, in train_loop
load_fine_tune_checkpoint(
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 400, in load_fine_tune_checkpoint
ckpt.restore(
File "/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/training/tracking/util.py", line 807, in assert_existing_objects_matched
raise AssertionError(
AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [MirroredVariable:{
0: <tf.Variable 'conv4_block1_3_bn/beta:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>,
1: <tf.Variable 'conv4_block1_3_bn/beta/replica_1:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>
}, MirroredVariable:{
0: <tf.Variable 'conv3_block2_1_conv/kernel:0' shape=(1, 1, 512, 128) dtype=float32, numpy=
array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827,
0.0303973 , -0.00643146],
[ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066,
-0.0148216 , 0.05451383],
[ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151,
0.007246 , 0.0079805 ],
...,
[-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192,
0.00785731, -0.01917633],
[-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 ,
-0.01637888, -0.02966581],
[ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735,
-0.00485154, 0.00121295]]]], dtype=float32)>,
1: <tf.Variable 'conv3_block2_1_conv/kernel/replica_1:0' shape=(1, 1, 512, 128) dtype=float32, numpy=
array([[[[-0.01696381, -0.01267319, -0.03627143, ..., 0.00788827,
0.0303973 , -0.00643146],
[ 0.00642839, -0.02894838, 0.01284949, ..., -0.00388066,
-0.0148216 , 0.05451383],
[ 0.04492364, 0.02437586, -0.0175305 , ..., 0.02459151,
0.007246 , 0.0079805 ],
...,
[-0.02434192, -0.00606417, -0.0251104 , ..., 0.04526192,
0.00785731, -0.01917633],
[-0.00362424, -0.00965281, -0.05476727, ..., 0.0453129 ,
-0.01637888, -0.02966581],
[ 0.01875794, -0.001552 , -0.05092196, ..., -0.01918735,
-0.00485154, 0.00121295]]]], dtype=float32)>
}, MirroredVariable:{
0: <tf.Variable 'conv4_block19_2_bn/gamma:0' shape=(256,) dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1.], dtype=float32)>,
1: <tf.Variable 'conv4_block19_2_bn/gamma/replica_1:0' shape=(256,) dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.
tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
W0620 15:29:27.209816 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:101: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map()
W0620 15:29:27.221279 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:236: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0620 15:29:30.622751 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated. Use sample_distorted_bounding_box_v2 instead.
W0620 15:29:32.084338 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:206: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version. Instructions for updating: seed2 arg is deprecated. Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
W0620 15:29:32.914312 140570248446784 deprecation.py:330] From /home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
2021-06-20 15:29:34.039495: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-20 15:29:34.060247: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3399645000 Hz
/home/jay/anaconda3/envs/tf25/lib/python3.9/site-packages/tensorflow/python/keras/backend.py:435: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model.
warnings.warn('tf.keras.backend.set_learning_phase is deprecated and '
2021-06-20 15:29:52.184731: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-20 15:29:52.452035: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-20 15:29:52.786849: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-20 15:29:52.940519: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-20 15:29:53.253534: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
Traceback (most recent call last):
File "/project/tensorflow/workspace/training_demo/model_main_tf2.py", line 115, in <module>
3. Steps to reproduce
Steps to reproduce the behavior:
installed the Object Detection API per the tutorial
resnet50: config file
resnet101: config file (a checkpoint-inspection sketch follows this list)
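One quick way to inspect what the downloaded checkpoint actually contains, and to compare it against the variable names in the AssertionError above, is tf.train.list_variables. A small sketch, with a placeholder path for wherever the pretrained tarball was extracted:

import tensorflow as tf

# Placeholder path to the extracted pretrained model's checkpoint prefix.
CKPT = 'ssd_resnet101_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0'

# list_variables returns (name, shape) pairs stored in the checkpoint; any model
# variable without a counterpart here is what the restore assertion complains about.
for name, shape in tf.train.list_variables(CKPT):
    print(name, shape)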
4. Expected behavior
Expected resnet101 to train the same as resnet50.
5. Additional context
Include any logs that would be helpful to diagnose the problem.
6. System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
Mobile device name if the issue happens on a mobile device:
TensorFlow installed from (source or binary): pip install tensorflow-gpu==2.5.0
TensorFlow version (use command below):
2021-06-20 15:59:01.002826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
Python version: Python 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] :: Anaconda, Inc. on linux
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.2 / 8.1.0
GPU model and memory: two RTX 2080 Ti; PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') 10985 MB and PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU') 11019 MB (a query sketch follows the nvidia-smi output below)
$ nvidia-smi
Sun Jun 20 16:01:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0B:00.0  On |                  N/A |
|  0%   30C    P8    26W / 250W |    610MiB / 10985MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:0C:00.0 Off |                  N/A |
|  0%   30C    P8     1W / 260W |     10MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1605      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      2305      G   /usr/lib/xorg/Xorg                339MiB |
|    0   N/A  N/A      2439      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A      3126      G   ...AAAAAAAAA= --shared-files       85MiB |
|    1   N/A  N/A      1605      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2305      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
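For completeness, a small sketch of how the TensorFlow version and GPU details above can be queried from Python (tf.config.experimental.get_device_details is available in TF 2.5; treat this as illustrative rather than the exact commands used):

import tensorflow as tf

# Prints the git version and release, e.g. "v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0".
print(tf.version.GIT_VERSION, tf.version.VERSION)

# Enumerate visible GPUs; device details include the device name and, on CUDA
# builds, the compute capability.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu, details.get('device_name'), details.get('compute_capability'))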