google-research / deeplab2

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.
Apache License 2.0

Data loss: corrupted record at 0 #95

Closed · robonrrd closed this issue 2 years ago

robonrrd commented 2 years ago

Ubuntu 20
Python 3.8.10
TensorFlow 2.5
NVIDIA Titan RTX and NVIDIA Titan X
CUDA Version: 11.4

When I run the following command:

python deeplab2/trainer/train.py --config_file=./deeplab2/configs/cityscapes/panoptic_deeplab/resnet50_os32_merge_with_pure_tf_func.textproto --mode=eval --model_dir=. --num_gpus=1

I get the following error:

2022-03-25 15:25:18.774693: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0325 15:25:20.391919 140284223305536 train.py:65] Reading the config file.
I0325 15:25:20.395706 140284223305536 train.py:69] Starting the experiment.
2022-03-25 15:25:20.397703: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-03-25 15:25:20.449437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-25 15:25:20.450306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:03:00.0 name: NVIDIA TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2022-03-25 15:25:20.450328: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-03-25 15:25:20.453842: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-03-25 15:25:20.453883: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-03-25 15:25:20.455577: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-03-25 15:25:20.455783: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-03-25 15:25:20.456216: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-03-25 15:25:20.456952: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-03-25 15:25:20.457090: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-03-25 15:25:20.460701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-03-25 15:25:20.461012: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-25 15:25:20.743150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2022-03-25 15:25:20.743871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:03:00.0 name: NVIDIA TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2022-03-25 15:25:20.747358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2022-03-25 15:25:20.747408: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-03-25 15:25:21.602245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-25 15:25:21.602280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 
2022-03-25 15:25:21.602287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N N 
2022-03-25 15:25:21.602291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   N N 
2022-03-25 15:25:21.606291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21512 MB memory) -> physical GPU (device: 0, name: NVIDIA TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5)
2022-03-25 15:25:21.607156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11425 MB memory) -> physical GPU (device: 1, name: NVIDIA TITAN X (Pascal), pci bus id: 0000:03:00.0, compute capability: 6.1)
I0325 15:25:21.608190 140284223305536 train_lib.py:104] Using strategy <class 'tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy'> with 1 replicas
I0325 15:25:21.875683 140284223305536 deeplab.py:57] Synchronized Batchnorm is used.
I0325 15:25:21.877165 140284223305536 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 32, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0325 15:25:22.024067 140284223305536 deeplab.py:96] Setting pooling size to (33, 65)
I0325 15:25:22.024253 140284223305536 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:22.024336 140284223305536 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
2022-03-25 15:25:25.563853: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
I0325 15:25:25.565743 140284223305536 controller.py:391] restoring or initializing model...
restoring or initializing model...
WARNING:tensorflow:From /home/me/.local/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1315: NameBasedSaverStatus.__init__ (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
W0325 15:25:25.584475 140284223305536 deprecation.py:330] From /home/me/.local/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1315: NameBasedSaverStatus.__init__ (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
I0325 15:25:25.598557 140284223305536 controller.py:397] initialized model.
initialized model.
I0325 15:25:25.601044 140284223305536 controller.py:277]  eval | step:      0 | running complete evaluation...
 eval | step:      0 | running complete evaluation...
2022-03-25 15:25:25.957597: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2022-03-25 15:25:25.977684: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3299905000 Hz
I0325 15:25:27.446719 140284223305536 api.py:446] Eval with scales ListWrapper([1.0])
I0325 15:25:28.432155 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:28.461583 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:28.487152 140284223305536 api.py:446] Eval scale 1.0; setting pooling size to [33, 65]
I0325 15:25:32.265706 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:32.291409 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:From /home/me/.local/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0325 15:25:32.952551 140284223305536 deprecation.py:528] From /home/me/.local/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
I0325 15:25:34.675148 140284223305536 api.py:446] Eval with scales ListWrapper([1.0])
I0325 15:25:34.700171 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:34.724930 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:34.748688 140284223305536 api.py:446] Eval scale 1.0; setting pooling size to [33, 65]
I0325 15:25:35.657829 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0325 15:25:35.683211 140284223305536 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
2022-03-25 15:25:36.482976: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 16801800 exceeds 10% of free system memory.
2022-03-25 15:25:36.522768: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 16801800 exceeds 10% of free system memory.
2022-03-25 15:25:36.522816: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 16801800 exceeds 10% of free system memory.
2022-03-25 15:25:36.524742: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 16801800 exceeds 10% of free system memory.
2022-03-25 15:25:36.576277: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 16801800 exceeds 10% of free system memory.
2022-03-25 15:25:36.971095: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:808] layout failed: Invalid argument: Size of values 3 does not match size of permutation 4 @ fanin shape inDeepLab/PostProcessor/StatefulPartitionedCall/while/body/_85/while/SelectV2_1-1-TransposeNHWCToNCHW-LayoutOptimizer
Traceback (most recent call last):
  File "deeplab2/trainer/train.py", line 76, in <module>
    app.run(main)
  File "/home/me/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/me/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "deeplab2/trainer/train.py", line 71, in main
    train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
  File "/media/me/storage/src/github/STEP/deeplab2/trainer/train_lib.py", line 200, in run_experiment
    controller.evaluate(steps=config.evaluator_options.eval_steps)
  File "/home/me/src/github/STEP/models/orbit/controller.py", line 282, in evaluate
    eval_output = self.evaluator.evaluate(steps_tensor)
  File "/home/me/src/github/STEP/models/orbit/standard_runner.py", line 346, in evaluate
    outputs = self._eval_loop_fn(
  File "/home/me/src/github/STEP/models/orbit/utils/loop_fns.py", line 75, in loop_fn
    outputs = step_fn(iterator)
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/home/me/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.DataLossError: 2 root error(s) found.
  (0) Data loss:  corrupted record at 0
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNext]]
     [[DeepLab/PostProcessor/StatefulPartitionedCall/while/LoopCond/_110/_104]]
  (1) Data loss:  corrupted record at 0
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_eval_step_10512]

Function call stack:
eval_step -> eval_step
laurenf3395 commented 2 years ago

@robonrrd I am having the same issue. Were you able to resolve it?

aquariusjay commented 2 years ago

Please check whether you have successfully generated the TFRecords for your target dataset. At a minimum, verify that the TFRecord files actually contain readable records, for example with a quick script like the sketch below.
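
A minimal sketch for sanity-checking the generated shards (this script is not part of DeepLab2; the glob pattern is a hypothetical example and should be replaced with the file pattern your eval config points to):

```python
import glob
import tensorflow as tf

def check_tfrecords(pattern):
  """Iterate over every shard matching `pattern` and report corrupted files."""
  for path in sorted(glob.glob(pattern)):
    count = 0
    try:
      # Reading a corrupted or truncated shard raises DataLossError, the same
      # error reported by the eval job in this issue.
      for _ in tf.data.TFRecordDataset(path):
        count += 1
      print(f'{path}: OK, {count} records')
    except tf.errors.DataLossError as e:
      print(f'{path}: corrupted after {count} readable records ({e.message})')

# Hypothetical path: point this at the TFRecords referenced by your eval config.
check_tfrecords('cityscapes/tfrecord/val*.tfrecord')
```

If a shard was never written correctly (for example, the conversion step did not run to completion), the very first record already fails to parse, which matches the "corrupted record at 0" message above.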

aquariusjay commented 2 years ago

Closing the issue, as there has been no activity for a while.