Closed: prabal27 closed this issue 2 years ago.
Hi @prabal27,
Thanks for reporting the issue. The provided resnet50_os32_semseg.textproto in fact still uses the panoptic segmentation dataset, but disables the instance segmentation prediction branch in Panoptic-DeepLab. Therefore, you should set --create_panoptic_data=true when converting the Cityscapes dataset.
Cheers,
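For reference, the conversion command with panoptic data enabled would look roughly like this (script and flag names as in deeplab2's data/build_cityscapes_data.py; ${CITYSCAPES_ROOT} and ${OUTPUT_DIR} are placeholder paths):

python deeplab2/data/build_cityscapes_data.py --cityscapes_root=${CITYSCAPES_ROOT} --output_dir=${OUTPUT_DIR} --create_panoptic_data=true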
Hi @aquariusjay,
That worked, thanks a lot!!
Hi @prabal27,
Glad to know it works. Closing the issue. Feel free to reopen it, if you encounter any other issues.
Cheers,
System Details:
OS: Ubuntu 20.04
Python: 3.9.12
TensorFlow: 2.6.0
Note: I prepared the Cityscapes dataset for semantic segmentation, as advised, using --create_panoptic_data=false.
After running the following command:
python trainer/train.py --config_file=configs/cityscapes/panoptic_deeplab/resnet50_os32_semseg.textproto --mode=train --model_dir=Model_Output/ --num_gpus=1
I get the following error log:
2022-06-08 14:43:13.546171: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.550430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.550525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0608 14:43:13.550683 140528093734080 train.py:65] Reading the config file.
I0608 14:43:13.551881 140528093734080 train.py:69] Starting the experiment.
2022-06-08 14:43:13.552163: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-08 14:43:13.552690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.552804: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.552888: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.832407: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.832527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.832605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 14:43:13.832686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14134 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3080 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
I0608 14:43:13.833489 140528093734080 train_lib.py:104] Using strategy <class 'tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy'> with 1 replicas
I0608 14:43:13.935532 140528093734080 deeplab.py:57] Synchronized Batchnorm is used.
I0608 14:43:13.936264 140528093734080 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 32, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0608 14:43:14.012376 140528093734080 deeplab.py:96] Setting pooling size to (33, 65)
I0608 14:43:14.012487 140528093734080 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0608 14:43:16.160442 140528093734080 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0608 14:43:16.160532 140528093734080 controller.py:397] initialized model.
initialized model.
I0608 14:43:16.680134 140528093734080 api.py:446] Eval with scales ListWrapper([1.0])
I0608 14:43:17.362730 140528093734080 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0608 14:43:17.376272 140528093734080 api.py:446] Eval scale 1.0; setting pooling size to [33, 65]
I0608 14:43:19.421471 140528093734080 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0608 14:43:19.684617 140528093734080 controller.py:486] saved checkpoint to Model_Output/ResNet50-DeepLavV3+SemanticSegmentation/ckpt-0.
saved checkpoint to Model_Output/ResNet50-DeepLavV3+SemanticSegmentation/ckpt-0.
I0608 14:43:19.684976 140528093734080 controller.py:236] train | step: 0 | training until step 60000...
train | step: 0 | training until step 60000...
2022-06-08 14:43:19.714981: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Traceback (most recent call last):
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/deeplab2/trainer/train.py", line 76, in <module>
app.run(main)
File "/home/prabal/anaconda3/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/prabal/anaconda3/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/deeplab2/trainer/train.py", line 71, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/deeplab2/trainer/train_lib.py", line 190, in run_experiment
controller.train(
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/models/orbit/controller.py", line 240, in train
self._train_n_steps(num_steps)
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/models/orbit/controller.py", line 439, in _train_n_steps
train_output = self.trainer.train(num_steps_tensor)
File "/home/prabal/MachineLearning/MLAlgorithms/SemanticSegmentation/models/orbit/standard_runner.py", line 146, in train
self._train_loop_fn(self._train_iter, num_steps)
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 885, in call
result = self._call(*args, *kwds)
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 3039, in __call__
return graph_function._call_flat(
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 1963, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 591, in call
outputs = execute.execute(
File "/home/prabal/anaconda3/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input to DecodeRaw has length 16669 that is not a multiple of 4, the size of int32
[[{{node DecodeRaw}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[while/body/_1/IteratorGetNext]]
[[Func/while/body/_1/output_control_node/_3057/_103]]
(1) Invalid argument: Input to DecodeRaw has length 16669 that is not a multiple of 4, the size of int32
[[{{node DecodeRaw}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[while/body/_1/IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_36563]
Function call stack: loop_fn -> loop_fn
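For context on the error itself: DecodeRaw (tf.io.decode_raw) can only reinterpret a byte string whose length is an exact multiple of the target dtype size, 4 bytes for int32, which is why the 16669-byte label produced with --create_panoptic_data=false cannot be parsed by the panoptic input pipeline. A minimal, self-contained sketch of that TensorFlow behavior (not deeplab2 code; the 7-byte payload is made up for illustration):

import tensorflow as tf

# 7 bytes cannot be reinterpreted as int32 values (4 bytes each), so
# decode_raw raises the same InvalidArgumentError seen in the log above.
payload = tf.constant(b"\x00" * 7)
try:
    tf.io.decode_raw(payload, tf.int32)
except tf.errors.InvalidArgumentError as e:
    print(e)  # Input to DecodeRaw has length 7 that is not a multiple of 4, the size of int32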