tensorflow / models

Models and examples built with TensorFlow

deeplab v3+ training, the mask is all black #9096

Open lxyzler opened 4 years ago

lxyzler commented 4 years ago

I used pascal_voc_seg to train deeplabv3+ without a tf_initial_checkpoint, but it predicts nothing; the mask is all black. I'm sure the training dataset is correct, and the mask data is uint8:

```python
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> img = cv2.imread('2008_003626.png')
>>> import numpy as np
>>> np.unique(img)
array([ 0, 15], dtype=uint8)
```
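Note that `cv2.imread` loads three BGR channels by default, so a single-channel read makes this label check unambiguous. A minimal sketch, using the same file as above:

```python
import cv2
import numpy as np

# cv2.imread defaults to IMREAD_COLOR, which replicates a grayscale
# label into three identical BGR channels; reading single-channel
# avoids any ambiguity when inspecting class ids.
mask = cv2.imread('2008_003626.png', cv2.IMREAD_GRAYSCALE)

print(mask.dtype)       # expect uint8
print(np.unique(mask))  # expect VOC class ids in [0, 20], plus 255 for ignore
```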

If I use the deeplabv3_mnv2_pascal_trainval model as the initial checkpoint, the results are normal. What is the problem?

ravikyram commented 4 years ago

@lxyzler

Can you please check this link and see if it helps? Also, could you share a Colab link or a code snippet that reproduces the issue in our environment? It helps us localize the issue faster. Thanks!

lxyzler commented 4 years ago

> @lxyzler
>
> Can you please check this link and see if it helps? Also, could you share a Colab link or a code snippet that reproduces the issue in our environment? It helps us localize the issue faster. Thanks!

My label images have already been converted to grayscale (see the attached 2008_006081).
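For context, the VOC SegmentationClass annotations are palettized PNGs whose pixel values are already class indices, so converting them to grayscale amounts to dropping the palette; the repo's `research/deeplab/datasets/remove_gt_colormap.py` does essentially this. A minimal sketch of the idea (filenames taken from the example above):

```python
import numpy as np
from PIL import Image

# The official VOC annotation is a palette ('P' mode) image; its raw
# pixel values are the class indices, and the palette only adds color.
palettized = Image.open('2008_006081.png')
label = np.array(palettized, dtype=np.uint8)  # class indices, no colormap
Image.fromarray(label).save('2008_006081_gray.png')
```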

I use the official deeplabv3+ code. If the official pre-trained model is loaded, it trains normally and the results are normal, but I want to train my own model on PASCAL from scratch, and then the prediction result is black. Here is my parameter configuration:

```python
flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy.')
flags.DEFINE_boolean('clone_on_cpu', False, 'Use CPUs to deploy clones.')
flags.DEFINE_integer('num_replicas', 1, 'Number of worker replicas.')
flags.DEFINE_integer('startup_delay_steps', 15,
                     'Number of training steps between replicas startup.')
flags.DEFINE_integer(
    'num_ps_tasks', 0,
    'The number of parameter servers. If the value is 0, then '
    'the parameters are handled locally by the worker.')
flags.DEFINE_string('master', '', 'BNS name of the tensorflow server')
flags.DEFINE_integer('task', 0, 'The task ID.')
flags.DEFINE_string('train_logdir', './model_voc/',
                    'Where the checkpoint and logs are stored.')
flags.DEFINE_integer('log_steps', 10,
                     'Display logging information at every log_steps.')
flags.DEFINE_integer('save_interval_secs', 1200,
                     'How often, in seconds, we save the model to disk.')
flags.DEFINE_integer('save_summaries_secs', 600,
                     'How often, in seconds, we compute the summaries.')
flags.DEFINE_boolean(
    'save_summaries_images', False,
    'Save sample inputs, labels, and semantic predictions as '
    'images to summary.')
flags.DEFINE_string('profile_logdir', None,
                    'Where the profile files are stored.')
flags.DEFINE_enum('optimizer', 'momentum', ['momentum', 'adam'],
                  'Which optimizer to use.')
flags.DEFINE_enum('learning_policy', 'poly', ['poly', 'step'],
                  'Learning rate policy for training.')
flags.DEFINE_float('base_learning_rate', 3e-5,
                   'The base learning rate for model training.')
flags.DEFINE_float('decay_steps', 0.0,
                   'Decay steps for polynomial learning rate schedule.')
flags.DEFINE_float('end_learning_rate', 0.0,
                   'End learning rate for polynomial learning rate schedule.')
flags.DEFINE_float('learning_rate_decay_factor', 0.1,
                   'The rate to decay the base learning rate.')
flags.DEFINE_integer('learning_rate_decay_step', 2000,
                     'Decay the base learning rate at a fixed step.')
flags.DEFINE_float('learning_power', 0.9,
                   'The power value used in the poly learning policy.')
flags.DEFINE_integer('training_number_of_steps', 300000,
                     'The number of steps used for training.')
flags.DEFINE_float('momentum', 0.9, 'The momentum value to use.')
flags.DEFINE_float('adam_learning_rate', 0.001,
                   'Learning rate for the adam optimizer.')
flags.DEFINE_float('adam_epsilon', 1e-08, 'Adam optimizer epsilon.')
flags.DEFINE_integer('train_batch_size', 16,
                     'The number of images in each batch during training.')
flags.DEFINE_float('weight_decay', 0.00004,
                   'The value of the weight decay for training.')
flags.DEFINE_list('train_crop_size', '513,513',
                  'Image crop size [height, width] during training.')
flags.DEFINE_float(
    'last_layer_gradient_multiplier', 1.0,
    'The gradient multiplier for last layers, which is used to '
    'boost the gradient of last layers if the value > 1.')
flags.DEFINE_boolean('upsample_logits', True,
                     'Upsample logits during training.')
flags.DEFINE_float(
    'drop_path_keep_prob', 1.0,
    'Probability to keep each path in the NAS cell when training.')
flags.DEFINE_string('tf_initial_checkpoint', None,
                    'The initial checkpoint in tensorflow format.')
flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')
flags.DEFINE_boolean('last_layers_contain_logits_only', False,
                     'Only consider logits as last layers or not.')
flags.DEFINE_integer('slow_start_step', 0,
                     'Training model with small learning rate for few steps.')
flags.DEFINE_float('slow_start_learning_rate', 1e-4,
                   'Learning rate employed during slow start.')
flags.DEFINE_boolean('fine_tune_batch_norm', True,
                     'Fine tune the batch norm parameters or not.')
flags.DEFINE_float('min_scale_factor', 0.5,
                   'Minimum scale factor for data augmentation.')
flags.DEFINE_float('max_scale_factor', 2.,
                   'Maximum scale factor for data augmentation.')
flags.DEFINE_float('scale_factor_step_size', 0.25,
                   'Scale factor step size for data augmentation.')
flags.DEFINE_multi_integer('atrous_rates', None,
                           'Atrous rates for atrous spatial pyramid pooling.')
flags.DEFINE_integer('output_stride', 16,
                     'The ratio of input to output spatial resolution.')
flags.DEFINE_integer(
    'hard_example_mining_step', 0,
    'The training step in which exact hard example mining kicks off. Note we '
    'gradually reduce the mining percent to the specified '
    'top_k_percent_pixels. For example, if hard_example_mining_step=100K and '
    'top_k_percent_pixels=0.25, then mining percent will gradually reduce from '
    '100% to 25% until 100K steps after which we only mine top 25% pixels.')
flags.DEFINE_float(
    'top_k_percent_pixels', 1.0,
    'The top k percent pixels (in terms of the loss values) used to compute '
    'loss during training. This is useful for hard pixel mining.')
flags.DEFINE_integer(
    'quantize_delay_step', -1,
    'Steps to start quantized training. If < 0, will not quantize model.')
flags.DEFINE_string('dataset', 'pascal_voc_seg',
                    'Name of the segmentation dataset.')
flags.DEFINE_string('train_split', 'train',
                    'Which split of the dataset to be used for training.')
flags.DEFINE_string('dataset_dir', 'tfrecord_voc/',
                    'Where the dataset resides.')
```
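With these values, the 'poly' policy keeps the learning rate at or below base_learning_rate = 3e-5 for the entire 300k-step run (and slow_start_step = 0 means there is no warmup phase). A minimal sketch of the schedule, using the standard poly formula rather than the repo's implementation:

```python
# lr(step) = base_lr * (1 - step / training_steps) ** power,
# clamped below by end_learning_rate; defaults mirror the flags above.
def poly_learning_rate(step, base_lr=3e-5, training_steps=300000,
                       power=0.9, end_lr=0.0):
    decay = (1.0 - float(step) / training_steps) ** power
    return max(base_lr * decay, end_lr)

print(poly_learning_rate(0))       # 3e-05 at step 0
print(poly_learning_rate(150000))  # ~1.6e-05 halfway through the run
```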

I tried to modify the learning rate, but it didn't work

I used model.ckpt-0 to predict; that result still has other colors (see the attached 2008_005376). After about 2000 steps, the predictions are all black.

The training loss is as follows:

```
I0805 10:06:41.963682 140353793406720 supervisor.py:1050] Recording summary at step 0.
I0805 10:06:45.311434 140372148655936 learning.py:507] global step 10: loss = 3.0091 (0.261 sec/step)
I0805 10:06:47.986760 140372148655936 learning.py:507] global step 20: loss = 2.9651 (0.262 sec/step)
I0805 10:06:50.666162 140372148655936 learning.py:507] global step 30: loss = 2.8692 (0.258 sec/step)
I0805 10:06:53.431875 140372148655936 learning.py:507] global step 40: loss = 2.8412 (0.284 sec/step)
I0805 10:06:56.152063 140372148655936 learning.py:507] global step 50: loss = 2.7127 (0.266 sec/step)
I0805 10:06:58.863685 140372148655936 learning.py:507] global step 60: loss = 2.7303 (0.269 sec/step)
I0805 10:07:01.630717 140372148655936 learning.py:507] global step 70: loss = 2.4736 (0.275 sec/step)
I0805 10:07:04.425113 140372148655936 learning.py:507] global step 80: loss = 2.4391 (0.286 sec/step)
I0805 10:07:07.155676 140372148655936 learning.py:507] global step 90: loss = 2.4280 (0.276 sec/step)
I0805 10:07:09.966064 140372148655936 learning.py:507] global step 100: loss = 2.4852 (0.268 sec/step)
...
I0814 16:33:50.550069 140685764171520 supervisor.py:1117] Saving checkpoint to path ./model_voc/model.ckpt
I0814 16:33:50.866480 140685780956928 supervisor.py:1050] Recording summary at step 190737.
I0814 16:33:52.291256 140704146675520 learning.py:507] global step 190740: loss = 1.2860 (0.515 sec/step)
I0814 16:33:57.147122 140704146675520 learning.py:507] global step 190750: loss = 1.3965 (0.517 sec/step)
I0814 16:34:02.050992 140704146675520 learning.py:507] global step 190760: loss = 1.5782 (0.500 sec/step)
I0814 16:34:06.947767 140704146675520 learning.py:507] global step 190770: loss = 1.6702 (0.514 sec/step)
I0814 16:34:11.845656 140704146675520 learning.py:507] global step 190780: loss = 1.3798 (0.466 sec/step)
I0814 16:34:16.659204 140704146675520 learning.py:507] global step 190790: loss = 1.3753 (0.481 sec/step)
I0814 16:34:21.477733 140704146675520 learning.py:507] global step 190800: loss = 1.6170 (0.456 sec/step)
```

lxyzler commented 4 years ago

@ravikyram @jvishnuvardhan Do you have any ideas? What should I pay attention to when training without loading a pre-trained model?

Winnie1128 commented 3 years ago

@lxyzler Hello, have you solved this problem? I've run into the same situation.

Well, I finally solved the black-mask problem by changing last_layer_gradient_multiplier to 2 and last_layers_contain_logits_only to True. The loss-weight imbalance in the newest version is located at about line 267 (loss_weight).
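Against the flag definitions quoted earlier in this thread, that fix corresponds to the following overrides. A sketch only; with the official train.py these would normally be passed on the command line instead:

```python
from absl import flags  # assuming absl-style flags, as in the config above

# Changed values relative to the configuration posted earlier; the
# command-line equivalent would be, e.g.,
#   --last_layer_gradient_multiplier=2 --last_layers_contain_logits_only=true
flags.DEFINE_float('last_layer_gradient_multiplier', 2.0,
                   'Boost the gradient of the randomly initialized last layers.')
flags.DEFINE_boolean('last_layers_contain_logits_only', True,
                     'Only consider logits as last layers or not.')
```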

Thunder003 commented 3 years ago

I'm also facing the same issue. I tried the solution mentioned by @Winnie1128, but it is not working. I faced this issue earlier too, and after a few attempts it worked without my changing anything, but now it doesn't. Any other solution? The issue originally appeared when I used the pretrained xception_65 network, but now my first trained network (which was working fine earlier) is showing a black mask as well.

Jackbrocp commented 2 years ago

Have you solved this problem? I trained on the original PASCAL VOC dataset without using a pre-trained model; the predicted mask is all black and the eval result is 0 mIOU across the board.

DISAPPEARED13 commented 1 year ago

I'm confused because I've met the same problem; the gradients don't even update. On the first iteration I got a probability map where every class, including the background, shared the same probability. Could anyone please advise me on where to start solving this? I've checked the input and the preprocessing, so this error shouldn't be happening. I'm wondering whether I need to change the network architecture.
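One check that fits most of the reports in this thread: scan the ground-truth PNGs for label values the loss cannot handle, and for masks that are entirely background; either can push a freshly initialized model toward predicting a single constant class. A minimal sketch (the directory name and the VOC constants are assumptions):

```python
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 21    # PASCAL VOC: 20 object classes + background
IGNORE_LABEL = 255  # pixels excluded from the loss

# Scan every annotation for out-of-range class ids, and count masks
# that contain nothing but background/ignore pixels.
background_only = 0
for path in glob.glob('SegmentationClassRaw/*.png'):
    label = np.array(Image.open(path))
    values = set(np.unique(label).tolist())
    bad = {v for v in values if v >= NUM_CLASSES and v != IGNORE_LABEL}
    if bad:
        print(f'{path}: unexpected label values {sorted(bad)}')
    if values <= {0, IGNORE_LABEL}:
        background_only += 1
print(f'{background_only} masks contain only background/ignore pixels')
```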