tensorflow / models

Models and examples built with TensorFlow

[deeplab] Training deeplab model with ADE20K dataset #3730

Open walkerlala opened 6 years ago

walkerlala commented 6 years ago

System information

Describe the problem

This is a feature request. I am trying to train the deeplab model with the ADE20K dataset (see this presentation). I have finished the data format conversion and "successfully" trained the model on a small subset of ADE20K. Below is my modification to research/deeplab/datasets/segmentation_dataset.py, which is used to extract the segmentation data.

diff --git a/research/deeplab/datasets/segmentation_dataset.py b/research/deeplab/datasets/segmentation_dataset.py
index a777252..8648fb2 100644
--- a/research/deeplab/datasets/segmentation_dataset.py
+++ b/research/deeplab/datasets/segmentation_dataset.py
@@ -85,10 +85,22 @@ _PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
     ignore_label=255,
 )

+_ADE20K_INFORMATION = DatasetDescriptor(
+    splits_to_sizes = {
+        'train': 40,
+        'val': 5,
+    },
+    # TODO temporarily change it to 21 otherwise dimension mismatch
+    num_classes=21,
+    ignore_label=255,
+)
+

 _DATASETS_INFORMATION = {
     'cityscapes': _CITYSCAPES_INFORMATION,
     'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
+    'ade20k': _ADE20K_INFORMATION,
 }

 # Default file pattern of TFRecord of TensorFlow Example.

The problem is that the ADE20K dataset has 150 classes, which differs from the VOC and Cityscapes datasets. That causes a problem with the checkpoint files: currently there are pretrained models only for the VOC and Cityscapes datasets. So we have two choices here:

  1. Do not use the checkpoint file. In this case, there is an error:

    absl.flags._exceptions.IllegalFlagValueError: flag --tf_initial_checkpoint=None: Flag --tf_initial_checkpoint must be specified.
  2. Set num_classes=21 so that the two provided checkpoint files can be used.

Are there any alternatives to these?

If anyone has a workable solution for the ADE20K dataset, it would be really appreciated.
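For reference, a descriptor covering the full ADE20K label set could look like the sketch below; the split sizes, num_classes=151 (150 semantic classes plus background), and ignore_label=0 are assumptions based on the ADEChallengeData2016 release rather than values confirmed in this thread:

_ADE20K_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 20210,  # assumed size of the ADEChallengeData2016 training split
        'val': 2000,     # assumed size of the validation split
    },
    num_classes=151,     # 150 semantic classes + 1 background class
    ignore_label=0,      # assumed: 0 marks unlabeled pixels in ADE20K annotations
)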

aquariusjay commented 6 years ago
  1. You could modify the code here so that the exclude_list only includes `_LOGITS_SCOPE_NAME`, and also set the flag initialize_last_layer = False. (Note that you still want to restore the variables in ASPP, the decoder, and so on.) By doing so, only the weights in the last classification layer are left uninitialized, and you can then use a classification layer with 150 classes.

  2. You need to explore min_resize_value and max_resize_value (and set resize_factor = output_stride) for ADE20K, which contains images of widely varying sizes (dimensions range from about 50 to 2000 pixels). By setting min_resize_value and max_resize_value, you can resize the images on the fly to a similar range (or you can do that manually yourself while pre-processing the dataset). Note, however, that these hyper-parameters may affect performance, and we have not yet explored them carefully.
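As a hedged illustration of suggestion 2, the resize flags could be passed on the training command line roughly as follows (the values are placeholders rather than tuned settings, the flags are assumed to be exposed through deeplab's common.py, and all other flags are omitted):

# Sketch: resize ADE20K images on the fly to a similar range during training.
# resize_factor matches the output_stride, per the suggestion above.
python deeplab/train.py \
  --dataset="ade20k" \
  --output_stride=16 \
  --resize_factor=16 \
  --min_resize_value=350 \
  --max_resize_value=500 \
  --initialize_last_layer=false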

walkerlala commented 6 years ago

@aquariusjay Thanks for the hints. I have now started training, using the provided VOC model checkpoint, setting fine_tune_batch_norm to False, using the mobilenet_v2 variant and a batch size of 8. Hopefully the loss will drop after several hours...

There are still two things confusing me:

  1. The segmentation annotation images in the ADE20K dataset have three channels, but I am reading them with label_reader = build_data.ImageReader('png', channels=1), as we do for the VOC dataset (in datasets/build_voc2012_data.py). Will that be a problem?

  2. Why do we have the resize_factor parameter?

walkerlala commented 6 years ago

Oh, would it be OK for me to prepare a pull request for the ADE20K dataset?

aquariusjay commented 6 years ago

Regarding your previous questions:

  1. The groundtruth images should contain only 1 channel, with values equal to the semantic labels (see the sanity-check sketch after this list).
  2. You could check the code for details.
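As a quick sanity check on point 1, one can load a converted annotation and confirm it is a single-channel label map (a minimal sketch using Pillow and NumPy; the file path is a placeholder):

import numpy as np
from PIL import Image

# Load one annotation and confirm it is a single-channel map of class indices.
label = np.asarray(Image.open('ADEChallengeData2016/annotations/training/ADE_train_00000001.png'))
print(label.shape)               # expect (height, width), i.e. one channel
print(label.min(), label.max())  # expect values within [0, num_classes)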

We currently do not have any plan to prepare that. However, note that one should be able to do it using the provided code/model/script. Also, any contribution of extra datasets to the codebase is welcome.

Cheers,

brett-whitford commented 6 years ago

@aquariusjay,

I'm currently having similar issues attempting to train with a custom dataset and was hoping you could offer some insight.

You could modify the code here so that the exclude_list only includes `_LOGITS_SCOPE_NAME` and also set the flag initialize_last_layer = False.

The link you included ("here") appears to require a Google SSO login. I am assuming that was a link to the train_utils.py script. Here are the changes I have currently made to run your architecture on my custom dataset:

  1. segmentation_dataset.py
_TOY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 800,
        'trainval': 1000,
        'val': 200,
    },
    num_classes=10,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'toy_dataset': _TOY_DATASET_INFORMATION,
}
  2. train.py
flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')

flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')
  3. train_utils.py
  exclude_list = ['_LOGITS_SCOPE_NAME']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)
  4. eval.py
flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')

However, when I run this, the code appears to train successfully, but then it runs into an issue with the confusion matrix during evaluation (I include the traceback below for reference). Any tips/suggestions on how to fix this?

Thanks for your help! Brett

Error Traceback:

~/brett/wss-python/models/research/deeplab$ sh local_test_custom.sh 
Converting toy dataset...
>> Converting image 50/200 shard 0
>> Converting image 100/200 shard 1
>> Converting image 150/200 shard 2
>> Converting image 200/200 shard 3
>> Converting image 250/1000 shard 0
>> Converting image 500/1000 shard 1
>> Converting image 750/1000 shard 2
>> Converting image 1000/1000 shard 3
>> Converting image 200/800 shard 0
>> Converting image 400/800 shard 1
>> Converting image 600/800 shard 2
>> Converting image 800/800 shard 3
--2018-03-30 12:33:03--  http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.8.176, 2607:f8b0:4009:80d::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.8.176|:80... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.

toy_dataset
INFO:tensorflow:Training on trainval set
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-11
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 11.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
toy_dataset
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 200
INFO:tensorflow:Eval batch size 1 and num batch 200
INFO:tensorflow:Waiting for new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-03-30-16:35:58
Traceback (most recent call last):
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 168, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 452, in evaluate_repeatedly
    session.run(eval_ops, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
     [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]

Caused by op u'mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert', defined at:
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 142, in main
    predictions, labels, dataset.num_classes, weights=weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 1009, in mean_iou
    num_classes, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 263, in _streaming_confusion_matrix
    labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/confusion_matrix.py", line 183, in confusion_matrix
    message='`predictions` out of bound')],
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 579, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 177, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2027, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1868, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 175, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 48, in _assert
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
     [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]
walkerlala commented 6 years ago
  1. train_utils.py

    • I modify the code here so that the exclude_list only includes `_LOGITS_SCOPE_NAME`, as you stated above.

    exclude_list = ['_LOGITS_SCOPE_NAME']
    if not initialize_last_layer:
        exclude_list.extend(last_layers)

this should be

exclude_list = [_LOGITS_SCOPE_NAME]

That is, _LOGITS_SCOPE_NAME is a variable defined elsewhere in the codebase (search for it), not a string literal.
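Putting the correction together, a minimal sketch of the intended change (the import is an assumption; a later comment in this thread notes that _LOGITS_SCOPE_NAME is defined in model.py):

from deeplab import model

# Exclude only the logits scope, so every other pretrained weight is restored
# while a new classification head with the custom number of classes is
# trained from scratch.
exclude_list = [model._LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)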

wonderit commented 6 years ago

@walkerlala

I am trying to train the deeplab model with the ADE20K dataset. I'm having some problems with the data format conversion.
Would you mind sharing your code for the ADE20K dataset? It would be really appreciated.

shipengai commented 6 years ago

@brett-whitford When I use my own data I get the same error as you. Can you share your solution? Thank you very much. I'm looking forward to your reply.

walkerlala commented 6 years ago

@wonderit Of course. Please wait a while until I have access to my GPU server.

walkerlala commented 6 years ago

@wonderit Here is the patch for converting training data and training deeplabv3 on ADE20K.

https://gist.github.com/walkerlala/82d978e68407e65158e8825cd470d7e1

(it can also be found at http://fastdrivers.org/misc/patch-for-ade20k.patch )

You can apply this patch on top of commit 1d38a22535866f2e19a4eb0fc623fa768fb08dcf or 5281c9a028f6fc344357c2c9e0c06c171e16dfa4 without conflict.

Note:

  1. You may need to manually adjust the paths in train_ade20k.py for training, and supply the correct path to the training data when converting the data, as documented in the doc.

  2. training data can be found at: http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip

I am also going to submit a PR to get these into the repo. However, I don't have enough GPUs to get a good pretrained model (I only have two Nvidia 1080s...). If you can obtain a decent pretrained model, please share!
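For anyone who cannot apply the patch directly, the core of the conversion follows the same pattern as build_voc2012_data.py; below is a condensed sketch (the directory layout, file naming, and shard count are assumptions about the ADEChallengeData2016 release, and build_data refers to the existing module in research/deeplab/datasets):

import math
import os

import tensorflow as tf

import build_data  # research/deeplab/datasets/build_data.py


def convert_split(image_dir, label_dir, split, output_dir, num_shards=4):
  """Writes one ADE20K split (JPEG images + 1-channel PNG labels) to TFRecords."""
  filenames = sorted(os.listdir(image_dir))
  num_per_shard = int(math.ceil(len(filenames) / float(num_shards)))
  image_reader = build_data.ImageReader('jpeg', channels=3)
  label_reader = build_data.ImageReader('png', channels=1)
  for shard_id in range(num_shards):
    output_path = os.path.join(
        output_dir, '%s-%05d-of-%05d.tfrecord' % (split, shard_id, num_shards))
    with tf.python_io.TFRecordWriter(output_path) as writer:
      start = shard_id * num_per_shard
      end = min((shard_id + 1) * num_per_shard, len(filenames))
      for i in range(start, end):
        image_data = tf.gfile.FastGFile(
            os.path.join(image_dir, filenames[i]), 'rb').read()
        height, width = image_reader.read_image_dims(image_data)
        seg_data = tf.gfile.FastGFile(
            os.path.join(label_dir, filenames[i].replace('.jpg', '.png')),
            'rb').read()
        seg_height, seg_width = label_reader.read_image_dims(seg_data)
        if height != seg_height or width != seg_width:
          raise RuntimeError('Image/label shape mismatch: ' + filenames[i])
        example = build_data.image_seg_to_tfexample(
            image_data, filenames[i], height, width, seg_data)
        writer.write(example.SerializeToString())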

walkerlala commented 6 years ago

Also, anyone interested in adding ADE20K to deeplabv3 can take a look at this PR I just created: https://github.com/tensorflow/models/pull/3853

shipengai commented 6 years ago

@walkerlala When running eval.py, did you get the `predictions out of bound` error, the same as @brett-whitford's question? Thank you

shipengai commented 6 years ago

@walkerlala Can you share your eval script?

hhwxxx commented 6 years ago

@walkerlala @aquariusjay Hi, I am confused about the exclude_list and initialize_last_layer.

I am not sure whether I understand it correctly: if one wants to fine-tune DeepLab v3+ on another dataset, does only _LOGITS_SCOPE_NAME need to be excluded?

If so, following @aquariusjay 's suggestion, in "train_utils.py":

exclude_list = [_LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)

If initialize_last_layer is set to false, then exclude_list will include the last_layers. In train.py, last_layers is the list [_LOGITS_SCOPE_NAME, _IMAGE_POOLING_SCOPE, _ASPP_SCOPE, _CONCAT_PROJECTION_SCOPE, _DECODER_SCOPE], so all variables in those scopes will be excluded. This seems inconsistent.

Shouldn't it be the following: initialize_last_layer=true and exclude_list = [_LOGITS_SCOPE_NAME]?

lydialixia commented 6 years ago

Hi, I'm training on my own dataset as well (only two classes).

When I set initialize_last_layer=false and

exclude_list = ['logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

Then when I run vis.py, it gives me all-black images (not binary).

When I only set initialize_last_layer=false, I get binary images (the result is not good, but it at least shows some learning). But train.py then gives me this:

INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 6390723.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

even though training_number_of_steps=100000.

Does anyone know why this happens? Thanks!

hhwxxx commented 6 years ago

@lydialixia Hello. You should add 'global_step' to the exclude_list:

exclude_list = ['global_step']

But I am still confused about whether one should set initialize_last_layer=false when fine-tuning DeepLab v3+ on another task.

aquariusjay commented 6 years ago

When you want to fine-tune DeepLab on other datasets, there are a few cases:

  1. You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).

  2. You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.

  3. You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
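For example, case 3 (the common situation when num_classes differs) corresponds to a training command along these lines (a sketch only; the checkpoint path is a placeholder and all other flags are omitted):

# Case 3: restore all pretrained weights except the logits, then train a new
# classification head sized for the custom dataset.
python deeplab/train.py \
  --initialize_last_layer=false \
  --last_layers_contain_logits_only=true \
  --tf_initial_checkpoint="/path/to/deeplabv3_pascal_train_aug/model.ckpt"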

georgosgeorgos commented 6 years ago

Hi @walkerlala: did you manage to fine-tune on the ADE20K dataset? I'm trying to fine-tune on a dataset of the same size, but without success: after the first ~2K iterations the loss stops decreasing and starts to oscillate (through ~20K iterations). I have tried different learning rates and removed the regularization, but so far there is no improvement.

walkerlala commented 6 years ago

@georgosgeorgos No, in the end I couldn't fine-tune the model on the ADE20K dataset. I don't have enough GPU memory: every time I try to fine-tune the batch normalization parameters, training blows up with an out-of-memory error, so I freeze the batch normalization layers when training. In the end I only got a model with "modest" performance:

Here is the original image (too large to display here): http://www.fastdrivers.org/misc/stuffseg-origin.jpg

Here is the segmentation result: [DeepLab segmentation result image]

However, I can get a satisfying result with PSPNet:

[PSPNet segmentation result image]

According to the slides from the 2017 COCO + Places Workshop, deeplabv3 should also be able to do that, but I haven't had any luck fine-tuning it. Hopefully Google can provide a fine-tuned pretrained model in the future @aquariusjay.

cfosco commented 6 years ago

@brett-whitford - Hi Brett, I am having the exact same problem as you. How did you end up solving it?

cfosco commented 6 years ago

@shipeng-uestc - Hi shipeng, did you manage to solve the issue? I am currently using exclude_list = [_LOGITS_SCOPE_NAME] with _LOGITS_SCOPE_NAME imported from deeplab.model, as @walkerlala suggested, but I am still getting the same error as Brett.

jiyongma commented 6 years ago

When I run

python deeplab/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size=513 --eval_crop_size=513 --dataset="ade20k" --checkpoint_dir="./deeplab/datasets/ADE20K/exp/train_on_train_set/train" --eval_logdir="./deeplab/datasets/ADE20K/exp/train_on_train_set/eval" --dataset_dir="./deeplab/datasets/ADE20K/tfrecord"

I get:

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]] [[Node: save/RestoreV2/_299 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Please help me! Thanks.

qmy612 commented 6 years ago

@hhwxxx Hello, in your answer to lydialixia, do you mean that in train_utils.py the exclude_list should look like this: exclude_list = ['global_step'] and exclude_list = ['logits']?

But I still can't start training; the log shows:

INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

I have also tried exclude_list = ['_LOGITS_SCOPE_NAME'], but this doesn't work. When I just set exclude_list = ['global_step'], the model achieves mean IoU = 0.93 after 10000 iterations; I don't know whether this is wrong. Waiting online, thank you!

hhwxxx commented 6 years ago

@qmy612

Hello. Maybe you can try this: exclude_list = ['global_step', 'logits']

As for _LOGITS_SCOPE_NAME, it is defined in model.py, so you should use it like this: model._LOGITS_SCOPE_NAME.

And I have no idea about miou=0.93.

BeSlower commented 6 years ago

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

holyprince commented 6 years ago

@BeSlower Yes, that solution works for me, but there is another problem: the result is all black with no other labels, even though the loss decreases during training. Can anyone help me?

xianshunw commented 6 years ago

@qmy612 Did you get the problem solved? I am having the exact same problem as you.

qmy612 commented 6 years ago

@xiangjinwu Yes, the answer from hhwxxx works: exclude_list = ['global_step', 'logits']

Soulempty commented 6 years ago

@aquariusjay Hello, I am training my own dataset, which has only one class (excluding unlabeled) and the same style as Cityscapes, on DeepLab, but some problems keep occurring. One is that the server always restarts during training. The other is that the result is only one color, the class I labeled. Can you give me some advice? Thanks.

xianshunw commented 6 years ago

@qmy612 Thanks a lot, it works.

aquariusjay commented 6 years ago

@Soulempty, Regarding your questions:

  1. I have no idea what you mean by "the server always restarts". Could you please provide more details, such as logs?
  2. In your case, the data samples may be strongly biased toward one of the classes. That is why the model only predicts one class in the end. To handle that, I would suggest using a larger loss_weight for the under-sampled class (i.e., the class that has fewer data samples). You could modify the weights at line 72 by doing something like weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0, where you need to tune label0_weight and label1_weight (e.g., set label0_weight=1 and increase label1_weight).
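As a code fragment, that change could look like the sketch below (scaled_labels and ignore_label come from the surrounding function in train_utils.py; label0_weight and label1_weight are the tunable values mentioned above):

# Up-weight the under-sampled foreground class (label 1) relative to the
# background class (label 0); keep ignore_label pixels at weight 0.
label0_weight = 1.0    # e.g. background / "not road"
label1_weight = 10.0   # e.g. "road"; increase until the class is predicted
weights = (tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight +
           tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight +
           tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0)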
Soulempty commented 6 years ago

@aquariusjay Thank you for your detailed solution. I want to give you more details about my problems.

  1. My dataset is modified to the style of Cityscapes, but it has only one class ("road"), so the ground truth labels contain only road pixels and ground pixels (which are not labelled).

  2. The following is my ground truth label: [ground truth label image]

  3. The following is my JSON label: {"imgWidth": 1280, "imgHeight": 1080, "objects": [{"label": "road", "polygon": [[1.0, 612.0], [0.0, 953.0], [407.1, 965.1], [711.0, 963.4], [1094.2, 970.3], [1147.7, 963.4], [1185.9, 961.7], [1279.1, 969.9], [1279.0, 696.0], [918.7, 584.6], [881.0, 573.1], [837.4, 561.6], [821.4, 564.1], [795.0, 565.4], [769.2, 565.2], [769.8, 589.9], [763.2, 600.3], [716.7, 603.5], [706.3, 601.4], [703.5, 578.0], [709.2, 566.3], [702.5, 565.2], [697.8, 573.7], [682.6, 571.6], [671.2, 574.8], [666.5, 579.1], [660.8, 582.2], [632.4, 582.2], [624.8, 580.1], [619.5, 569.3], [422.2, 582.2], [427.8, 613.5], [426.5, 646.0], [418.9, 654.5], [367.2, 664.5], [355.9, 667.3], [258.7, 665.9], [247.3, 664.5], [233.4, 640.4], [227.0, 598.3]]}]}

  4. The following is part of the Cityscapes label script:

labels = [
    #       name          id   trainId   category   catId   hasInstances   ignoreInEval   color
    Label(  'unlabeled' ,  0 ,     255 , 'void'    , 0     , False        , True         , (  0,   0,   0) ),
    Label(  'road'      ,  1 ,       1 , 'flat'    , 1     , False        , False        , (128,  64, 128) ),
]

Soulempty commented 6 years ago

The picture below is the prediction result; the colour shown is the colour of the road class, but there is no ground colour.

[prediction result image]

shanyucha commented 6 years ago

@aquariusjay I got black images when using the default loss_weight. Setting the loss_weight solved my problem, since my data are imbalanced.

Soulempty commented 6 years ago

@aquariusjay Hello, when I train my dataset, which has only one class (the label is "road"), with the background set to unlabeled, I keep getting the same loss of 0.2622. Can you give some advice on how to train a dataset with one class? I think this is important for other people as well. Thank you. The following are some details: [training configuration and loss screenshots]

aquariusjay commented 6 years ago

@Soulempty Your question is not related to this issue (ADE20K). Could you please open a new one, so that people who have had similar experiences can share (e.g., @shanyucha)? I do not have access to your dataset, and it usually takes experimentation to tune the hyper-parameters.

Soulempty commented 6 years ago

Thank you. I think I have solved the problem of how to train a dataset with one class, inspired by your first piece of advice.

parachutel commented 6 years ago

@brett-whitford To solve this problem, you could inspect the maximum pixel value in the pre-processed grayscale images (after they have been processed by remove_gt_colormap.py). Your num_classes should be greater than the maximum pixel value in the images.
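A quick way to check this (a sketch; the glob pattern is a placeholder for wherever the converted grayscale annotations live, and 255 is assumed to be the ignore_label):

import glob

import numpy as np
from PIL import Image

# Find the largest label value across the pre-processed ground-truth images.
# Every value other than ignore_label (255) must be strictly less than num_classes.
max_value = 0
for path in glob.glob('datasets/toy_dataset/SegmentationClassRaw/*.png'):
    labels = np.asarray(Image.open(path))
    labels = labels[labels != 255]   # drop ignore_label pixels
    if labels.size:
        max_value = max(max_value, int(labels.max()))
print('max label value (excluding 255):', max_value)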

RomRoc commented 6 years ago

I retrained deeplab with the ADE20K dataset in my Google Colab notebook; below are the results with MobileNet-v2 and Xception_65 as the initial checkpoint. However, I couldn't fine-tune because of an OOM error. Maybe others can share training parameters to get better results?

MobileNet-v2: [result image after 2000 iterations, batch size 2]

Xception_65: [result image after 2000 iterations, batch size 2]

GWwangshuo commented 6 years ago

@Soulempty Could you please share more details about how to train a custom dataset with only one class? I would really appreciate it. Thanks!

Soulempty commented 6 years ago

Just as in the details I showed above, but set the trainId of unlabelled to 1.

GWwangshuo commented 6 years ago

@Soulempty Thanks. I still feel confused, since I have no idea what the label variable is or where I can find it. [screenshot]

Soulempty commented 6 years ago

the ground truth label

GWwangshuo commented 6 years ago

Where can I find it? Thanks a lot.


GWwangshuo commented 6 years ago

@lydialixia Could you please share a more detailed tutorial about how to train a custom dataset with two classes?

GWwangshuo commented 6 years ago

@Soulempty I am sorry, but I still cannot figure out how to train a custom dataset with two classes. Could you give a tutorial on how to do it? Thanks very much!

Soulempty commented 6 years ago

My dataset has the same style as Cityscapes. What is your data like?

GWwangshuo commented 6 years ago

@Soulempty Thanks for your reply. My dataset is from Kaggle, https://www.kaggle.com/c/ultrasound-nerve-segmentation/data.

This dataset contains 5635 images in total. (I split it into a training set of 4000 images and a validation set of 1635 images.)

The original image and its corresponding mask are shown below:

[ultrasound image and segmentation mask]

I have changed the images in the training set to the .jpg extension and the images in the validation set to .png. Then I saved them in the VOC2012 style, which is shown below:

[directory structure screenshot]

Then I followed the tutorial by @brettkoonce, but it seems something is wrong with the training procedure.

urgonguyen commented 6 years ago

@RomRoc I am retraining on ADE20K too. The link to download the dataset may have changed (http://groups.csail.mit.edu/vision/datasets/ADE20K/), right? Could you share what you changed in the code to retrain on ADE20K? Thanks

RomRoc commented 6 years ago

@urgonguyen Check my Jupyter notebook that runs in Google Colab here. To download ADE20K and convert it, you should use the download_and_convert_ade20k.sh script.