oandrienko / fast-semantic-segmentation

ICNet and PSPNet-50 in Tensorflow for real-time semantic segmentation

Stage 2 - Compression and Retraining #14

Closed: JanLin0817 closed this issue 5 years ago

JanLin0817 commented 5 years ago

Hi, I followed the documentation step by step, from training PSPNet to retraining ICNet. Everything worked fine until the last step: when I retrain ICNet after compressing it, I get the error below.

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Assign requires shapes of both tensors to match. lhs shape= [1,1,256,3] rhs shape= [1,1,512,3]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@CascadeFeatureFusion_0/AuxOutput/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](CascadeFeatureFusion_0/AuxOutput/weights, save/RestoreV2:1)]]

It seems that after ICNet is compressed with filter=0.5, some layers in the model no longer match the checkpoint. Or maybe this is an issue with TensorFlow Slim.

Caused by op u'save/Assign_1', defined at:
  File "train_mem_saving.py", line 192, in <module>
    tf.app.run()
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train_mem_saving.py", line 188, in main
    gradient_checkpoints=checkpoint_nodes)
  File "/home/idata/LDM/test/fast-semantic-segmentation/libs/trainer.py", line 217, in train_segmentation_model
    ignore_missing_vars=True)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 689, in assign_from_checkpoint_fn
    write_version=saver_pb2.SaverDef.V1)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 835, in _build_internal
    restore_sequentially, reshape)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 494, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 185, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 283, in assign
    validate_shape=validate_shape)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 60, in assign
    use_locking=use_locking, name=name)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/idata/anaconda3/envs/fastSS_1/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

oandrienko commented 5 years ago

Hey, I'm glad the documentation has been helpful so far. It seems like there is an issue with the compression script or the instructions I provided. From what I remember, the CascadeFeatureFusion_*/AuxOutput/* nodes should not be in the pruned output checkpoint.

Can you run python -m tensorflow.python.tools.inspect_checkpoint --file_name <YOUR_PRUNED_CHECKPOINT_FILE> and show me the result? I can try to look into this over the weekend when I have some time. Feel free to also send me an email.
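Once the variable names and shapes have been dumped, the offending entries can be found by comparing them against the shapes the training graph expects. The following is a minimal sketch, not part of this repository: `find_shape_mismatches` is a hypothetical helper, and the two dictionaries are filled in by hand from the error message above (lhs is the variable in the graph, rhs is the value stored in the checkpoint).

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for every variable
    that exists in both mappings but with different shapes."""
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches[name] = (list(graph_shape), list(ckpt_shape))
    return mismatches


# Shapes copied from the InvalidArgumentError in this thread.
graph_shapes = {
    "CascadeFeatureFusion_0/AuxOutput/weights": [1, 1, 256, 3],  # lhs
}
checkpoint_shapes = {
    "CascadeFeatureFusion_0/AuxOutput/weights": [1, 1, 512, 3],  # rhs
}

for name, (expected, found) in find_shape_mismatches(
        graph_shapes, checkpoint_shapes).items():
    print("%s: graph expects %s, checkpoint has %s" % (name, expected, found))
```

In practice the two dictionaries would be built from the inspect_checkpoint output and from the variables of the freshly built model rather than typed in by hand.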

JanLin0817 commented 5 years ago

Hello, here is the inspect_checkpoint output for my pruned checkpoint: icnet_pruned.zip. Thanks for your reply.

awiegersma commented 5 years ago

I'm getting the exact same error. Is there an update on this yet?

julienip commented 5 years ago

> I'm getting the exact same error. Is there an update on this yet?

When running compress.py, I get KeyError: 'Predictions/postrain/Conv2D'. How did you solve this?

Edit: I retrained everything with only one GPU and the KeyError is gone. I now have the same issue as described above.

oandrienko commented 5 years ago

@julienip @awiegersma @JanLin0817 Hey all, I am really sorry for not replying on this thread for so long. Totally my fault. I had been working for the last few months, had very little time, and was not able to contribute to any open source.

Now that I am finished, I went through everything and provided a major update through 42c6bbe. This should fix all the bugs with the compression script and also provides an update to the dataset and preprocessor builders.

@JanLin0817 The issue - which I completely forgot to document - is that the export script must be run before the compression script in order to generate Tensorflow checkpoints that do not have training nodes. In your specific case, the AuxOutput node is only added during training and so it is missing from the prune config (and thus does not get pruned resulting in the shape mismatch). Removal of all the training nodes through the export script is required to make walking through the graph during compression simple. I have updated the documentation here so hopefully no one else runs into this issue.
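The mismatch described above can be reproduced with shapes alone. This is a toy sketch under stated assumptions, not the logic of compress.py: it assumes kernels are stored as [height, width, in_channels, out_channels], that pruning only scales the input-channel dimension, and that the prune config is a plain set of variable names. `HypotheticalBlock/Conv/weights` is an invented name used for illustration; only the AuxOutput name and its shapes come from the error log.

```python
FILTER_SCALE = 0.5  # the "filter=0.5" setting mentioned in this thread


def scale_input_channels(shape, scale):
    """Scale the in_channels dimension of an [h, w, in, out] kernel shape."""
    h, w, c_in, c_out = shape
    return [h, w, int(c_in * scale), c_out]


def prune_checkpoint(checkpoint, prune_config, scale):
    """Prune only the variables listed in prune_config; copy the rest as-is."""
    return {
        name: scale_input_channels(shape, scale)
        if name in prune_config else list(shape)
        for name, shape in checkpoint.items()
    }


# A training checkpoint still contains the training-only AuxOutput branch.
checkpoint = {
    "CascadeFeatureFusion_0/AuxOutput/weights": [1, 1, 512, 3],
    "HypotheticalBlock/Conv/weights": [3, 3, 512, 512],  # invented name
}
# AuxOutput is missing from the prune config, so it is never pruned.
prune_config = {"HypotheticalBlock/Conv/weights"}

# The compressed model built for retraining expects every layer scaled.
graph = {name: scale_input_channels(shape, FILTER_SCALE)
         for name, shape in checkpoint.items()}
pruned = prune_checkpoint(checkpoint, prune_config, FILTER_SCALE)

mismatched = sorted(name for name in graph if graph[name] != pruned[name])
print(mismatched)
```

Only the AuxOutput variable ends up mismatched (256 expected, 512 stored), which is why exporting first, so that the training-only nodes are gone before compression, avoids the error.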

@julienip In regards to the key-error, it looks like there were also some problems with the compression configs I had uploaded. I have also fixed this in the latest merge to master.

As a note - the variable names have changed in the latest update to avoid the weird Prediction/postrain vs Prediction/pretrain convention I had when naming the PSPNet and ICNet output nodes. This is reflected in the updated models in the Model Zoo. If you have your own older checkpoints and want to use the updated codebase, you can also rename the nodes in your Tensorflow checkpoints like I did. I used a simple name conversion script to do this, which I found here. I renamed CascadeFeatureFusion_0 -> CascadeFeatureFusion and Predictions/postrain -> Predictions/Conv (the latter for fine-tuning from PSPNet; ICNet ignores all `Predictions` nodes).
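The two renames can be expressed as a small name-mapping function. This is only a sketch of the mapping step, not the conversion script referred to above: `RENAME_RULES` and `rename_variable` are hypothetical names, the rules match whole path components (hence the trailing slash), and reading the old checkpoint and writing the renamed variables back out is left to whatever checkpoint tooling you use.

```python
# Old scope prefix -> new scope prefix, as listed in the comment above.
RENAME_RULES = [
    ("CascadeFeatureFusion_0/", "CascadeFeatureFusion/"),
    ("Predictions/postrain/", "Predictions/Conv/"),
]


def rename_variable(name):
    """Apply every rename rule to a checkpoint variable name."""
    for old, new in RENAME_RULES:
        name = name.replace(old, new)
    return name


# Example variable names; the second one is assumed for illustration.
for old_name in ["CascadeFeatureFusion_0/AuxOutput/weights",
                 "Predictions/postrain/weights"]:
    print("%s -> %s" % (old_name, rename_variable(old_name)))
```

Names that match no rule are returned unchanged, so the function can be applied to every variable in a checkpoint.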

I did some quick testing of the whole PSPNet/ICNet pipeline, but if you guys find any more bugs please let me know (or even submit a PR if you can). Sorry again for the frustration and please let me know if this helps. I will be quicker to reply now.