About paper reproduction problem

Stonebobo commented 4 years ago

Hello, I want to ask some questions.

The checkpoint from my previous training run loads without problems in train_MSPFN, but loading the same trained model in test_MSPFN raises an error. The details are as follows:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "E:/bwl_python/MSPFN-me-7.3/model/test/test_MSPFN.py", line 48, in saver.restore(sess, '../MSPFN/epoch6')#93
  File "E:\anaconda\path\envs\bwltfgpu\lib\site-packages\tensorflow\python\training\saver.py", line 1302, in restore err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
  (0) Not found: Key generator/BCM2_0/down2_1/alpha not found in checkpoint
    [[node save/RestoreV2 (defined at /bwl_python/MSPFN-me-7.3/model/test/test_MSPFN.py:47) ]]
    [[save/RestoreV2/_453]]
  (1) Not found: Key generator/BCM2_0/down2_1/alpha not found in checkpoint
    [[node save/RestoreV2 (defined at /bwl_python/MSPFN-me-7.3/model/test/test_MSPFN.py:47) ]]
0 successful operations.
0 derived errors ignored.

In addition, which exact TensorFlow version do you use: 1.1 or 1.14? (P.S. I get version errors with TensorFlow 1.1, such as AttributeError: module 'tensorflow' has no attribute 'AUTO_REUSE'.)

Also, at epoch 5 (batch size 12, input images of 480*320), train_loss and edge_loss no longer change: train_loss = 0.00105 and edge_loss = 0.0010004.

Looking forward to your reply. Thank you!

kuijiang94 commented 4 years ago

Q1: This problem can be caused by the TensorFlow version, the available memory, or the graph. You can solve it by referring to https://blog.csdn.net/u010327061/article/details/84078583.
Q2: The TensorFlow version is 1.12 with CUDA 9.0.
Q3: How many training samples do you use, and how did you set the learning rate?
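
For the graph-related cause in Q1, one way to narrow it down is to compare the variable names the test graph expects with the names stored in the checkpoint. The following is a minimal sketch and is not taken from the MSPFN code. It assumes TensorFlow 1.x and that the test graph has already been built before the names are collected. The helper names and the second variable name in the demo are hypothetical; only generator/BCM2_0/down2_1/alpha comes from the traceback above.

```python
def diff_variable_names(graph_names, checkpoint_names):
    """Return (missing_in_checkpoint, unused_in_graph) as sorted lists."""
    graph_set = set(graph_names)
    ckpt_set = set(checkpoint_names)
    return sorted(graph_set - ckpt_set), sorted(ckpt_set - graph_set)


def names_from_tensorflow(checkpoint_prefix):
    """Collect variable names from the default graph and from a checkpoint.

    Assumes TensorFlow 1.x and that the model graph is already built.
    """
    import tensorflow as tf  # imported lazily so the helper above runs without TF

    graph_names = [v.op.name for v in tf.global_variables()]
    checkpoint_names = [name for name, _shape in tf.train.list_variables(checkpoint_prefix)]
    return graph_names, checkpoint_names


if __name__ == "__main__":
    # Hypothetical name lists for illustration; with TensorFlow installed,
    # names_from_tensorflow('../MSPFN/epoch6') would supply the real ones.
    graph_names = ["generator/BCM2_0/down2_1/alpha", "generator/conv1/kernel"]
    checkpoint_names = ["generator/conv1/kernel"]
    missing, unused = diff_variable_names(graph_names, checkpoint_names)
    print("in graph but not in checkpoint:", missing)
    print("in checkpoint but not in graph:", unused)
```

If the first list is not empty, the test script builds variables that the training script never saved, which usually means the two scripts construct the model differently.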

Stonebobo commented 4 years ago

Thank you for your reply. My training set has 3600 images (481*321) from the Rain100L_new_version dataset of CVPR 2017. The learning rate has not been changed; I use the original value from your code, start_learning_rate = 5e-4. Q3 may be caused by my dataset being much smaller than the one used in your paper. Thank you! I will try again on Q1.
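
As a rough sanity check on Q3, the numbers reported in this thread give the count of optimizer updates performed by epoch 5. This is only arithmetic on the figures above; it assumes one update per batch, which has not been checked against the training script.

```python
import math

num_samples = 3600  # training images reported in the thread
batch_size = 12     # batch size reported in the thread
epochs = 5          # epoch at which the losses looked flat

# Assumption: one optimizer update per batch.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

print("updates per epoch:", steps_per_epoch)
print("updates after", epochs, "epochs:", total_steps)
```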

kuijiang94 / MSPFN
