sampepose / flownet2-tf

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
MIT License

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. #81

Closed fighttiger25 closed 5 years ago

fighttiger25 commented 5 years ago

Sorry, could anyone explain how to configure flownet2 to run on the GPU only, in order to fix this bug?

My environment is Python 2.7, CUDA 10, tensorflow 1.12.0, without cuDNN installed, on Ubuntu 18.

BTW, I can run flownet_s and flownet_sd, but the other networks fail.

The full log is as follows:

WARNING:tensorflow:From src/net.py:22: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From src/flownet_cs/flownet_cs.py:26: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-02-22 11:33:06.314602: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "main", fname, loader, pkg_name)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/chenyang/Downloads/flownet2-tf-master/src/flownet2/test.py", line 51, in main()
  File "/home/chenyang/Downloads/flownet2-tf-master/src/flownet2/test.py", line 18, in main
    out_path=FLAGS.out,
  File "src/net.py", line 68, in test
    saver.restore(sess, checkpoint)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'Correlation' with these attrs. Registered devices: [CPU,XLA_CPU], Registered kernels: device='GPU'

 [[node FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/Correlation (defined at <string>:54)  = Correlation[kernel_size=1, max_displacement=20, pad=20, stride_1=1, stride_2=2, _device="/device:CPU:0"](FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/conv3/lrelu/add, FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/conv3_1/lrelu/add)]]

Caused by op u'FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/Correlation', defined at:
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "main", fname, loader, pkg_name)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/chenyang/Downloads/flownet2-tf-master/src/flownet2/test.py", line 51, in main()
  File "/home/chenyang/Downloads/flownet2-tf-master/src/flownet2/test.py", line 18, in main
    out_path=FLAGS.out,
  File "src/net.py", line 62, in test
    predictions = self.model(inputs, training_schedule)
  File "src/flownet2/flownet2.py", line 23, in model
    net_css_predictions = self.net_css.model(inputs, training_schedule, trainable=False)
  File "src/flownet_css/flownet_css.py", line 18, in model
    net_cs_predictions = self.net_cs.model(inputs, training_schedule, trainable=False)
  File "src/flownet_cs/flownet_cs.py", line 18, in model
    net_c_predictions = self.net_c.model(inputs, training_schedule, trainable=False)
  File "src/flownet_c/flownet_c.py", line 40, in model
    cc = correlation(conv_a_3, conv_b_3, 1, 20, 1, 2, 20)
  File "src/correlation.py", line 14, in correlation
    padding)
  File "", line 54, in correlation
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/chenyang/ENTER/envs/flownet2.0/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in init
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'Correlation' with these attrs. Registered devices: [CPU,XLA_CPU], Registered kernels: device='GPU'

 [[node FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/Correlation (defined at <string>:54)  = Correlation[kernel_size=1, max_displacement=20, pad=20, stride_1=1, stride_2=2, _device="/device:CPU:0"](FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/conv3/lrelu/add, FlowNet2/FlowNetCSS/FlowNetCS/FlowNetC/conv3_1/lrelu/add)]]
Iamanorange commented 5 years ago

see above for traceback

Can you post the above log?

fighttiger25 commented 5 years ago

I have updated the post with the log. Thanks in advance for your help.

Iamanorange commented 5 years ago

No OpKernel was registered to support Op 'Correlation' with these attrs. Registered devices: [CPU,XLA_CPU], Registered kernels: device='GPU' [[node blahblah]]

That means you have an Op 'Correlation' that only has a GPU kernel registered, but tensorflow can't find a GPU (the only registered devices are CPU and XLA_CPU). Have you ever run tensorflow on a GPU?
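The reading of the error above can be checked mechanically. The following is a minimal sketch in plain Python, not part of flownet2-tf or TensorFlow; the function names `parse_op_kernel_error` and `missing_devices` are hypothetical, and it assumes the message keeps the wording shown in the log.

```python
import re

# The message text is copied from the log in this thread.
ERROR = (
    "No OpKernel was registered to support Op 'Correlation' with these attrs. "
    "Registered devices: [CPU,XLA_CPU], Registered kernels: device='GPU'"
)


def parse_op_kernel_error(message):
    """Return (op name, devices TensorFlow found, devices the op has kernels for)."""
    op = re.search(r"support Op '([^']+)'", message)
    devices = re.search(r"Registered devices: \[([^\]]*)\]", message)
    if op is None or devices is None or "Registered kernels:" not in message:
        raise ValueError("not a 'No OpKernel was registered' message")
    found = [d.strip() for d in devices.group(1).split(",") if d.strip()]
    # Only look at the text after "Registered kernels:" for kernel devices.
    kernel_part = message.split("Registered kernels:", 1)[1]
    kernels = re.findall(r"device='([^']+)'", kernel_part)
    return op.group(1), found, kernels


def missing_devices(message):
    """Devices the op needs a kernel on that TensorFlow did not find."""
    _, found, kernels = parse_op_kernel_error(message)
    return [k for k in kernels if k not in found]


if __name__ == "__main__":
    op_name, found, kernels = parse_op_kernel_error(ERROR)
    print("op:", op_name)
    print("devices found:", found)
    print("kernels exist for:", kernels)
    print("missing:", missing_devices(ERROR))
```

For the message in this issue the missing device is the GPU, which matches the diagnosis: the op itself is fine, but the TensorFlow installation does not see a GPU.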

fighttiger25 commented 5 years ago

Yes, it worked before on Windows 10 with a GTX 1050 Ti, but I have now changed the operating system to Ubuntu 18.04 on the same device. The environment is Python 2.7, CUDA 10, tensorflow 1.12.0, without cuDNN installed.

Iamanorange commented 5 years ago

You can run this Flownet2 on Windows? That's interesting.

I suppose you haven't run tensorflow on the GPU on your Ubuntu system yet. tf-1.12 requires CUDA 9, while tf-1.13 requires CUDA 10. Make sure your versions are compatible. You also need to install cuDNN and tensorflow-gpu.
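The version pairing above can be written down as a small lookup. This is only a sketch covering the two pairings named in this thread (they apply to the prebuilt tensorflow-gpu packages); the table `REQUIRED_CUDA_MAJOR` and the function `cuda_matches` are hypothetical names, not a TensorFlow API.

```python
# TensorFlow release series -> CUDA major version, as stated in this thread.
REQUIRED_CUDA_MAJOR = {"1.12": "9", "1.13": "10"}


def cuda_matches(tf_version, cuda_version):
    """Return True/False for known TensorFlow series, None if the series is not in the table."""
    series = ".".join(tf_version.split(".")[:2])   # "1.12.0" -> "1.12"
    required = REQUIRED_CUDA_MAJOR.get(series)
    if required is None:
        return None
    return cuda_version.split(".")[0] == required  # compare major versions only


if __name__ == "__main__":
    # The original poster's setup: tensorflow 1.12.0 with CUDA 10.
    print(cuda_matches("1.12.0", "10"))
    # The suggested setup: tensorflow 1.12 with CUDA 9.
    print(cuda_matches("1.12.0", "9.0"))
```

The check compares major versions only, which is all the thread states; it says nothing about cuDNN versions.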

fighttiger25 commented 5 years ago

No, I meant other programs. I changed to Ubuntu because FlowNet2 doesn't work on Windows. Anyway, I will try your suggestions.

Iamanorange commented 5 years ago

Install CUDA 9. Install cuDNN for CUDA 9. Install tensorflow-gpu<=1.12. Then run make. You may hit errors that others have already encountered, so search the issues first.

fighttiger25 commented 5 years ago

Thanks very much. I can run flownet2.0 now. My environment is CUDA 10 installed locally, plus cudatoolkit 9.0, cudnn 7.3.1, and tensorflow-gpu 1.12 in the conda env.
However, flownet2.0 performs very poorly and I can't reproduce the flow image shown in the README. What's more, the output image varies randomly between runs with the same input images. The log is as follows:

(flownet2.0-gpu-v2) chenyang@chenyang-Inspiron-15-7000-Gaming:~/Downloads/flownet2-tf-master$ python -m src.flownet2.test --input_a data/samples/0img0.ppm --input_b data/samples/0img1.ppm --out ./
WARNING:tensorflow:From src/net.py:22: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From src/flownet_cs/flownet_cs.py:26: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-02-28 19:30:19.330567: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-28 19:30:19.405113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-28 19:30:19.405529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.64GiB
2019-02-28 19:30:19.405543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-28 19:30:19.629411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-28 19:30:19.629443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-28 19:30:19.629449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-28 19:30:19.629649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3364 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-02-28 19:30:24.964743: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.58GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
max flow: 153.2417
flow range: u = -40.339 .. 22.779 v = -1.128 .. 152.825

(flownet2.0-gpu-v2) chenyang@chenyang-Inspiron-15-7000-Gaming:~/Downloads/flownet2-tf-master$ python -m src.flownet2.test --input_a data/samples/0img0.ppm --input_b data/samples/0img1.ppm --out ./
WARNING:tensorflow:From src/net.py:22: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From src/flownet_cs/flownet_cs.py:26: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-02-28 19:30:30.091185: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-28 19:30:30.167325: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-28 19:30:30.167742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.64GiB
2019-02-28 19:30:30.167757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-28 19:30:30.397008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-28 19:30:30.397042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-28 19:30:30.397065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-28 19:30:30.397239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3364 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-02-28 19:30:35.757919: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.58GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
max flow: 37.8530
flow range: u = -23.250 .. 24.644 v = -32.727 .. 33.248

You can see that the two runs give totally different 'max flow' and 'flow range' results for the same inputs.

Is this due to the 'ran out of memory' warning?

Can anyone help?

Thanks.
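For anyone who wants to check run-to-run differences like the ones above without reading the logs by eye, here is a minimal sketch that extracts the flow summary from two logs and compares them. It assumes the summary keeps the "max flow: ... flow range: u = ... v = ..." wording shown in this thread; the names `parse_flow_summary` and `same_result` and the tolerance value are hypothetical choices, not part of flownet2-tf.

```python
import re

# Matches the summary printed at the end of each run; whitespace may be spaces or newlines.
NUMBER = r"(-?\d+(?:\.\d+)?)"
SUMMARY = re.compile(
    r"max flow:\s*" + NUMBER
    + r"\s*flow range:\s*u\s*=\s*" + NUMBER + r"\s*\.\.\s*" + NUMBER
    + r"\s*v\s*=\s*" + NUMBER + r"\s*\.\.\s*" + NUMBER
)

# Summaries copied from the two runs in this thread.
RUN_1 = "max flow: 153.2417 flow range: u = -40.339 .. 22.779 v = -1.128 .. 152.825"
RUN_2 = "max flow: 37.8530 flow range: u = -23.250 .. 24.644 v = -32.727 .. 33.248"


def parse_flow_summary(log_text):
    """Return (max_flow, u_min, u_max, v_min, v_max) as floats."""
    match = SUMMARY.search(log_text)
    if match is None:
        raise ValueError("no flow summary found in log")
    return tuple(float(g) for g in match.groups())


def same_result(log_a, log_b, tolerance=1e-3):
    """True if every summary value agrees within the tolerance."""
    a, b = parse_flow_summary(log_a), parse_flow_summary(log_b)
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


if __name__ == "__main__":
    print(parse_flow_summary(RUN_1))
    print(parse_flow_summary(RUN_2))
    print("runs agree:", same_result(RUN_1, RUN_2))
```

This only compares the printed summary statistics, so it can show that two runs disagree but cannot confirm that two flow fields are identical.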

fighttiger25 commented 5 years ago

Solved by reinstalling CUDA 9.0 and cuDNN v7.2.1 locally and editing the hardcoded values.

Thanks to everyone who helped me!