princeton-vl / DeepV2D

Can't run demo with batch size 8 #10

roxanneluo commented 4 years ago

Hi, I was trying to run the demo with

python demos/demo_v2d.py --model=models/scannet.ckpt --sequence=data/demos/scannet_0

but got the following error:

2020-02-27 14:07:27.062479: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasGemmBatchedEx: CUBLAS_STATUS_NOT_SUPPORTED
2020-02-27 14:07:27.062517: E tensorflow/stream_executor/cuda/cuda_blas.cc:2574] Internal: failed BLAS call, see log for details
Traceback (most recent call last):   
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[{{node motion/PnP/einsum_1/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape_1)]]
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):   
  File "demos/demo_v2d.py", line 82, in <module>
    main(args)
  File "demos/demo_v2d.py", line 64, in main
    depths, poses = deepv2d(images, intrinsics, viz=True, iters=args.n_iters)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/deepv2d.py", line 462, in __call__
    self.update_poses(i)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/deepv2d.py", line 368, in update_poses
    self.poses, self.intrinsics, self.weights = self.sess.run(outputs, feed_dict=feed_dict)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[node motion/PnP/einsum_1/MatMul (defined at /projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/einsum.py:49)  = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape_1)]]
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'motion/PnP/einsum_1/MatMul', defined at:
  File "demos/demo_v2d.py", line 82, in <module>
    main(args)
  File "demos/demo_v2d.py", line 55, in main
    deepv2d = DeepV2D(cfg, args.model, use_fcrn=args.fcrn, is_calibrated=is_calibrated, mode=args.mode)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/deepv2d.py", line 68, in __init__
    self._build_motion_graph()
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/deepv2d.py", line 129, in _build_motion_graph
    images, depths, intrinsics, edge_inds, init=do_init)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/modules/motion.py", line 282, in forward
    Tij = Tij.keyframe_optim(target, weight, depths, intrinsics)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/geometry/transformation.py", line 364, in keyframe_optim
    J = einsum('...ij,...jk->...ik', jproj, jtran)
  File "/projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/einsum.py", line 49, in einsum
    out = tf.einsum(equation, *inputs)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/special_math_ops.py", line 257, in einsum
    axes_to_sum)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/special_math_ops.py", line 389, in _einsum_reduction
    product = math_ops.matmul(t0, t1)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2019, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1245, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/homes/grail/xuanluo/anaconda3/envs/deepv2d/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[134400,2,3], b.shape=[134400,3,6], m=2, n=6, k=3, batch_size=134400
         [[node motion/PnP/einsum_1/MatMul (defined at /projects/grail/xuanluo/telepresence/related-packages/DeepV2D/deepv2d/utils/einsum.py:49)  = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](motion/PnP/einsum_1/Reshape, motion/PnP/einsum_1/Reshape_1)]]
         [[{{node motion/PnP_2/einsum_7/Reshape_2/_2363}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5308_motion/PnP_2/einsum_7/Reshape_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

My environment is Python 3.6.7 with tensorflow-gpu 1.12.0. It seems the problem is that the batch size is too big: the demo succeeds when I only use 4 images. Can you help?
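
In case others hit the same error: one quick mitigation to try before changing versions is enabling GPU memory growth, which sometimes avoids cuBLAS initialization failures in TF 1.x. This is a minimal sketch using standard tf.ConfigProto options, not DeepV2D's own session setup, and per the replies below it is not the root cause here:

```python
# Minimal TF 1.x sketch (not DeepV2D's code): let TensorFlow allocate GPU
# memory on demand instead of grabbing it all up front, which sometimes
# avoids cuBLAS handle/initialization failures.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU memory as needed

sess = tf.Session(config=config)
```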

zachteed commented 4 years ago

This looks like a CUDA error; I don't think the batch size should matter in this case. What GPU are you using?
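
For reporting the GPU, a snippet along these lines prints the TensorFlow version and every GPU it can see; it uses only standard TF 1.x APIs, nothing DeepV2D-specific:

```python
# Print the TF version and each visible GPU, including the description
# TensorFlow reports (device name, compute capability, etc.).
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow:", tf.__version__)
for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        print(dev.name, dev.physical_device_desc)
```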

duanzhimin14 commented 4 years ago

I also have this problem. My CUDA is 9.0 with TensorFlow 1.12.0. How can I solve it?

apxlwl commented 4 years ago

@zachteed Same problem here. Do you have any solution, or which CUDA version is required?

Willyzw commented 3 years ago

Same issue for me. After some googling, it seems to be tied to the specific combination of TensorFlow 1.12 + RTX 2080. After upgrading TensorFlow from 1.12.0 to 1.14.0, along with CUDA 10.0, it finally works for me :)
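
To verify that an upgraded environment actually fixes this, a tiny standalone test of the failing op (BatchMatMul, with the exact shapes from the traceback) works as a sanity check. A sketch assuming plain TF 1.x, outside of DeepV2D:

```python
# Exercise the cuBLAS batched-GEMM path that failed above: a BatchMatMul
# with a.shape=[134400,2,3] and b.shape=[134400,3,6], as in the traceback.
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(134400, 2, 3).astype(np.float32))
b = tf.constant(np.random.rand(134400, 3, 6).astype(np.float32))
c = tf.matmul(a, b)  # 3-D inputs lower to BatchMatMul, like the einsum

with tf.Session() as sess:
    print(sess.run(c).shape)  # expect (134400, 2, 6) if cuBLAS works
```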

zlj-cs commented 1 year ago

The fix above (upgrading TensorFlow from 1.12.0 to 1.14.0 along with CUDA 10.0) works for me too, many thanks!