nie-lang / UnsupervisedDeepImageStitching

TIP2021 - Unsupervised deep image stitching network

InternalError (see above for traceback): Blas xGEMMBatched launch failed #37

Open rein1685 opened 2 years ago

rein1685 commented 2 years ago

When I execute train_H.py in ImageAlignment/Codes, an "InternalError (see above for traceback): Blas xGEMMBatched launch failed" error occurred.

File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in init self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[2,3,3], b.shape=[2,3,3], m=3, n=3, k=3, batch_size=2 [[node generator/MatMul_27 (defined at /data/ImageAlignment/Codes/H_model.py:29) ]] [[node loss/add_1 (defined at train_H.py:60) ]]

I tried a few things: reducing the batch size, checking the CUDA version, the NVIDIA driver version, etc., but I couldn't solve it.

Could you share your development environment (CUDA, cuDNN, GPU, NVIDIA driver version)?

nie-lang commented 2 years ago

CUDA 10.0
cuDNN 7.6.5
GPU: RTX 2080 Ti
NVIDIA driver: 430.34
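
If you can't match this environment exactly, note that "Blas ... launch failed" errors in TF 1.x are often a symptom of GPU memory exhaustion. A generic mitigation (just a sketch, not code from this repo) is to let the session allocate GPU memory on demand:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front,
# which can avoid Blas launch failures caused by memory exhaustion.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build the graph and run training as usual
```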

rein1685 commented 2 years ago

Thanks a million! I solved it!

But I ran into a problem with the loss. I changed the batch size from 4 to 2, and when I train the ImageAlignment model, the loss becomes NaN after 400,000 epochs.

I wonder if changing the batch size could be the cause. Have you run into this kind of problem before? If so, could you tell me how you resolved it?

nie-lang commented 2 years ago

In fact, we were able to train the model as expected. Before releasing this repository, I retrained the network and it worked well.

But some other users have also reported this problem. It might be caused by the mask in the loss function: for an all-zero mask, the loss can reach 0. I'm not sure about the exact cause.
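
As a rough illustration of how the mask could go wrong (hypothetical names, not the actual loss code in this repo): if the masked loss were normalized by the mask sum, an all-zero mask would turn it into 0/0 = NaN, and guarding the denominator would avoid that.

```python
import tensorflow as tf

def masked_l1_loss(pred, target, mask, eps=1e-6):
    # Hypothetical masked L1 loss. With eps = 0, an all-zero mask makes
    # both the numerator and the denominator 0, so the loss is 0/0 = NaN.
    error = tf.abs(pred - target) * mask
    return tf.reduce_sum(error) / (tf.reduce_sum(mask) + eps)
```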

You can refer to some common tips for avoiding gradient explosion to relieve this problem, such as gradient clipping.
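
With the TF 1.x API, gradient clipping can be wired in roughly like this (a minimal sketch with placeholder names; substitute the actual loss tensor and optimizer from train_H.py):

```python
import tensorflow as tf

# Placeholder stand-ins for the real loss and trainable variables.
w = tf.Variable(tf.random_normal([3, 3]))
total_loss = tf.reduce_sum(tf.square(w))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
grads_and_vars = optimizer.compute_gradients(total_loss)
grads, variables = zip(*grads_and_vars)
# Rescale all gradients together whenever their global norm exceeds 5.0.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))
```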

rein1685 commented 2 years ago

I tried several approaches, but I couldn't solve it.

Would you share your pretrained model on Google Drive?

Sorry to keep asking for help.

nie-lang commented 2 years ago

Sorry to hear that.

The pretrained models can be found in the "Testing" parts of ImageAlignment.md and ImageReconstruction.md. Please carefully check these two files.

baojunqi commented 2 years ago

Hi, I have the same problem. Could you tell me how you solved it?