Open rein1685 opened 2 years ago
CUDA 10.0, cuDNN 7.6.5, GPU: 2080 Ti, NVIDIA driver 430.34
Thanks a million! I solved it!
But I ran into a problem with the loss. I changed the batch size from 4 to 2, and when I train the ImageAlignment model, the loss becomes NaN after 400,000 epochs.
I wonder if changing the batch size could be the cause. Have you run into this problem before? If so, could you tell me how you fixed it?
In fact, we can train the model as expected. Before I release this repository, I retrained the network and it works well.
But some users have also reported this problem. It might be caused by the mask in the loss function: for an all-zero mask, the loss can reach 0. I'm not sure about this.
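The all-zero-mask failure mode is easy to reproduce: if the masked loss is averaged by dividing by the mask's sum, an empty mask divides by zero and the loss (and its gradients) become NaN. A minimal NumPy sketch of this idea, assuming the loss is normalized by the mask sum (this normalization is my assumption, not necessarily how the repository implements it):

```python
import numpy as np

def masked_l1(pred, target, mask, eps=1e-6):
    """Average L1 error over the masked region.

    Dividing by mask.sum() alone produces NaN (0/0) when the mask is
    all zeros; the small eps term keeps the loss finite in that case.
    """
    err = np.abs(pred - target) * mask
    return err.sum() / (mask.sum() + eps)

pred = np.ones((2, 4, 4))
target = np.zeros((2, 4, 4))
full_mask = np.ones((2, 4, 4))
empty_mask = np.zeros((2, 4, 4))

print(masked_l1(pred, target, full_mask))   # close to 1.0
print(masked_l1(pred, target, empty_mask))  # 0.0 instead of NaN
```

Without the `eps` guard, the second call would return NaN and poison every subsequent gradient update.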
You can refer to some common tips to avoid gradient explosion to relieve this problem, such as gradient clipping, etc.
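For reference, global-norm gradient clipping rescales all gradients jointly so their combined L2 norm never exceeds a threshold. Here is a small NumPy sketch of that rule (an illustration of the technique only, not code from this repository; in the TF1 codebase the equivalent would be `tf.clip_by_global_norm` applied between `compute_gradients` and `apply_gradients`):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at
    most clip_norm (the same rule tf.clip_by_global_norm applies)."""
    global_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # scale <= 1
    return [g * scale for g in grads], global_norm

# Two fake gradient tensors with a joint norm of sqrt(84) ~ 9.17.
grads = [np.full(3, 4.0), np.full(4, 3.0)]
clipped, norm_before = clip_by_global_norm(grads, 5.0)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
print(norm_before, norm_after)  # ~9.17 rescaled down to 5.0
```

Because all tensors share one scale factor, clipping preserves the gradient's direction while bounding its magnitude, which is why it helps against exploding gradients.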
I tried several approaches, but I couldn't solve it.
Would you share your pretrained model on Google Drive?
Sorry to keep asking for help.
Sorry to hear that.
The pretrained models can be found in the "Testing" parts of ImageAlignment.md and ImageReconstruction.md. Please carefully check these two files.
Hi, I have the same question. Could you tell me how you solved it?
When I execute train_H.py in ImageAlignment/Codes, I get the error "InternalError (see above for traceback): Blas xGEMMBatched launch failed".
```
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
  return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
  op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
  self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[2,3,3], b.shape=[2,3,3], m=3, n=3, k=3, batch_size=2
  [[node generator/MatMul_27 (defined at /data/ImageAlignment/Codes/H_model.py:29) ]]
  [[node loss/add_1 (defined at train_H.py:60) ]]
```
I tried a few things: reducing the batch size, checking the CUDA version, the NVIDIA driver version, etc., but I couldn't solve it.
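For what it's worth, "Blas ... launch failed" errors in TensorFlow 1.x are often a symptom of the GPU running out of memory when the cuBLAS kernel launches (for example because TF pre-allocates nearly all device memory, or another process is already holding the card). One common workaround, sketched here as an assumption rather than a confirmed fix for this repository, is to let TensorFlow allocate GPU memory on demand:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Grow the GPU allocation on demand instead of reserving
# (nearly) all device memory when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```

If train_H.py creates its session elsewhere, the same `ConfigProto` can be passed to that `tf.Session(...)` call.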
Could you share your development environment (CUDA, cuDNN, GPU, NVIDIA driver version)?