deconvolution-w closed this issue 2 years ago
Hello! This is my question, I hope you can answer it. Thanks!
Thanks for reporting this @deconvolution-w. Can you please copy the full error message here so we can debug the problem?
Also, were you running the code in the nn4dms conda environment created with environment.yml?
Thank you for your reply. Yes, I am running the code in the nn4dms conda environment created with environment_gpu.yml.
Traceback (most recent call last):
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
[[{{node MatMul_2}}]]
[[Adam/update/_20]]
(1) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
[[{{node MatMul_2}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "code/regression.py", line 594, in <module>
main(parsed_args)
File "code/regression.py", line 583, in main
evaluations = run_training(data, log_dir, args)
File "code/regression.py", line 293, in run_training
avg_step_duration = run_training_epoch(sess, args, igraph, tgraph, data, epoch, step_display_interval)
File "code/regression.py", line 168, in run_training_epoch
_, train_loss_value = sess.run([tgraph["train_op"], tgraph["loss"]], feed_dict=feed_dict)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
[[node MatMul_2 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:33) ]]
[[Adam/update/_20]]
(1) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
[[node MatMul_2 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:33) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node MatMul_2:
Tile (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:32)
Reshape_3 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:29)
Input Source operations connected to node MatMul_2:
Tile (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:32)
Reshape_3 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:29)
Original stack trace for 'MatMul_2':
File "code/regression.py", line 594, in <module>
main(parsed_args)
File "code/regression.py", line 583, in main
evaluations = run_training(data, log_dir, args)
File "code/regression.py", line 260, in run_training
igraph, tgraph = build_graph_from_args_dict(args, encoded_data_shape=ed["train"].shape, reset_graph=False)
File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 209, in build_graph_from_args_dict
inf_graph = build_inference_graph(args, encoded_data_shape)
File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 163, in build_inference_graph
predictions = bg_inference(args["net_file"], adj_mtx, ph_inputs_dict)
File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 46, in bg_inference
layer = parsed_spec["layer_func"](**parsed_spec["arguments"])
File "/mnt/sda1/wh/nn4dms/code/my_pipgcn.py", line 33, in node_average_gc
neighbor_signals_sep),
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2609, in matmul
return batch_mat_mul_fn(a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1677, in batch_mat_mul_v2
"BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Thanks for the details, we'll look into this.
Because you are running on a GPU, can you also please tell us what GPU you are using? If it has less memory than the GPUs we tested with, that might be relevant.
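For anyone answering the same question, here is a minimal sketch of one way to report the GPU model and its total memory. It assumes the NVIDIA driver's nvidia-smi tool is on the PATH. The helper names parse_gpu_line and list_gpus are hypothetical and are not part of nn4dms.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Split one 'name, memory' CSV line from nvidia-smi into a tuple."""
    # rsplit keeps any commas inside the GPU name intact.
    name, memory = (field.strip() for field in line.rsplit(",", 1))
    return name, memory


def list_gpus():
    """Return (name, total memory) per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # Python 3.6 compatible spelling of text=True
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # The tool exists but could not talk to a driver or GPU.
        return []
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


for gpu_name, gpu_memory in list_gpus():
    print(gpu_name, gpu_memory)
```

On a machine without an NVIDIA driver the loop prints nothing, which is itself useful information to include in a report.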
Thanks, I will try running on the CPU. Our GPU is an A6000.
The problem is most likely that the A6000 is too new for our GPU environment. This environment uses TensorFlow 1.14 and CUDA 10.0. My recollection is that Ampere-generation GPUs don't support a version of CUDA that old. A brief search led to a related Stack Overflow post.
I ran into a similar error (InternalError: Blas GEMM launch failed) with an A100 trying to run a TensorFlow 1.14 example with CUDA 10.1: https://github.com/CHTC/templates-GPUs/issues/10
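The compatibility argument above can be sketched as a small lookup. The table is an assumption, summarized from NVIDIA's published compute capability and CUDA release information. It is not taken from the nn4dms repository and should be verified before relying on it. The function name cuda_supports is hypothetical.

```python
# Minimum CUDA toolkit version that added support for each compute capability.
# Assumed values, summarized from NVIDIA's public documentation.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080 Ti
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX A6000
}


def cuda_supports(cuda_version, capability):
    """Return True if the CUDA toolkit version is new enough for the GPU."""
    # Versions are (major, minor) tuples, so tuple comparison orders them correctly.
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[capability]


print(cuda_supports((10, 0), (8, 6)))  # A6000 with CUDA 10.0 -> False
print(cuda_supports((10, 1), (8, 0)))  # A100 with CUDA 10.1 -> False
print(cuda_supports((10, 0), (7, 5)))  # Turing with CUDA 10.0 -> True
```

Under these assumptions, both failing setups mentioned in this thread pair an Ampere GPU with a CUDA 10.x toolkit that predates it.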
Thanks, I have run it successfully on CPUs.
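For readers who want to reproduce a CPU-only run without switching conda environments, one common approach is to hide the GPUs from CUDA before TensorFlow is imported. This is a minimal sketch and not necessarily how the run above was done. The helper name hide_gpus is hypothetical.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries that are imported afterwards."""
    # "-1" is not a valid device index, so CUDA sees no GPUs.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can usually be had from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES="".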
It's good news that it runs correctly on CPUs, and it makes me more confident that the problem is the combination of the A6000 and the CUDA version. This Wikipedia page also supports that: https://en.wikipedia.org/wiki/CUDA#GPUs_supported
I opened a pull request to clarify in the readme which GPUs are supported. We don't plan to update this TensorFlow 1.x code to support newer GPUs. We are instead developing updated models in PyTorch and plan to release those within a few months.
This is exciting news. I look forward to your new open source code~
Running the following command reports an error: python code/regression.py @pub/regression_args/ube4b_main_gcn.txt
Other datasets with GCN also produce the same error.