gitter-lab / nn4dms

Neural networks for deep mutational scanning data
MIT License

Some problems of GCN #4

deconvolution-w closed 2 years ago

deconvolution-w commented 2 years ago

Running the following command reports an error: python code/regression.py @pub/regression_args/ube4b_main_gcn.txt

The other datasets fail with the same error when run with the GCN.

deconvolution-w commented 2 years ago

Hello! This is my question, I hope you can answer it. Thanks!

agitter commented 2 years ago

Thanks for reporting this @deconvolution-w. Can you please copy the full error message here so we can debug the problem?

Also, were you running the code in the nn4dms conda environment created with environment.yml?

deconvolution-w commented 2 years ago

Thank you for your reply. Yes, I am running the code in the nn4dms conda environment created with environment_gpu.yml.

Traceback (most recent call last):
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
         [[{{node MatMul_2}}]]
         [[Adam/update/_20]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
         [[{{node MatMul_2}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code/regression.py", line 594, in <module>
    main(parsed_args)
  File "code/regression.py", line 583, in main
    evaluations = run_training(data, log_dir, args)
  File "code/regression.py", line 293, in run_training
    avg_step_duration = run_training_epoch(sess, args, igraph, tgraph, data, epoch, step_display_interval)
  File "code/regression.py", line 168, in run_training_epoch
    _, train_loss_value = sess.run([tgraph["train_op"], tgraph["loss"]], feed_dict=feed_dict)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
         [[node MatMul_2 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:33) ]]
         [[Adam/update/_20]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[128,102,102], b.shape=[128,102,128], m=102, n=128, k=102, batch_size=128
         [[node MatMul_2 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:33) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node MatMul_2:
 Tile (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:32)
 Reshape_3 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:29)

Input Source operations connected to node MatMul_2:
 Tile (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:32)
 Reshape_3 (defined at /mnt/sda1/wh/nn4dms/code/my_pipgcn.py:29)

Original stack trace for 'MatMul_2':
  File "code/regression.py", line 594, in <module>
    main(parsed_args)
  File "code/regression.py", line 583, in main
    evaluations = run_training(data, log_dir, args)
  File "code/regression.py", line 260, in run_training
    igraph, tgraph = build_graph_from_args_dict(args, encoded_data_shape=ed["train"].shape, reset_graph=False)
  File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 209, in build_graph_from_args_dict
    inf_graph = build_inference_graph(args, encoded_data_shape)
  File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 163, in build_inference_graph
    predictions = bg_inference(args["net_file"], adj_mtx, ph_inputs_dict)
  File "/mnt/sda1/wh/nn4dms/code/build_tf_model.py", line 46, in bg_inference
    layer = parsed_spec["layer_func"](**parsed_spec["arguments"])
  File "/mnt/sda1/wh/nn4dms/code/my_pipgcn.py", line 33, in node_average_gc
    neighbor_signals_sep),
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2609, in matmul
    return batch_mat_mul_fn(a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1677, in batch_mat_mul_v2
    "BatchMatMulV2", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/wh/anaconda3/envs/nn4dms/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

agitter commented 2 years ago

Thanks for the details, we'll look into this.

Because you are running on a GPU, can you also please tell us what GPU you are using? If it has less memory than the GPUs we tested with, that might be relevant.
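For anyone unsure how to look this up, here is a minimal sketch that asks nvidia-smi for the GPU name and total memory. It assumes the NVIDIA driver is installed and nvidia-smi is on the PATH; the helper names parse_gpu_query and query_gpus are made up for this example and are not part of nn4dms.

```python
import subprocess


def parse_gpu_query(text):
    """Parse 'name, memory' CSV lines into a list of (name, memory) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first comma only: the name comes first, memory second.
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def query_gpus():
    """Return [(name, total_memory), ...], or [] if nvidia-smi is unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]
    try:
        # stdout=PIPE with universal_newlines=True also works on Python 3.6.
        result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                                universal_newlines=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_query(result.stdout)
```

Calling query_gpus() on the training machine should return one tuple per visible GPU, and an empty list if nvidia-smi cannot be run.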

deconvolution-w commented 2 years ago

Thanks, I will try running on the CPU. Our GPU is an A6000.

agitter commented 2 years ago

The problem is most likely that the A6000 is too new for our GPU environment. This environment uses TensorFlow 1.14 and CUDA 10.0. My recollection is that Ampere-generation GPUs don't support a version of CUDA that old. A brief search led to a related Stack Overflow post.

I ran into a similar error (InternalError: Blas GEMM launch failed) with an A100 when trying to run a TensorFlow 1.14 example with CUDA 10.1: https://github.com/CHTC/templates-GPUs/issues/10
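The compatibility reasoning can be sketched as a small lookup. This is a rule-of-thumb illustration, not code from nn4dms: the tables are hand-copied assumptions (compute capability 8.6 for the A6000, 8.0 for the A100, 7.0 for the V100, and the highest capability each listed CUDA release targets), so please check them against NVIDIA's documentation before relying on them. The function name cuda_supports_gpu is hypothetical.

```python
# Highest compute capability each CUDA toolkit release can target
# (assumed values, listed here for illustration only).
MAX_COMPUTE_CAPABILITY = {
    "10.0": (7, 5),  # up to Turing
    "10.1": (7, 5),
    "11.0": (8, 0),  # adds A100-class Ampere
    "11.1": (8, 6),  # adds A6000-class Ampere
}

# Compute capability of a few GPUs (assumed values, for illustration only).
GPU_COMPUTE_CAPABILITY = {
    "V100": (7, 0),
    "A100": (8, 0),
    "A6000": (8, 6),
}


def cuda_supports_gpu(cuda_version, gpu):
    """Return True if the CUDA release can target the GPU's compute capability."""
    # Tuples compare element by element, so (8, 6) <= (7, 5) is False.
    return GPU_COMPUTE_CAPABILITY[gpu] <= MAX_COMPUTE_CAPABILITY[cuda_version]


print(cuda_supports_gpu("10.0", "A6000"))
print(cuda_supports_gpu("10.1", "A100"))
print(cuda_supports_gpu("10.0", "V100"))
```

Under these assumed tables, both failing combinations from this thread (CUDA 10.0 with an A6000, CUDA 10.1 with an A100) come out as unsupported, while a Volta-generation V100 is within range for CUDA 10.0.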

deconvolution-w commented 2 years ago

Thanks, I have now run it successfully on CPUs.
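The thread does not say how the CPU run was set up. One common way to force a CPU-only run from a GPU environment, shown here as a minimal sketch, is to hide the CUDA devices before TensorFlow is imported. The helper name hide_gpus is made up for this example, and placing the call at the top of the entry script is an assumption rather than something nn4dms does.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries that are imported afterwards."""
    # An empty value means no GPU is visible to the CUDA runtime.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()
# Import TensorFlow only after this point so that it sees no GPUs, e.g.:
# import tensorflow as tf
```

The variable has to be set before TensorFlow initializes CUDA, which is why the call comes before the import. Creating the CPU conda environment from environment.yml is the other obvious route.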

agitter commented 2 years ago

It's good news that it runs correctly on CPUs, and it makes me more confident that the problem is the A6000 and the CUDA version. This Wikipedia page also supports that: https://en.wikipedia.org/wiki/CUDA#GPUs_supported

I opened a pull request to clarify in the readme which GPUs are supported. We don't plan to update this TensorFlow 1.x code to support newer GPUs. We are instead developing updated models in PyTorch and plan to release those within a few months.

deconvolution-w commented 2 years ago

This is exciting news. I look forward to your new open source code~