nyu-dl / dl4chem-mgm

BSD 3-Clause "New" or "Revised" License
69 stars 10 forks

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #5

Closed: glard closed this issue 1 year ago

glard commented 2 years ago

Using backend: pytorch
/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
(the same FutureWarning repeats for _np_quint8, _np_qint16, _np_quint16, _np_qint32 and np_resource, both in tensorflow/python/framework/dtypes.py and in tensorboard/compat/tensorflow_stub/dtypes.py)
WARNING:tensorflow:From /home/glard/doping/dl4chem-mgm/src/model/graph_generator.py:19: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/glard/doping/dl4chem-mgm/src/model/graph_generator.py:21: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2022-01-18 13:35:06.430977: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-01-18 13:35:06.451785: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2022-01-18 13:35:06.452153: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cb0b482d0 executing computations on platform Host. Devices:
2022-01-18 13:35:06.452164: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2022-01-18 13:35:06.452267: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-01-18 13:35:06.462302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:06.462441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.815 pciBusID: 0000:01:00.0
2022-01-18 13:35:06.462471: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-01-18 13:35:06.462490: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-01-18 13:35:06.462505: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2022-01-18 13:35:06.462519: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2022-01-18 13:35:06.683946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2022-01-18 13:35:06.684160: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2022-01-18 13:35:07.202372: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-01-18 13:35:07.202642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:07.203193: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:35:07.203657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2022-01-18 13:43:36.841675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-01-18 13:43:36.841695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2022-01-18 13:43:36.841703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2022-01-18 13:43:36.841802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.841928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.842032: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-18 13:43:36.842127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6878 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-01-18 13:43:36.843210: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cd6a94070 executing computations on platform CUDA. Devices:
2022-01-18 13:43:36.843222: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
INFO - 01/18/22 13:43:37 - 0:00:00 - ============ Initialized logger ============
INFO - 01/18/22 13:43:37 - 0:00:00 - Random seed is 0
INFO - 01/18/22 13:43:37 - 0:00:00 - ar: False batch_size: 16 binary_classification: False bound_edges: False check_pred_validity: False clip_grad_norm: 10.0 cond_virtual_node: False data_path: data/QM9/QM9_processed.p debug_fixed: False debug_small: False decay_start_iter: 99999999 dim_h: 2048 dim_k: 1 do_not_corrupt: False dump_path: dumped/ edge_mask_frac: 1.0 edge_mask_predict_frac: 1.0 edge_replace_frac: 0.0 edge_replace_predict_frac: 1.0 edge_target_frac: 0.2 edges_per_batch: -1 embed_hs: False equalise: False exp_id: exp_name: QM9_experiment first_iter: 0 force_mask_predict: True force_replace_predict: False fully_connected: False gen_num_iters: 10 gen_num_samples: 0 gen_predict_deterministically: False gen_random_init: False global_connection: False grad_accum_iters: 1 graph2binary_properties_path: data/proteins/pdb_golabels.p graph_properties_path: graph_property_names: [] graph_type: QM9 layer_norm: True load_best: False load_latest: False local_cpu: False log_train_steps: 200 loss_normalisation_type: by_component lr_decay_amount: 0.0 lr_decay_frac: 1.0 lr_decay_interval: 9999999 mask_all_ring_properties: False mask_independently: True mat_N: 2 mat_d_model: 64 mat_dropout: 0.1 mat_h: 8 max_charge: 1 max_epoch: 100000 max_hs: 4 max_nodes: 9 max_steps: 10000000.0 max_target_frac: 0.8 min_charge: -1 min_lr: 0.0 model_name: GraphNN mpnn_name: EdgesFromNodesMPNN mpnn_steps: 4 no_edge_present_type: zeros no_save: False no_update: False node_mask_frac: 1.0 node_mask_predict_frac: 1.0 node_mpnn_name: NbrEWMultMPNN node_replace_frac: 0.0 node_replace_predict_frac: 1.0 node_target_frac: 0.2 normalise_graph_properties: False num_batches: 4 num_binary_graph_properties: 0 num_edge_types: 5 num_epochs: 200 num_graph_properties: 0 num_mpnns: 1 num_node_types: 5 optimizer: adam,lr=0.0001 perturbation_batch_size: 32 perturbation_edges_per_batch: -1 predict_graph_properties: False prediction_data_structs: all pretrained_property_embeddings_path: data/proteins/preprocessed_go_embeddings.npy property_type: None res_conn: False save_all: False seed: 0 seq_output_dim: 768 share_embed: False shuffle: True smiles_path: None smiles_train_split: 0.8 spatial_msg_res_conn: True spatial_postgru_res_conn: False suppress_params: False suppress_train_log: False target_data_structs: both target_frac_inc_after: None target_frac_inc_amount: 0 target_frac_type: random tensorboard: True update_edges_at_end_only: False use_newest_edges: False use_smiles: False val_after: 105 val_batch_size: 2500 val_data_path: data/ChEMBL/ChEMBL_val_processed_hs.p val_dataset_size: -1 val_edge_target_frac: 0.1 val_edges_per_batch: None val_graph2binary_properties_path: None val_graph_properties_path: data/ChEMBL/ChEMBL_val_graph_properties.p val_node_target_frac: 0.1 val_seed: 0 validate_on_train: False warm_up_iters: 1.0 weighted_loss: False
INFO - 01/18/22 13:43:37 - 0:00:00 - Running command: python train.py --data_path data/QM9/QM9_processed.p --graph_type QM9 --exp_name QM9_experiment --num_node_types 5 --num_edge_types 5 --max_nodes 9 --layer_norm --spatial_msg_res_conn --batch_size 16 --val_batch_size 2500 --val_after 105 --num_epochs 200 --shuffle --mask_independently --force_mask_predict --optimizer adam,lr=0.0001 --tensorboard
INFO - 01/18/22 13:43:37 - 0:00:00 - The experiment will be stored in dumped/QM9_experiment

INFO - 01/18/22 13:43:43 - 0:00:06 - train_loader len is 6651
INFO - 01/18/22 13:43:43 - 0:00:06 - val_loader len is 11
Starting epoch 1
0
Traceback (most recent call last):
  File "train.py", line 322, in <module>
    main(params)
  File "train.py", line 129, in main
    binary_graph_properties)
  File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glard/doping/dl4chem-mgm/src/model/gnn.py", line 186, in forward
    batch_init_graph = self.mpnns[mpnn_num](…)
  File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glard/doping/dl4chem-mgm/src/model/mpnns.py", line 56, in forward
    updated_nodes, updated_edges = self.mpnn_step_forward(batch_graph, step_num)
  File "/home/glard/doping/dl4chem-mgm/src/model/mpnns.py", line 77, in mpnn_step_forward_nonfc
    updated_nodes = self.node_mpnn(batch_graph)
  File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glard/doping/dl4chem-mgm/src/model/node_mpnns.py", line 36, in forward
    nodes = self.update_GRU(msg, g.ndata['nodes'])
  File "/home/glard/doping/dl4chem-mgm/src/model/node_mpnns.py", line 23, in update_GRU
    _, node_next = self.gru(msg, node)
  File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/glard/anaconda3/envs/self/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 716, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
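A quick way to check whether the failure is in cuDNN itself rather than in this repo's model code is to run a bare `torch.nn.GRU` forward pass on the GPU, since the traceback ends inside PyTorch's cuDNN-backed RNN kernel. This is only a sketch with arbitrary sizes, not the repo's configuration:

```python
import torch

# Minimal GRU forward pass; on a GPU this dispatches to the same cuDNN
# RNN kernel the traceback above ends in. Sizes here are arbitrary.
device = "cuda" if torch.cuda.is_available() else "cpu"
gru = torch.nn.GRU(input_size=16, hidden_size=16, batch_first=True).to(device)
x = torch.randn(4, 10, 16, device=device)

# On a broken CUDA/cuDNN setup this line raises
# "RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED".
out, h = gru(x)
print(out.shape, h.shape)  # (4, 10, 16) and (1, 4, 16)
```

If this snippet already fails, the problem is the PyTorch/CUDA/cuDNN installation rather than anything in dl4chem-mgm.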

omarnmahmood commented 2 years ago

Please make sure that your versions of CUDA and PyTorch are compatible with each other.
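For context: the log above shows TensorFlow loading libcudart.so.10.0 and libcudnn.so.7 while the GPU is an RTX 3070 (compute capability 8.6), and Ampere GPUs are only supported from CUDA 11.x / cuDNN 8 onward, so a PyTorch build against CUDA ≥ 11 is likely needed. One way to inspect the combination actually in use (a diagnostic sketch; `torch.version.cuda` reports the CUDA version the PyTorch wheel was built against, which is what must match the GPU, independently of any system-wide CUDA toolkit):

```python
import torch

# Versions PyTorch was built against, plus the GPU's compute capability.
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)         # None on CPU-only builds
print("bundled cuDNN:", torch.backends.cudnn.version())  # None on CPU-only builds
if torch.cuda.is_available():
    # e.g. (8, 6) for an RTX 3070; needs a CUDA 11.x PyTorch build
    print("GPU capability:", torch.cuda.get_device_capability(0))
```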