RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

bhatiaa1 commented 1 year ago

Hello GRAN team, apologies, this is more a request for help than a bug report. I am trying to replicate the grid example in GRAN.

I have created a conda environment with pytorch 1.2.0 and python 3.7. I also installed the packages listed in requirements.txt using conda from default and conda-forge channels.

I have NVIDIA driver 515 and an RTX A6000 graphics card. I have tried pointing the environment at CUDA 10.2 and at CUDA 11.7 on my machine (through PATH and LD_LIBRARY_PATH); both return the same error.

I am running into the following run-time error: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Any suggestions on how to resolve this?

Thanks, Amit

runfile('_gran/run_exp.py', args='-c _gran/config/gran_grid.yaml', wdir='_gran', post_mortem=True)
Reloaded modules: runner, runner.gran_runner, model, model.gran_mixture_bernoulli, dataset, dataset.gran_data, utils, utils.data_helper, utils.logger, utils.train_helper, utils.arg_helper, utils.eval_helper, utils.dist_helper, utils.vis_helper, utils.data_parallel
_gran/utils/arg_helper.py:39: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = edict(yaml.load(open(config_file, 'r')))
INFO | 2023-03-21 12:51:49,425 | run_exp.py | line 26 : Writing log file to exp/GRAN/GRANMixtureBernoulli_grid_2023-Mar-21-12-51-49_1266886/log_exp_1266886.txt
INFO | 2023-03-21 12:51:49,426 | run_exp.py | line 27 : Exp instance id = 1266886
INFO | 2023-03-21 12:51:49,427 | run_exp.py | line 28 : Exp comment = None
INFO | 2023-03-21 12:51:49,427 | run_exp.py | line 29 : Config =
INFO | 2023-03-21 12:51:49,533 | gran_runner.py | line 124 : Train/val/test = 80/20/20
INFO | 2023-03-21 12:51:49,536 | gran_runner.py | line 137 : No Edges vs. Edges in training set = 111.70632737276479

{'dataset': {'data_path': 'data/', 'dev_ratio': 0.2, 'has_node_feat': False, 'is_overwrite_precompute': False, 'is_sample_subgraph': True, 'is_save_split': False, 'loader_name': 'GRANData', 'name': 'grid', 'node_order': 'DFS', 'num_fwd_pass': 1, 'num_subgraph_batch': 50, 'train_ratio': 0.8}, 'device': 'cuda:0', 'exp_dir': 'exp/GRAN', 'exp_name': 'GRANMixtureBernoulli_grid_2023-Mar-21-12-51-49_1266886', 'gpus': [0], 'model': {'block_size': 1, 'dimension_reduce': True, 'edge_weight': 1.0, 'embedding_dim': 128, 'has_attention': True, 'hidden_dim': 128, 'is_sym': True, 'max_num_nodes': 361, 'name': 'GRANMixtureBernoulli', 'num_GNN_layers': 7, 'num_GNN_prop': 1, 'num_canonical_order': 1, 'num_mix_component': 20, 'sample_stride': 1}, 'run_id': '1266886', 'runner': 'GranRunner', 'save_dir': 'exp/GRAN/GRANMixtureBernoulli_grid_2023-Mar-21-12-51-49_1266886', 'seed': 1234, 'test': {'batch_size': 20, 'better_vis': True, 'is_single_plot': False, 'is_test_ER': False, 'is_vis': True, 'num_test_gen': 20, 'num_vis': 20, 'num_workers': 0, 'test_model_dir': 'snapshot_model', 'test_model_name': 'gran_grid.pth', 'vis_num_row': 5}, 'train': {'batch_size': 1, 'display_iter': 10, 'is_resume': False, 'lr': 0.0001, 'lr_decay': 0.3, 'lr_decay_epoch': [100000000], 'max_epoch': 3000, 'momentum': 0.9, 'num_workers': 0, 'optimizer': 'Adam', 'resume_dir': None, 'resume_epoch': 5000, 'resume_model': 'model_snapshot_0005000.pth', 'shuffle': True, 'snapshot_epoch': 100, 'valid_epoch': 50, 'wd': 0.0}, 'use_gpu': True, 'use_horovod': False}
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
max # nodes = 361 || mean # nodes = 210.25
max # edges = 684 || mean # edges = 391.5
/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:82: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
ERROR | 2023-03-21 12:51:49,593 | run_exp.py | line 42 : Traceback (most recent call last):
  File "_gran/run_exp.py", line 38, in main
    runner.train()
  File "_gran/runner/gran_runner.py", line 248, in train
    train_loss = model(batch_fwd).mean()
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "_gran/utils/data_parallel.py", line 104, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "_gran/model/gran_mixture_bernoulli.py", line 438, in forward
    att_idx=att_idx)
  File "_gran/model/gran_mixture_bernoulli.py", line 218, in _inference
    node_feat = self.decoder_input(A_pad) # BCN_max X H
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/bhatiaa1/anaconda3/envs/GRAN/lib/python3.7/site-packages/torch/nn/functional.py", line 1369, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)


qiyan98 commented 1 year ago

Hi,

I can reproduce this error with torch==1.2.0 on my workstation. The error is likely due to a PyTorch stability issue. Once I upgraded to PyTorch 1.12, the problem went away.
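One plausible contributing factor, which nobody in this thread confirms, is a mismatch between the GPU generation and the CUDA version the PyTorch binary was built against. The RTX A6000 is an Ampere card with compute capability 8.6, which the CUDA toolkit supports from release 11.1 onward, whereas the official PyTorch 1.2.0 binaries were built against CUDA 9.2 or 10.0. Prebuilt PyTorch packages normally ship their own CUDA runtime libraries, which may also be why changing PATH and LD_LIBRARY_PATH made no difference. The sketch below is only a rough check under those assumptions; `MIN_CUDA_FOR_CAPABILITY`, `parse_cuda_version` and `build_supports_gpu` are hypothetical names, and the table is illustrative rather than exhaustive.

```python
from typing import Tuple

# Minimum CUDA toolkit release able to target each compute capability.
# Illustrative subset only; extend it for other GPU generations.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (e.g. RTX A6000)
}


def parse_cuda_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '11.3' (the format of torch.version.cuda) into (11, 3)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_supports_gpu(capability: Tuple[int, int], cuda_version: str) -> bool:
    """Return True if a build against `cuda_version` is new enough for `capability`."""
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        raise ValueError(f"capability {capability} is not in the illustrative table")
    return parse_cuda_version(cuda_version) >= required


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to inspect the local machine
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available() and torch.version.cuda:
        cap = torch.cuda.get_device_capability(0)
        print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0), cap)
        print("build new enough for this GPU:", build_supports_gpu(cap, torch.version.cuda))
```

Under these assumptions, `build_supports_gpu((8, 6), "10.0")` is false while `build_supports_gpu((8, 6), "11.3")` is true, which is consistent with the cu113 build used in the setup steps working where 1.2.0 did not.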

The exact steps I use to set up the python environment are as follows:

conda create --name gran python=3.7
conda activate gran
pip install pip -U 
pip install -r requirements.txt

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

python run_exp.py -c config/gran_grid.yaml

Note: there is a line that needs to be changed: https://github.com/lrjconan/GRAN/blob/fc9c04a3f002c55acf892f864c03c6040947bc6b/utils/arg_helper.py#L39

It needs to be changed to config = edict(yaml.load(open(config_file, 'r'), Loader=yaml.Loader)).
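For context, PyYAML 6.0 (the version listed in the environment in this thread) made `Loader` a required argument of `yaml.load`, so the old one-argument call fails instead of only warning. A minimal sketch of the patched call on an in-memory config follows. `CONFIG_TEXT`, `load_config` and `load_config_safe` are hypothetical names for illustration, the `edict` wrapper is left out to keep the example dependency-light, and the `yaml.safe_load` variant is an assumption of mine rather than something suggested in the thread.

```python
import io

import yaml

# Stand-in for a GRAN config file; the keys mirror a few entries of the printed config.
CONFIG_TEXT = """
exp_dir: exp/GRAN
use_gpu: true
train:
  batch_size: 1
  lr: 0.0001
"""


def load_config(stream):
    """Load a YAML config with an explicit loader, as in the patched line."""
    return yaml.load(stream, Loader=yaml.Loader)


def load_config_safe(stream):
    """Alternative for plain-data configs: safe_load never constructs arbitrary objects."""
    return yaml.safe_load(stream)


config = load_config(io.StringIO(CONFIG_TEXT))
config_safe = load_config_safe(io.StringIO(CONFIG_TEXT))
print(config["train"]["lr"], config == config_safe)
```

For a config that contains only scalars, lists and mappings, both loaders should return the same dictionary, so either can be wrapped in `edict(...)` the way the repository does.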

bhatiaa1 commented 1 year ago

Thanks. I followed the steps suggested above, including the change to one of the files, and still run into the same error. Here is my conda environment:

name: test_authors_gran_py37
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.01.10=h06a4308_0
  - certifi=2022.12.7=py37h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.2=h6a678d5_6
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=1.1.1t=h7f8727e_0
  - python=3.7.16=h7a1cb2a_0
  - readline=8.2=h5eee18b_0
  - setuptools=65.6.3=py37h06a4308_0
  - sqlite=3.41.1=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py37h06a4308_0
  - xz=5.2.10=h5eee18b_1
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - cycler==0.11.0
    - decorator==5.1.1
    - easydict==1.10
    - fonttools==4.38.0
    - kiwisolver==1.4.4
    - matplotlib==3.5.3
    - networkx==2.3
    - numpy==1.21.6
    - packaging==23.0
    - pillow==9.4.0
    - pip==23.0.1
    - protobuf==3.20.3
    - pyemd==1.0.0
    - pyparsing==3.0.9
    - python-dateutil==2.8.2
    - pyyaml==6.0
    - scipy==1.7.3
    - six==1.16.0
    - tensorboardx==2.6
    - torch==1.2.0
    - torchvision==0.4.0
    - tqdm==4.65.0
    - typing-extensions==4.5.0
prefix: /home/bhatiaa1/anaconda3/envs/test_authors_gran_py37

The error stack is:

$ python run_exp.py -c config/gran_grid.yaml

{'dataset': {'data_path': 'data/', 'dev_ratio': 0.2, 'has_node_feat': False, 'is_overwrite_precompute': False, 'is_sample_subgraph': True, 'is_save_split': False, 'loader_name': 'GRANData', 'name': 'grid', 'node_order': 'DFS', 'num_fwd_pass': 1, 'num_subgraph_batch': 50, 'train_ratio': 0.8}, 'device': 'cuda:0', 'exp_dir': 'exp/GRAN', 'exp_name': 'GRANMixtureBernoulli_grid_2023-Mar-27-15-02-47_1904319', 'gpus': [0], 'model': {'block_size': 1, 'dimension_reduce': True, 'edge_weight': 1.0, 'embedding_dim': 128, 'has_attention': True, 'hidden_dim': 128, 'is_sym': True, 'max_num_nodes': 361, 'name': 'GRANMixtureBernoulli', 'num_GNN_layers': 7, 'num_GNN_prop': 1, 'num_canonical_order': 1, 'num_mix_component': 20, 'sample_stride': 1}, 'run_id': '1904319', 'runner': 'GranRunner', 'save_dir': 'exp/GRAN/GRANMixtureBernoulli_grid_2023-Mar-27-15-02-47_1904319', 'seed': 1234, 'test': {'batch_size': 20, 'better_vis': True, 'is_single_plot': False, 'is_test_ER': False, 'is_vis': True, 'num_test_gen': 20, 'num_vis': 20, 'num_workers': 0, 'test_model_dir': 'snapshot_model', 'test_model_name': 'gran_grid.pth', 'vis_num_row': 5}, 'train': {'batch_size': 1, 'display_iter': 10, 'is_resume': False, 'lr': 0.0001, 'lr_decay': 0.3, 'lr_decay_epoch': [100000000], 'max_epoch': 3000, 'momentum': 0.9, 'num_workers': 0, 'optimizer': 'Adam', 'resume_dir': None, 'resume_epoch': 5000, 'resume_model': 'model_snapshot_0005000.pth', 'shuffle': True, 'snapshot_epoch': 100, 'valid_epoch': 50, 'wd': 0.0}, 'use_gpu': True, 'use_horovod': False}
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
max # nodes = 361 || mean # nodes = 210.25
max # edges = 684 || mean # edges = 391.5
INFO | 2023-03-27 15:02:47,695 | gran_runner.py | line 124 : Train/val/test = 80/20/20
INFO | 2023-03-27 15:02:47,697 | gran_runner.py | line 137 : No Edges vs. Edges in training set = 111.70632737276479
/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:82: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
ERROR | 2023-03-27 15:02:49,159 | run_exp.py | line 42 : Traceback (most recent call last):
  File "run_exp.py", line 38, in main
    runner.train()
  File "_gran/runner/gran_runner.py", line 248, in train
    train_loss = model(batch_fwd).mean()
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "_gran/utils/data_parallel.py", line 104, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "_gran/model/gran_mixture_bernoulli.py", line 438, in forward
    att_idx=att_idx)
  File "_gran/model/gran_mixture_bernoulli.py", line 218, in _inference
    node_feat = self.decoder_input(A_pad) # BCN_max X H
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/bhatiaa1/anaconda3/envs/test_authors_gran_py37/lib/python3.7/site-packages/torch/nn/functional.py", line 1369, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

qiyan98 commented 1 year ago

@bhatiaa1 Can you please upgrade your pytorch to 1.12? Thanks.

The error is likely due to a PyTorch stability issue. Once I upgraded to PyTorch 1.12, the problem went away.

See the above response for environment setup steps.
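The environment listing posted earlier in this thread still shows torch==1.2.0, so the upgrade had not taken effect in that environment. A small check like the one below can confirm which version is actually installed before rerunning. It is a sketch with hypothetical helper names (`parse_version`, `installed_version`), and it assumes input in the `package==version` style of that listing.

```python
import re
from typing import Iterable, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the numeric release, e.g. '1.12.1+cu113' -> (1, 12, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def installed_version(freeze_lines: Iterable[str], package: str) -> str:
    """Find `package==X` in pip-freeze style lines and return X."""
    prefix = package.lower() + "=="
    for line in freeze_lines:
        # Drop the leading "- " used in conda environment exports.
        entry = line.strip().lstrip("- ").lower()
        if entry.startswith(prefix):
            return entry[len(prefix):]
    raise LookupError(f"{package} not listed")


# Three entries copied from the environment listing in this thread.
listing = ["- tensorboardx==2.6", "- torch==1.2.0", "- torchvision==0.4.0"]
torch_version = installed_version(listing, "torch")
print(torch_version, parse_version(torch_version) >= (1, 12))
```

Comparing integer tuples rather than strings matters here: as plain strings, "1.2.0" sorts after "1.12", which would wrongly suggest the old version is the newer one.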

bhatiaa1 commented 1 year ago

@qiyan98 - many thanks! The PyTorch upgrade you suggested resolved the CUDA error and training now starts, but I run into the following error. I am now trying the same patch on utils/train_helper.py that you suggested above for arg_helper.py, and will confirm whether training completes successfully this time.

INFO | 2023-03-27 19:13:09,829 | gran_runner.py | line 266 : NLL Loss @ epoch 0100 iteration 00007960 = 0.00057323178043589
INFO | 2023-03-27 19:13:10,260 | gran_runner.py | line 266 : NLL Loss @ epoch 0100 iteration 00007970 = 0.0006488203653134406
INFO | 2023-03-27 19:13:10,746 | gran_runner.py | line 266 : NLL Loss @ epoch 0100 iteration 00007980 = 0.00035419981577433646
INFO | 2023-03-27 19:13:11,212 | gran_runner.py | line 266 : NLL Loss @ epoch 0100 iteration 00007990 = 0.0011428899597376585
INFO | 2023-03-27 19:13:11,660 | gran_runner.py | line 266 : NLL Loss @ epoch 0100 iteration 00008000 = 0.0028175509069114923
INFO | 2023-03-27 19:13:11,660 | gran_runner.py | line 270 : Saving Snapshot @ epoch 0100
ERROR | 2023-03-27 19:13:11,690 | run_exp.py | line 42 : Traceback (most recent call last):
  File "run_exp.py", line 38, in main
    runner.train()
  File "_gran/runner/gran_runner.py", line 271, in train
    snapshot(model.module if self.use_gpu else model, optimizer, self.config, epoch + 1, scheduler=lr_scheduler)
  File "_gran/utils/train_helper.py", line 40, in snapshot
    config_save = edict(yaml.load(open(save_name, 'r')))
TypeError: load() missing 1 required positional argument: 'Loader'

qiyan98 commented 1 year ago

Note: there is a line that needs to be changed: https://github.com/lrjconan/GRAN/blob/fc9c04a3f002c55acf892f864c03c6040947bc6b/utils/arg_helper.py#L39

It needs to be changed to config = edict(yaml.load(open(config_file, 'r'), Loader=yaml.Loader)).

@bhatiaa1 Did you apply this change? It should help to avoid the error you reported.

bhatiaa1 commented 1 year ago

Hi @qiyan98, yes, I applied the same patch on line 40 of utils/train_helper.py as well and the error has gone away. Thanks again for all your help! I think the problem is resolved; I am closing this issue.
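Since the same one-argument `yaml.load` call turned up in two files (utils/arg_helper.py and utils/train_helper.py), it may be worth scanning the checkout for any remaining ones. The following is a rough sketch, not part of the repository. `lines_missing_loader` and `scan_tree` are hypothetical helpers, and the check is line-based, so a call that passes `Loader=` on a continuation line would be reported as a false positive.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches a yaml.load( call; yaml.safe_load( is deliberately not matched.
YAML_LOAD = re.compile(r"\byaml\.load\(")


def lines_missing_loader(source: str) -> List[int]:
    """Return 1-based line numbers that call yaml.load( without 'Loader=' on that line."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if YAML_LOAD.search(line) and "Loader=" not in line:
            hits.append(number)
    return hits


def scan_tree(root: str) -> Dict[str, List[int]]:
    """Map each .py file under `root` to its offending lines; clean files are omitted."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = lines_missing_loader(path.read_text(encoding="utf-8", errors="ignore"))
        if hits:
            report[str(path)] = hits
    return report


# The two calls quoted in this thread: the patched one and the one from the traceback.
sample = (
    "config = edict(yaml.load(open(config_file, 'r'), Loader=yaml.Loader))\n"
    "config_save = edict(yaml.load(open(save_name, 'r')))\n"
)
print(lines_missing_loader(sample))
```

Running `scan_tree(".")` from the repository root would list the files and line numbers still needing the `Loader=yaml.Loader` argument.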

lrjconan / GRAN

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` #25