Fails on CUDNN_STATUS_EXECUTION_FAILED

neubig commented 5 years ago

This is the same error that was reported by @gsh2014 in #4, but I figure it'd be better to have it as a separate issue. I'm running into the same problem:

Namespace(action_embed_size=128, answer_prune=True, asdl_file='asdl/lang/py3/py3_asdl.simplified.txt', att_vec_size=256, batch_size=10, beam_size=15, clip_grad=5.0, column_att='affine', cuda=True, decay_lr_every_epoch=False, decode_max_time_step=100, decoder_word_dropout=0.0, dev_file='data/conala/dev.var_str_sep.bin', dropout=0.0, embed_size=128, eval_top_pred_only=False, evaluator='conala_evaluator', field_embed_size=64, glorot_init=True, glove_embed_path=None, hidden_size=256, lang='python', load_model=None, log_every=50, lr=0.001, lr_decay=0.5, lr_decay_after_epoch=15, lstm='lstm', max_epoch=50, max_num_trial=5, mode='train', negative_sample_type='best', no_copy=False, no_input_feed=False, no_parent_field_embed=False, no_parent_field_type_embed=True, no_parent_production_embed=True, no_parent_state=False, no_query_vec_to_action_map=False, optimizer='Adam', parser='default_parser', patience=5, primitive_token_label_smoothing=0.0, ptrnet_hidden_dim=32, query_vec_to_action_diff_map=False, readout='identity', reset_optimizer=False, sample_size=5, save_all_models=False, save_decode_to=None, save_to='saved_models/conala/model.sup.conala.lstm.hidden256.embed128.action128.field64.type64.dr0.0.lr0.001.lr_de0.5.lr_da15.beam15.vocab.var_str_sep.src_freq3.code_freq3.bin.train.var_str_sep.bin.glorot.par_state.seed0', seed=0, sql_db_file=None, src_token_label_smoothing=0.0, sup_attention=False, test_file=None, train_file='data/conala/train.var_str_sep.bin', transition_system='python3', type_embed_size=64, uniform_init=None, valid_every_epoch=1, valid_metric='acc', verbose=False, vocab='data/conala/vocab.var_str_sep.src_freq3.code_freq3.bin', word_dropout=0.0)
Traceback (most recent call last):
  File "exp.py", line 251, in <module>
    train(args)
  File "exp.py", line 71, in train
    if args.cuda: model.cuda()
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
    self.flatten_parameters()
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 102, in flatten_parameters
    fn.rnn_desc = rnn.init_rnn_descriptor(fn, handle)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 42, in init_rnn_descriptor
    cudnn.DropoutDescriptor(handle, dropout_p, fn.dropout_seed)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 207, in __init__
    self._set(dropout, seed)
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 232, in _set
    ctypes.c_ulonglong(seed),
  File "/home/gneubig/anaconda3/envs/py3torch3cuda9/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 283, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

I'm not sure if it's related, but I did find this: https://github.com/pytorch/pytorch/issues/953. That post seemed to indicate it might be an out-of-memory error, so I tried reducing the batch size and the size of the hidden dimensions, but this didn't change anything...

@pcyin: I was able to reproduce the error on ogma, so maybe you'd be able to as well?

neubig commented 5 years ago

FYI: I've found that the problem was caused by my GPU, a 2080Ti, which fails with any CUDA version older than 10. The environment suggested by TranX uses CUDA 9. I started working on a fix by addressing https://github.com/pcyin/tranX/issues/10 and building a more modern environment, but the newest version of PyTorch doesn't work with tranX, and there are several places that need fixing. Will update when I finish.
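
For anyone who wants to check their own setup for this kind of mismatch, here is a minimal sketch. The table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `parse_version`, `cuda_supports_gpu`) are illustrative assumptions, not part of tranX or PyTorch, and the table only lists a few well-known GPU generations. The 2080Ti is a Turing card with compute capability 7.5, which needs CUDA 10.0 or newer.

```python
# Minimum CUDA toolkit release able to target a given GPU compute capability.
# Illustrative and deliberately incomplete: only a few well-known entries.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta (e.g. V100)
    (7, 5): (10, 0),  # Turing (e.g. RTX 2080 Ti)
    (8, 0): (11, 0),  # Ampere (e.g. A100)
}


def parse_version(version):
    """Turn a version string such as '9.0' or '10.1.243' into (major, minor)."""
    major, minor = (version.split(".") + ["0"])[:2]
    return int(major), int(minor)


def cuda_supports_gpu(cuda_version, capability):
    """Return True or False if the capability is in the table, else None."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return None  # unknown GPU generation: this sketch cannot tell
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    # The case from this issue: a CUDA 9 environment with a compute 7.5 GPU.
    print(cuda_supports_gpu("9.0", (7, 5)))
    print(cuda_supports_gpu("10.0", (7, 5)))

    # Optional check against the local machine, only if PyTorch and a GPU exist.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.version.cuda and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        print(torch.version.cuda, capability,
              cuda_supports_gpu(torch.version.cuda, capability))
```

A `False` result means the PyTorch build's CUDA version is too old for the card, which is the situation described above.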

neubig commented 5 years ago

Should be fixed by https://github.com/pcyin/tranX/pull/15

chenyangh commented 5 years ago

Hi, Prof. Neubig. I merged your PR into my fork manually, but there were still some issues with the WikiSQL task. I made the following changes in model/wikisql/parser.py to make it work.

Line 247, from:
    action_prob_var = torch.cat([torch.cat(action_probs_i).log().sum() for action_probs_i in action_probs])
to:
    action_prob_var = torch.stack([torch.stack(action_probs_i).log().sum() for action_probs_i in action_probs])

Line 459, from:
    new_hyp_scores = torch.cat([x['new_hyp_score'] for x in new_hyp_meta])
to:
    new_hyp_scores = torch.stack([x['new_hyp_score'].cuda() for x in new_hyp_meta])
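
A likely reason the swap is needed: in recent PyTorch versions, reductions such as `.sum()` return 0-dimensional tensors, and `torch.cat` cannot join those because it concatenates along an existing dimension. `torch.stack` creates a new dimension instead. The sketch below uses made-up probability values, not data from the WikiSQL parser, and runs on the CPU.

```python
import torch

# Stand-ins for per-action probabilities; each entry is a 0-dim (scalar) tensor.
action_probs = [
    [torch.tensor(0.5), torch.tensor(0.25)],
    [torch.tensor(0.1)],
]

# torch.stack adds a new leading dimension, so scalar tensors are accepted.
action_prob_var = torch.stack(
    [torch.stack(action_probs_i).log().sum() for action_probs_i in action_probs]
)
print(action_prob_var.shape)  # one summed log-probability per example

# torch.cat joins along an existing dimension, which a 0-dim tensor lacks.
try:
    torch.cat([torch.tensor(0.5), torch.tensor(0.25)])
    cat_failed = False
except RuntimeError as err:
    cat_failed = True
    print("torch.cat failed:", err)
```

The `.cuda()` call in the line 459 change addresses a separate concern: `torch.stack` expects all of its inputs to be on the same device.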

pcyin / tranX

Fails on CUDNN_STATUS_EXECUTION_FAILED #14