kevinduh / san_mrc

Stochastic Answer Networks (SAN) for Machine Reading Comprehension
BSD 3-Clause "New" or "Revised" License

More hidden layers issue #10

Open LearningPytorch opened 5 years ago

LearningPytorch commented 5 years ago

@namisan when I increase the number of hidden layers to 300 I get this weird error:

CUDA error after cudaEventDestroy in future dtor: device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 123, in <module>
    main()
  File "train.py", line 81, in main
    model.update(batch)
  File "..san_mrc/src/model.py", line 94, in update
    loss = loss + F.binary_cross_entropy(pred, torch.unsqueeze(label, 1)) * self.opt.get('classifier_gamma', 1)
  File "../lib/python3.6/site-packages/torch/nn/functional.py", line 1483, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, size_average, reduce)
RuntimeError: cudaEventSynchronize in future::wait: device-side assert triggered

Have you faced this issue, or has anyone else?
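
A general note on `device-side assert triggered` errors: CUDA kernels launch asynchronously, so the Python traceback can blame a later call (here `cudaEventDestroy`/`cudaEventSynchronize`) rather than the kernel that actually failed. Forcing synchronous launches makes the assert surface at the real call site; a minimal sketch using the standard CUDA environment variable:

```python
import os

# Must be set before the CUDA context is created, i.e., before the first
# tensor is moved to the GPU (or export it in the shell before running train.py).
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```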

liuzzi commented 5 years ago

Yes, I've seen this error too. It doesn't happen right away either; it seems to happen randomly, 8-10 epochs in for me.

namisan commented 5 years ago

I haven't seen this error. Could you share the config with me? Thanks.

LearningPytorch commented 5 years ago

```python
import multiprocessing  # used by the defaults below

import torch


def model_config(parser):
    parser.add_argument('--vocab_size', type=int, default=0)
    parser.add_argument('--covec_on', action='store_false')
    parser.add_argument('--embedding_dim', type=int, default=300)
    parser.add_argument('--fasttext_on', action='store_true')

    # pos
    parser.add_argument('--no_pos', dest='pos_on', action='store_false')
    parser.add_argument('--pos_vocab_size', type=int, default=54)
    parser.add_argument('--pos_dim', type=int, default=12)
    parser.add_argument('--no_ner', dest='ner_on', action='store_false', help='NER_BIO')
    parser.add_argument('--ner_vocab_size', type=int, default=41)
    parser.add_argument('--ner_dim', type=int, default=8)
    parser.add_argument('--no_feat', dest='feat_on', action='store_false')
    parser.add_argument('--num_features', type=int, default=4)

    # q->p
    parser.add_argument('--prealign_on', action='store_false')
    parser.add_argument('--prealign_head', type=int, default=1)
    parser.add_argument('--prealign_att_dropout', type=float, default=0)
    parser.add_argument('--prealign_norm_on', action='store_true')
    parser.add_argument('--prealign_proj_on', action='store_true')
    parser.add_argument('--prealign_bidi', action='store_true')
    parser.add_argument('--prealign_hidden_size', type=int, default=256)
    parser.add_argument('--prealign_share', action='store_false')
    parser.add_argument('--prealign_residual_on', action='store_true')
    parser.add_argument('--prealign_scale_on', action='store_false')
    parser.add_argument('--prealign_sim_func', type=str, default='dotproductproject')
    parser.add_argument('--prealign_activation', type=str, default='relu')

    parser.add_argument('--pwnn_on', action='store_false')
    parser.add_argument('--pwnn_hidden_size', type=int, default=256,
                        help='support short con')

    # contextual encoding
    parser.add_argument('--contextual_hidden_size', type=int, default=256)
    parser.add_argument('--contextual_cell_type', type=str, default='lstm')
    parser.add_argument('--contextual_weight_norm_on', action='store_true')
    parser.add_argument('--contextual_maxout_on', action='store_true')
    parser.add_argument('--contextual_residual_on', action='store_true')
    parser.add_argument('--contextual_encoder_share', action='store_true')
    parser.add_argument('--contextual_num_layers', type=int, default=2)

    # mem setting
    parser.add_argument('--msum_hidden_size', type=int, default=256)
    parser.add_argument('--msum_cell_type', type=str, default='lstm')
    parser.add_argument('--msum_weight_norm_on', action='store_true')
    parser.add_argument('--msum_maxout_on', action='store_true')
    parser.add_argument('--msum_residual_on', action='store_true')
    parser.add_argument('--msum_lexicon_input_on', action='store_true')
    parser.add_argument('--msum_num_layers', type=int, default=1)

    # attention
    parser.add_argument('--deep_att_lexicon_input_on', action='store_false')
    parser.add_argument('--deep_att_hidden_size', type=int, default=256)
    parser.add_argument('--deep_att_sim_func', type=str, default='dotproductproject')
    parser.add_argument('--deep_att_activation', type=str, default='relu')
    parser.add_argument('--deep_att_norm_on', action='store_false')
    parser.add_argument('--deep_att_proj_on', action='store_true')
    parser.add_argument('--deep_att_residual_on', action='store_true')
    parser.add_argument('--deep_att_share', action='store_false')
    parser.add_argument('--deep_att_opt', type=int, default=0)

    # self attn
    parser.add_argument('--self_attention_on', action='store_false')
    parser.add_argument('--self_att_hidden_size', type=int, default=256)
    parser.add_argument('--self_att_sim_func', type=str, default='dotproductproject')
    parser.add_argument('--self_att_activation', type=str, default='relu')
    parser.add_argument('--self_att_norm_on', action='store_true')
    parser.add_argument('--self_att_proj_on', action='store_true')
    parser.add_argument('--self_att_residual_on', action='store_true')
    parser.add_argument('--self_att_dropout', type=float, default=0)
    parser.add_argument('--self_att_drop_diagonal', action='store_false')
    parser.add_argument('--self_att_share', action='store_false')

    # query summary
    parser.add_argument('--query_sum_att_type', type=str, default='linear',
                        help='linear/mlp')
    parser.add_argument('--query_sum_norm_on', action='store_true')

    parser.add_argument('--max_len', type=int, default=5)
    parser.add_argument('--decoder_num_turn', type=int, default=10)
    parser.add_argument('--decoder_mem_type', type=int, default=1)
    parser.add_argument('--decoder_mem_drop_p', type=float, default=0.1)
    parser.add_argument('--decoder_opt', type=int, default=0)
    parser.add_argument('--decoder_att_hidden_size', type=int, default=256)
    parser.add_argument('--decoder_att_type', type=str, default='bilinear',
                        help='bilinear/simple/default')
    parser.add_argument('--decoder_rnn_type', type=str, default='gru',
                        help='rnn/gru/lstm')
    parser.add_argument('--decoder_sum_att_type', type=str, default='bilinear',
                        help='bilinear/simple/default')
    parser.add_argument('--decoder_weight_norm_on', action='store_true')
    parser.add_argument('--classifier_merge_opt', type=int, default=1)
    parser.add_argument('--classifier_dropout_p', type=float, default=0.4)
    parser.add_argument('--classifier_weight_norm_on', action='store_false')
    parser.add_argument('--classifier_gamma', type=float, default=1)
    parser.add_argument('--classifier_threshold', type=float, default=0.3)
    parser.add_argument('--label_size', type=int, default=1)
    return parser


def data_config(parser):
    parser.add_argument('--v2_on', action='store_true')
    parser.add_argument('--log_file', default='san.log', help='path for log file.')
    parser.add_argument('--data_dir', default='data/')
    parser.add_argument('--meta', default='meta')
    parser.add_argument('--train_data', default='train_data',
                        help='path to preprocessed training data file.')
    parser.add_argument('--dev_data', default='dev_data',
                        help='path to preprocessed validation data file.')
    parser.add_argument('--dev_gold', default='data/dev-v1.1.json',
                        help='path to preprocessed validation data file.')
    parser.add_argument('--covec_path', default='data/MT-LSTM.pt')
    parser.add_argument('--glove', default='data/glove.840B.300d.txt',
                        help='path to word vector file.')
    parser.add_argument('--sort_all', action='store_true',
                        help='sort the vocabulary by frequencies of all words. '
                             'Otherwise consider question words first.')
    parser.add_argument('--threads', type=int, default=multiprocessing.cpu_count(),
                        help='number of threads for preprocessing.')
    return parser


def train_config(parser):
    parser.add_argument('--cuda', type=bool, default=torch.cuda.is_available(),
                        help='Use GPU acceleration.')
    parser.add_argument('--log_per_updates', type=int, default=100)
    parser.add_argument('--epoches', type=int, default=100)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--optimizer', default='adamax')
    parser.add_argument('--grad_clipping', type=float, default=5)
    parser.add_argument('--weight_decay', type=float, default=0)
    parser.add_argument('--learning_rate', type=float, default=0.002)
    parser.add_argument('--momentum', type=float, default=0)
    parser.add_argument('--vb_dropout', action='store_false')
    parser.add_argument('--dropout_p', type=float, default=0.1)
    # parser.add_argument('--dropout_p', type=float, default=0.35)
    parser.add_argument('--dropout_emb', type=float, default=0.4)
    parser.add_argument('--dropout_cov', type=float, default=0.4)
    parser.add_argument('--dropout_w', type=float, default=0.05)

    # scheduler
    parser.add_argument('--no_lr_scheduler', dest='have_lr_scheduler', action='store_false')
    parser.add_argument('--multi_step_lr', type=str, default='10,20,30')
    parser.add_argument('--lr_gamma', type=float, default=0.5)
    parser.add_argument('--scheduler_type', type=str, default='ms', help='ms/rop/exp')
    parser.add_argument('--fix_embeddings', action='store_true',
                        help='if true, `tune_partial` will be ignored.')
    parser.add_argument('--tune_partial', type=int, default=1000,
                        help='finetune top-x embeddings (including <PAD>, <UNK>).')
    parser.add_argument('--model_dir', default='checkpoint')
    parser.add_argument('--seed', type=int, default=2018)
    return parser
```
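
For reference, these `*_config` functions each take and return an `argparse` parser, so they are meant to be chained onto a single parser; a minimal sketch of how they compose (the repo's `train.py` may differ in detail):

```python
import argparse

parser = argparse.ArgumentParser()
parser = data_config(parser)
parser = model_config(parser)
parser = train_config(parser)
args = parser.parse_args()
print(vars(args))  # the full resolved config
```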

liuzzi commented 5 years ago

Could this be a torch version issue?
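
If a version mismatch is suspected, printing the exact stack on each machine narrows it down quickly; these are standard PyTorch calls:

```python
import torch

# Compare these three values across the machines that do and don't crash.
print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
```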

Srinivas-R commented 5 years ago

Do you have free disk space to store the checkpoints? I faced some issues when I switched to bigger hidden sizes (the error was different, but equally unhelpful: "Torch unknown error -1"); it turned out to be a lack of disk space. Monitoring the GPU memory might also help.
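
For the GPU-memory angle, a minimal probe that can be called once per epoch, assuming a single CUDA device (standard `torch.cuda` calls):

```python
import torch

def log_gpu_mem(tag=''):
    # Memory currently held by tensors, plus the peak since startup, in MiB.
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f'{tag} allocated: {alloc:.0f} MiB (peak: {peak:.0f} MiB)')
```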

liuzzi commented 5 years ago

Hm, no, it's definitely not a disk space issue. I actually tried on two different servers and got the same problem.

liuzzi commented 5 years ago

To give a bit more context on the error: this didn't happen until epoch 11, with 300 hidden layers and 8 decoder steps.

/pytorch/aten/src/THCUNN/BCECriterion.cu:42: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference<float>, thrust::device_reference<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `input >= 0. && input <= 1.` failed.
Traceback (most recent call last):  
  File "train.py", line 169, in <module>
    main()
  File "train.py", line 125, in main
    model.update(batch)
  File "/mnt/research/san_mrc/src/model.py", line 94, in update
    loss = loss + F.binary_cross_entropy(pred, torch.unsqueeze(label, 1)) * self.opt.get('classifier_gamma', 1)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/functional.py", line 1603, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction)
RuntimeError: reduce failed to synchronize: device-side assert triggered
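
That assert narrows it down: the BCE CUDA kernel requires every prediction to lie in [0, 1], so a single NaN/Inf or out-of-range value coming out of the classifier is enough to kill the run. A sketch of a pre-check before the loss call, assuming `pred` is the classifier output referenced in `model.py` (the `eps` bound is illustrative):

```python
import torch

def check_probs(pred, eps=1e-6):
    """Validate/repair probabilities before F.binary_cross_entropy."""
    # Fail fast, with a readable error, if anything upstream produced NaN/Inf.
    if not torch.isfinite(pred).all():
        raise RuntimeError('non-finite values in pred before BCE')
    # Keep values strictly inside (0, 1); exact 0/1 make log() blow up.
    return pred.clamp(eps, 1.0 - eps)
```
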
namisan commented 5 years ago

I still haven't reproduced this error. Here is my Docker image: allenlao/pytorch-allennlp-rtd

tungloong commented 5 years ago

I have encountered this problem too. I may have solved it by changing `F.binary_cross_entropy` to `F.binary_cross_entropy_with_logits`, following https://github.com/pytorch/pytorch/issues/2209. After that change, the error above no longer happens, but I hit another error:

Traceback (most recent call last):
  File "train.py", line 165, in <module>
    main()
  File "train.py", line 108, in main
    results, labels = predict_squad(model, dev_data, v2_on=args.v2_on)
  File "/home/tungloong/san_mrc-master/my_utils/data_utils.py", line 33, in predict_squad
    phrase, spans, scores = model.predict(batch)
  File "/home/tungloong/san_mrc-master/src/model.py", line 140, in predict
    s_offset, e_offset = spans[i][s_idx][0], spans[i][e_idx][1]
IndexError: list index out of range

So I am not sure whether it actually works, and I have not run it with the new config yet. I'll try it another day.
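
One caveat on that swap: `F.binary_cross_entropy_with_logits` applies the sigmoid itself, so it must be fed raw scores; if the model already sigmoids its classifier output (which the original `F.binary_cross_entropy` call implies), that sigmoid has to be removed, or the loss is computed on doubly squashed values. The two formulations are otherwise equivalent, with the fused one being numerically stabler (shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 1)                    # raw classifier scores
labels = torch.randint(0, 2, (32, 1)).float()

# Original: explicit sigmoid, then BCE on probabilities (can hit the assert).
loss_a = F.binary_cross_entropy(torch.sigmoid(logits), labels)
# Fused: sigmoid computed inside in log-space; no out-of-range probabilities.
loss_b = F.binary_cross_entropy_with_logits(logits, labels)

assert torch.allclose(loss_a, loss_b, atol=1e-6)
```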

namisan commented 5 years ago

I'm not sure if you processed the data correctly, e.g., with the correct SQuAD version. The error suggests s_idx/e_idx may be out of range.
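
If the data does turn out to be correct, a defensive bound check around the failing line in `predict` would at least identify which example yields an out-of-range span (hypothetical guard, not the repo's code):

```python
def safe_span(spans_i, s_idx, e_idx):
    """Clamp predicted token indices into range before dereferencing spans."""
    n = len(spans_i)
    if s_idx >= n or e_idx >= n:
        # Log the offending prediction instead of crashing mid-evaluation.
        print(f'out-of-range span: s={s_idx}, e={e_idx}, n={n}')
        s_idx, e_idx = min(s_idx, n - 1), min(e_idx, n - 1)
    return spans_i[s_idx][0], spans_i[e_idx][1]
```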