tagoyal / sow-reap-paraphrasing

Contains data/code for the paper "Neural Syntactic Preordering for Controlled Paraphrase Generation" (ACL 2020).

sow model training error #13

Open TITC opened 3 years ago

TITC commented 3 years ago

When I train on a dataset made from sample_test_sow_reap.txt, I get the following error.

Here is the training data and log output:

input: BOS Y are therefore normally incurred by car makers on the sole basis X the market incentive . EOS
gt output: BOS for automobile manufacturers based on the initiative X the market for Y EOS
BOS X EOS

dev nll per token: 8.729735
done with batch 0 / 4 in epoch 4, loss: 8.444669, time:46
train nll per token : 8.444669 

input: BOS a Y of paper had been taped to X . EOS
gt output: BOS X was a Y of paper . EOS
BOS EOS EOS

input: BOS a higher rate Y has been observed in X compared with infants . EOS
gt output: BOS in X , a higher incidence Y was observed than in infants . EOS
BOS X EOS

input: BOS Y has been observed in X compared with infants . EOS
gt output: BOS in X , Y was observed than in infants . EOS
BOS EOS EOS

dev nll per token: 8.472298
done with batch 0 / 4 in epoch 5, loss: 8.056769, time:46
train nll per token : 8.056769 

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, 
···
Traceback (most recent call last):
  File "sow/train.py", line 301, in <module>
    main(args)
  File "sow/train.py", line 199, in main
    preds = model(curr_inp, curr_out, curr_inp_pos, curr_in_order)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 50, in forward
    device_ids=device_ids.get('encoder', None))
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 31, in encode
    return self.encoder(inputs, input_postags, input_pos_order, hidden)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 79, in forward
    x = block(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/transformer_blocks.py", line 114, in forward
    x, _ = self.attention(x, x, x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/attention.py", line 284, in forward
    MultiHeadAttention, self).forward(query, key, value, key_padding_mask=key_padding_mask, attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 783, in forward
    attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3097, in multi_head_attention_forward
    qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
TITC commented 3 years ago

Dear author, I have found some links that further confirm the issue. I also found a way to get a clearer message for the cuda runtime error (59), by adding the line below to sow/train.py:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  
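Note: CUDA_LAUNCH_BLOCKING only affects CUDA work that starts after it is set, so the safest placement is at the very top of the script. A minimal placement sketch, assuming the usual import layout of train.py:

    # Set before any CUDA work so kernel launches run synchronously and
    # the failing op is reported at its real call site instead of later.
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

    import torch  # import/use torch only after the variable is set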

With that set, the shell gives a much clearer error, pointing at:

  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 71, in forward
    y = self.pos_embedder(input_postags).mul_(self.scale_embedding)

I think this part corresponds to the target order r in your paper? But I am not sure, because there is a multiply operation here.

Any advice is welcome.

reference: https://discuss.pytorch.org/t/device-side-assert-triggered-at-error/82488/5
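For anyone hitting the same opaque assert: the usual root cause is an index greater than or equal to num_embeddings of an nn.Embedding. A minimal repro sketch of that failure mode (the sizes here are illustrative, not the repo's actual config):

    import torch
    import torch.nn as nn

    # A 70-row embedding table only accepts indices 0..69.
    pos_embedder = nn.Embedding(num_embeddings=70, embedding_dim=16)
    print(pos_embedder(torch.tensor([0, 69])).shape)  # fine: torch.Size([2, 16])

    try:
        pos_embedder(torch.tensor([70]))  # one past the end of the table
    except IndexError as err:
        # On CPU this is a readable IndexError; on CUDA the same bug
        # surfaces as the asynchronous device-side assert shown above.
        print("caught:", err)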

TITC commented 3 years ago

OK, I think the problem is here:

    model_config['postag_size'] = len(pos)

The above should be changed to:

    model_config['postag_size'] = len(pos)+1
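A cheap way to catch this class of bug before the opaque GPU assert is to check the data against the table size up front. A sketch only; pos_batches is a hypothetical stand-in for however train.py iterates its POS tensors:

    # Sanity check before training: every POS index must fit the table.
    max_idx = max(int(batch_postags.max()) for batch_postags in pos_batches)
    assert max_idx < model_config['postag_size'], (
        f"POS index {max_idx} out of range for a "
        f"{model_config['postag_size']}-row embedding")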

reference: https://blog.csdn.net/Geek_of_CSDN/article/details/86527107

TITC commented 3 years ago

There are still some things that don't make sense.

The number of POS classes is 71, so I can fix the problem by adding 1 to model_config['postag_size'].

But it is strange that the index 71 appears in the dev set made by your script: since the size of the POS vocabulary is 71, the index 71 should not be possible.

The other odd thing is that all the values in the dev set you provide via Google Drive are below 71, yet the error still occurs there too, and it can likewise be fixed by adding 1.

TITC commented 3 years ago

The reason the POS index 71 appears is here:

            for p in pos1 + pos2:
                if p not in pos_vocab.keys():
                    pos_vocab[p] = len(pos_vocab)
                    rev_pos_vocab[pos_vocab[p]] = p

New POS tags are added to pos_vocab here, but the result is saved as a new pkl file, while train.py reads the previous pos_vocab. That makes the embedding size len(pos_vocab) == 70, not 71:

    model_config['postag_size'] = len(pos)
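Given that diagnosis, a more robust fix is to size the embedding from the vocab file that preprocessing actually wrote, using the largest index rather than the dict length. A sketch only; the pkl path and variable names are assumptions, not the repo's actual code:

    import pickle

    # Load the same pos_vocab pkl that preprocessing saved (path assumed)
    # and size the table from the maximum index, so a stale or extended
    # vocab can never index past the end of the embedding.
    with open('pos_vocab.pkl', 'rb') as f:
        pos_vocab = pickle.load(f)
    model_config['postag_size'] = max(pos_vocab.values()) + 1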
tagoyal commented 3 years ago

Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with the vocabulary that you created, or the one provided in the Google Drive?

TITC commented 3 years ago

> Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with the vocabulary that you created, or the one provided in the Google Drive?

If the error is caused by an index, how do you explain that it also occurs when running on the datasets you shared on Google Drive, where the index range is 0~70 and does not exceed 71?