tagoyal / sow-reap-paraphrasing

Contains data/code for the paper "Neural Syntactic Preordering for Controlled Paraphrase Generation" (ACL 2020).

sow model training error #13

Open TITC opened 3 years ago

TITC commented 3 years ago

When I train on a dataset made from sample_test_sow_reap.txt, I get the following error.

Here is the training data and log output:

input: BOS Y are therefore normally incurred by car makers on the sole basis X the market incentive . EOS
gt output: BOS for automobile manufacturers based on the initiative X the market for Y EOS
BOS X EOS

dev nll per token: 8.729735
done with batch 0 / 4 in epoch 4, loss: 8.444669, time:46
train nll per token : 8.444669 

input: BOS a Y of paper had been taped to X . EOS
gt output: BOS X was a Y of paper . EOS
BOS EOS EOS

input: BOS a higher rate Y has been observed in X compared with infants . EOS
gt output: BOS in X , a higher incidence Y was observed than in infants . EOS
BOS X EOS

input: BOS Y has been observed in X compared with infants . EOS
gt output: BOS in X , Y was observed than in infants . EOS
BOS EOS EOS

dev nll per token: 8.472298
done with batch 0 / 4 in epoch 5, loss: 8.056769, time:46
train nll per token : 8.056769 

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, 
···
Traceback (most recent call last):
  File "sow/train.py", line 301, in <module>
    main(args)
  File "sow/train.py", line 199, in main
    preds = model(curr_inp, curr_out, curr_inp_pos, curr_in_order)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 50, in forward
    device_ids=device_ids.get('encoder', None))
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 31, in encode
    return self.encoder(inputs, input_postags, input_pos_order, hidden)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 79, in forward
    x = block(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/transformer_blocks.py", line 114, in forward
    x, _ = self.attention(x, x, x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/attention.py", line 284, in forward
    MultiHeadAttention, self).forward(query, key, value, key_padding_mask=key_padding_mask, attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 783, in forward
    attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3097, in multi_head_attention_forward
    qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
TITC commented 3 years ago

Dear author, I have found some links that further confirm the issue. I also found a way to get a clearer message for the cuda runtime error (59), by adding the line below to sow/train.py:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  
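Note: CUDA_LAUNCH_BLOCKING only affects CUDA work that starts after it is set, so the safest placement is at the very top of the script. A minimal placement sketch, assuming the usual import layout of train.py:

    # Set before any CUDA work so kernel launches run synchronously and
    # the failing op is reported at its real call site instead of later.
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

    import torch  # import/use torch only after the variable is set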

With that set, the shell gives a much clearer error, pointing at:

  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 71, in forward
    y = self.pos_embedder(input_postags).mul_(self.scale_embedding)

I think this part corresponds to the target order r in your paper? But I am not sure, because there is a multiply operation here.

Any advice is welcome.

reference: https://discuss.pytorch.org/t/device-side-assert-triggered-at-error/82488/5
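For anyone hitting the same opaque assert: the usual root cause is an index greater than or equal to num_embeddings of an nn.Embedding. A minimal repro sketch of that failure mode (the sizes here are illustrative, not the repo's actual config):

    import torch
    import torch.nn as nn

    # A 70-row embedding table only accepts indices 0..69.
    pos_embedder = nn.Embedding(num_embeddings=70, embedding_dim=16)
    print(pos_embedder(torch.tensor([0, 69])).shape)  # fine: torch.Size([2, 16])

    try:
        pos_embedder(torch.tensor([70]))  # one past the end of the table
    except IndexError as err:
        # On CPU this is a readable IndexError; on CUDA the same bug
        # surfaces as the asynchronous device-side assert shown above.
        print("caught:", err)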

TITC commented 3 years ago

OK, I think the problem is here:

    model_config['postag_size'] = len(pos)

The above should be changed to:

    model_config['postag_size'] = len(pos)+1
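A cheap way to catch this class of bug before the opaque GPU assert is to check the data against the table size up front. A sketch only; pos_batches is a hypothetical stand-in for however train.py iterates its POS tensors:

    # Sanity check before training: every POS index must fit the table.
    max_idx = max(int(batch_postags.max()) for batch_postags in pos_batches)
    assert max_idx < model_config['postag_size'], (
        f"POS index {max_idx} out of range for a "
        f"{model_config['postag_size']}-row embedding")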

reference: https://blog.csdn.net/Geek_of_CSDN/article/details/86527107

TITC commented 3 years ago

There are still some things that don't make sense.

The number of POS classes is 71, so I can fix the problem by adding 1 to model_config['postag_size'].

But it is strange that the index 71 appears in the dev set made by your script: since the size of the POS vocabulary is 71, the index 71 should not be possible.

The other odd thing is that all the values in the dev set you provide via Google Drive are below 71, yet the error still occurs there too, and it can likewise be fixed by adding 1.

TITC commented 3 years ago

The reason the POS index 71 appears is here:

            for p in pos1 + pos2:
                if p not in pos_vocab.keys():
                    pos_vocab[p] = len(pos_vocab)
                    rev_pos_vocab[pos_vocab[p]] = p

New POS tags are added to pos_vocab here, but the result is saved as a new pkl file, while train.py reads the previous pos_vocab. That makes the embedding size len(pos_vocab) == 70, not 71:

    model_config['postag_size'] = len(pos)
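Given that diagnosis, a more robust fix is to size the embedding from the vocab file that preprocessing actually wrote, using the largest index rather than the dict length. A sketch only; the pkl path and variable names are assumptions, not the repo's actual code:

    import pickle

    # Load the same pos_vocab pkl that preprocessing saved (path assumed)
    # and size the table from the maximum index, so a stale or extended
    # vocab can never index past the end of the embedding.
    with open('pos_vocab.pkl', 'rb') as f:
        pos_vocab = pickle.load(f)
    model_config['postag_size'] = max(pos_vocab.values()) + 1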
tagoyal commented 3 years ago

Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with the vocabulary that you created, or the one provided in the Google Drive?

TITC commented 3 years ago

> Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with the vocabulary that you created, or the one provided in the Google Drive?

If the error is caused by an index, how do you explain that it also occurs when running on the datasets you shared on Google Drive, where the index range is 0~70 and does not exceed 71?