jiacheng-ye / kg_one2set

[ACL 2021] Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"
https://arxiv.org/abs/2105.11134

RuntimeError: CUDA error: device-side assert triggered #2

Closed kgarg8 closed 2 years ago

kgarg8 commented 3 years ago

Hi,

Thanks for the nice repo.

I am hitting the following error while training the model on the kp20k dataset. FYI, I am training with batch_size=2.

08/30/2021 23:41:03 [INFO] train_ml: Epoch 1; batch: 90000; total batch: 90000,avg training ppl: 5.333, loss: 1.674
08/30/2021 23:43:40 [INFO] train_ml: Epoch 1; batch: 91000; total batch: 91000,avg training ppl: 5.328, loss: 1.673
08/30/2021 23:46:18 [INFO] train_ml: Epoch 1; batch: 92000; total batch: 92000,avg training ppl: 5.322, loss: 1.672
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [148,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [130,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "train.py", line 103, in <module>
    main(opt)
  File "train.py", line 85, in main
    train_ml.train_model(model, optimizer, train_data_loader, valid_data_loader, opt)
  File "/home/ubuntu/kg_one2set/train_ml.py", line 44, in train_model
    batch_loss_stat = train_one_batch(batch, model, optimizer, opt)
  File "/home/ubuntu/kg_one2set/train_ml.py", line 146, in train_one_batch
    control_embed = model.decoder.forward_seg(state)
  File "/home/ubuntu/kg_one2set/pykp/decoder/transformer.py", line 153, in forward_seg
    control_idx = torch.arange(0, self.max_kp_num).long().to(device).reshape(1, -1).repeat(batch_size, 1)
RuntimeError: CUDA error: device-side assert triggered

Any suggestions would be appreciated.
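
A note on reading the error: the assertion `srcIndex < srcSelectDimSize` in `indexSelectLargeIndex` fires when an index-select style lookup (an embedding lookup, for example) receives an index that is not smaller than the size of the dimension being indexed. Because CUDA kernels run asynchronously, the Python line shown in the traceback is not necessarily the operation that failed. Rerunning with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on CPU, usually moves the traceback to the real call site. The sketch below is one possible way to check index values before a lookup so that the failure becomes a readable Python error. The helper names `find_out_of_range` and `check_indices` are hypothetical and not part of this repo, and which tensor actually holds the bad index in this run is not known from the log.

```python
def find_out_of_range(indices, num_embeddings):
    """Return (position, value) pairs for indices outside [0, num_embeddings)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if idx < 0 or idx >= num_embeddings
    ]


def check_indices(name, indices, num_embeddings):
    """Raise a readable error instead of an opaque device-side assert.

    `indices` is a flat iterable of Python ints; with PyTorch it could be
    obtained via `tensor.flatten().tolist()` (assumed usage, see below).
    """
    bad = find_out_of_range(indices, num_embeddings)
    if bad:
        raise ValueError(
            f"{name}: {len(bad)} index value(s) outside [0, {num_embeddings}); "
            f"first offenders (position, value): {bad[:5]}"
        )


# Hypothetical usage inside a training step (variable names are assumptions):
#   check_indices("src", src.flatten().tolist(), embedding.num_embeddings)
```

Calling such a check on each index tensor just before its embedding lookup, for the failing batch only, should show whether a token id exceeds the vocabulary size or some other index exceeds its table size.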