nanguoshun / LSR

PyTorch implementation of our ACL 2020 paper "Reasoning with Latent Structure Refinement for Document-Level Relation Extraction"

as_strided() error #41

Open logan-markewich opened 3 years ago

logan-markewich commented 3 years ago

I've noticed a few people encountering issues like this:

Traceback (most recent call last):
  File "train.py", line 121, in <module>
    con.train(model[args.model_name], args.save_name)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\config\ConfigBert.py", line 740, in train
    predict_re = model(context_idxs, context_pos, context_ner,
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\code_bert\lsr_bert.py", line 163, in forward
    output = self.reasoner[i](output)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 186, in forward
    _, att = self.struc_att(input)
  File "C:\Users\logan\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\logan\Documents\mitacs2\LSR\code\models\reasoner.py", line 66, in forward
    res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)
RuntimeError: setStorage: sizes [88, 88], strides [7744, 89], storage offset 0, and itemsize 4 requiring a storage size of 2725888 are out of bounds for storage of size 30976

It has been stated that changing the seed or batch size can help fix this, but the memory requirement for training is quite insane lol. I can only train with a batch size of 1 or 2, and all the seeds I've tried hit a similar error.

I'm pretty unfamiliar with what exactly is causing this error, but if you can suggest a fix for the code I can try it out! Otherwise, I cannot train.
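
For reference, the failure can be reproduced outside the repo with the exact shapes from the traceback. The real tensors inside LSR are presumably built differently, but the size/stride mismatch is the same:

```python
import torch

# Assumed shapes: res is a [1, 88, 88] buffer (7744 floats = 30976 bytes of
# storage), tmp is [88, 88], so tmp.size(0) no longer matches res.size(0).
n = 88
res = torch.zeros(1, n, n)
tmp = torch.ones(n, n)

# Same call as models/reasoner.py line 66: the strided view asks for
# 88 * 7744 elements (2725888 bytes) of storage, but res only owns 7744,
# hence "out of bounds for storage of size 30976".
res.as_strided(tmp.size(), [res.stride(0), res.size(2) + 1]).copy_(tmp)
```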

logan-markewich commented 3 years ago

After some debugging, this happens when a batch isn't full.

tmp.size(0) != res.size(0), which is causing the error in as_strided

This could maybe be fixed by padding each batch to reach the batch size?
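
Roughly, I mean something like this hypothetical helper (not code from this repo) on the data side, with the duplicated slots masked out of the loss afterwards:

```python
def pad_batch(examples, batch_size):
    """Repeat the last example until the batch reaches batch_size.

    Returns the padded list and the number of real examples, so the
    padded slots can be excluded from the loss/metrics later.
    """
    padded = list(examples)
    while len(padded) < batch_size:
        padded.append(examples[-1])  # filler copy of the last example
    return padded, len(examples)
```

Alternatively, if res is [B, n, n] and tmp is [B, n] (which I haven't verified), replacing the as_strided trick with `res.diagonal(dim1=-2, dim2=-1).copy_(tmp)` would at least fail with a clear shape error instead of the storage message above.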

ThinkNaive commented 3 years ago

> After some debugging, this happens when a batch isn't full.
>
> tmp.size(0) != res.size(0), which is causing the error in as_strided
>
> This could maybe be fixed by padding each batch to reach the batch size?

Have you ever solved this problem? I hit it when running LSR (BERT version) with BATCH_SIZE set anywhere from 1 to 12, and I do not know how to address it.

logan-markewich commented 3 years ago

Nope, I just moved on to a different relation extraction model.

If you are curious, currently ATLOP has state-of-the-art on DocRED (63% F1). It is a much simpler model as well, and they provide the trained model checkpoints from their paper.

ThinkNaive commented 3 years ago

You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still has not finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

nanguoshun commented 3 years ago

Hi @ThinkNaive, thanks for your attention. For the BERT-based model, we empirically use a large batch size (> 16) for better convergence.

ThinkNaive commented 3 years ago

@nanguoshun Thank you for the advice. Maybe I should use a GPU with more memory to allow a larger batch size; I currently work with 12GB.

logan-markewich commented 3 years ago

Yea, 12GB may not be enough. The curse of deep learning 😆

nguyenvanhoang7398 commented 2 years ago

I'm encountering this error even on a machine with four 16GB GPUs. When this happened, I checked the GPU consumption and it was very low, so it can't be that the machine is out of GPU memory. I even reduced the batch size to 8 and the hidden dim to 64, but that didn't fix it. Would it be possible for someone to examine this? Thank you.

IKeepMoving commented 2 years ago

> Nope, I just moved on to a different relation extraction model.
>
> If you are curious, currently ATLOP has state-of-the-art on DocRED (63% F1). It is a much simpler model as well, and they provide the trained model checkpoints from their paper.

> You are right. I've run this model (the non-BERT version) with batch_size=12 for 33 hours and it still has not finished (currently at epoch 140). ATLOP is much faster with nice results (I ran ATLOP-bert-base-cased and got re_f1: 61.31%).

Have you changed any of its parameters? I ran ATLOP-bert-base-cased and only got re_f1: 59%.

logan-markewich commented 2 years ago

@IKeepMoving I used the weights they provide on their GitHub page (see the Releases pane on the right side). No need to re-train unless you really want to :)