lipiji / SongNet

Code for ACL 2020 paper "Rigid Formats Controlled Text Generation": https://www.aclweb.org/anthology/2020.acl-main.68/
MIT License

Running ./train.sh on Google Colab raises an error; not sure whether it is caused by the torch version #10

Closed smartmark-pro closed 3 years ago

smartmark-pro commented 3 years ago


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [136, 16, 2304]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

version: torch-1.7.0+cu101, python-3.6.9

Prof. Li, could you tell me which torch version you are using?
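For context on the error above: autograd saves intermediate tensors for the backward pass, and an in-place op (such as `+=` on a tensor) silently mutates a saved value, which is what the "modified by an inplace operation ... at version 2; expected version 1" message reports. The same in-place vs. out-of-place distinction can be illustrated with plain Python lists (a sketch of the semantics, not the SongNet code itself):

```python
# `x += y` on a mutable object modifies it in place (same object), so any
# earlier reference -- like a tensor saved for backward -- sees the change.
# `x = x + y` builds a new object and leaves the saved one intact.
x = [1, 2]
saved = x          # analogous to a tensor autograd saved for backward
x += [3]           # in-place: mutates the object `saved` points to
print(saved)       # [1, 2, 3] -- the "saved" value was corrupted

y = [1, 2]
saved = y
y = y + [3]        # out-of-place: allocates a new object
print(saved)       # [1, 2] -- the saved value is untouched
```

For torch tensors the fix is the same shape: replace `q += bias` with `q = q + bias` (or clone before mutating), and `torch.autograd.set_detect_anomaly(True)` as the error hints will point at the offending op.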

lipiji commented 3 years ago

@smartmark-pro I have fixed the code mentioned in your log. Please try it again.

smartmark-pro commented 3 years ago

@lipiji It works now, thank you, Prof. Li. One more thing: after this fix I ran into a "RuntimeError: CUDA out of memory" error; setting batch_size=8 in train.sh resolved it.
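Lowering batch_size works because activation memory scales linearly with the batch dimension. A back-of-the-envelope estimate for a single float32 tensor with the shape from the error message (illustrative arithmetic only, not a full memory model of training):

```python
# Rough size of one float32 activation tensor of shape
# [max_len, batch_size, width]; float32 = 4 bytes per element.
def tensor_bytes(max_len, batch_size, width, bytes_per_el=4):
    return max_len * batch_size * width * bytes_per_el

# Shape [136, 16, 2304] from the error message:
print(tensor_bytes(136, 16, 2304) / 2**20)  # 19.125 MiB at batch_size=16
print(tensor_bytes(136, 8, 2304) / 2**20)   # 9.5625 MiB at batch_size=8
```

Every activation kept for backward shrinks by the same factor, so halving batch_size roughly halves peak GPU memory.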

smartmark-pro commented 3 years ago

I searched for this error online yesterday and got a rough understanding of it: in-place syntax like += can cause it. But when I looked for similar patterns in biglm alone, I couldn't find the problem. After seeing your reply today, I checked the sizes: [136, 16, 2304] is [max_size, batch_size, hidden_size*3]. Assuming this tensor is the q in the code, where does the 3 come from?
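A plausible answer to the question above (an assumption about the code, not confirmed in this thread): many Transformer implementations compute query, key, and value with one fused linear projection of width 3*hidden_size, then split the result into three hidden-sized chunks, which yields exactly a [max_len, batch, hidden*3] tensor. A shape-only sketch with hypothetical names:

```python
# Fused QKV projection: one weight matrix maps hidden -> 3*hidden in a
# single matmul, and the output is chunked into q, k, v. This is where a
# hidden_size*3 last dimension (e.g. 768*3 = 2304) typically comes from.
HIDDEN = 768

def fused_qkv_shapes(max_len, batch_size, hidden=HIDDEN):
    fused = (max_len, batch_size, 3 * hidden)   # e.g. (136, 16, 2304)
    q = k = v = (max_len, batch_size, hidden)   # after chunking into 3
    return fused, (q, k, v)

fused, (q, k, v) = fused_qkv_shapes(136, 16)
print(fused)  # (136, 16, 2304)
print(q)      # (136, 16, 768)
```

In PyTorch this pattern is usually `nn.Linear(hidden, 3 * hidden)` followed by `.chunk(3, dim=-1)`; fusing the three projections into one matmul is a common efficiency choice.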