Open zjreno opened 3 years ago
Hi, I have the same problem. What was your conclusion?
Hi, I have a bug related to this statement:

```
Traceback (most recent call last):
  File "train.py", line 340, in <module>
    train(args, device_id)
  File "train.py", line 272, in train
    trainer.train(train_iter_fct, args.train_steps)
  File "/root/code/BertSum/src/models/trainer.py", line 155, in train
    self._gradient_accumulation(
  File "/root/code/BertSum/src/models/trainer.py", line 326, in _gradient_accumulation
    loss.div(float(normalization)).backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Does it have any relation to that statement? Or have you solved it? Pardon my poor English!
OK, I have already solved the problem. It was caused by using BCELoss: you should apply a sigmoid layer to the output before passing it to the loss.
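A minimal sketch of the fix, using hypothetical toy tensors rather than the actual BertSum shapes: `nn.BCELoss` expects probabilities in `[0, 1]`, so raw logits can trigger a device-side assert on CUDA. Either pass the logits through `torch.sigmoid` first, or use `nn.BCEWithLogitsLoss`, which fuses the sigmoid in and is numerically safer.

```python
import torch
import torch.nn as nn

# Hypothetical raw scores from a model head (not the real BertSum outputs)
logits = torch.tensor([[2.0, -1.0, 0.5]])
target = torch.tensor([[1.0, 0.0, 1.0]])

# BCELoss needs inputs in [0, 1]: apply sigmoid before the loss
bce = nn.BCELoss()
loss_bce = bce(torch.sigmoid(logits), target)

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
bce_logits = nn.BCEWithLogitsLoss()
loss_fused = bce_logits(logits, target)

# Both formulations compute the same loss value
print(torch.allclose(loss_bce, loss_fused))
```

Passing the raw `logits` straight into `bce(...)` here would raise an error on CPU and a device-side assert on CUDA, which matches the traceback above.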
In https://github.com/nlpyang/BertSum/blob/master/src/models/trainer.py#L325, after `sum()`, `loss.numel()` must be 1. What is the difference between `(loss / loss.numel()).backward()` and `loss.backward()`? So I guess `loss.numel()` may be meant to express `n_docs`? Can we replace `(loss / loss.numel())` with `loss / normalization`?
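A small sketch of the point in question, with made-up per-token losses standing in for the real ones in `trainer.py`: after `sum()` the loss is a 0-d tensor, so `loss.numel()` is 1 and dividing by it is a no-op, whereas dividing by a `normalization` count actually rescales the gradients.

```python
import torch

# Hypothetical per-token losses (stand-in for the real batch losses)
per_token_loss = torch.tensor([0.5, 1.5, 2.0], requires_grad=True)

loss = per_token_loss.sum()
n = loss.numel()  # sum() yields a 0-d tensor, so numel() is 1

# Dividing by numel() == 1 changes nothing: identical to loss.backward()
(loss / n).backward()
g_numel = per_token_loss.grad.clone()

# Dividing by a real normalization count rescales every gradient
per_token_loss.grad = None
normalization = 2.0  # e.g. number of documents/tokens in the batch
(per_token_loss.sum() / normalization).backward()
g_norm = per_token_loss.grad.clone()

print(n)        # 1
print(g_numel)  # tensor([1., 1., 1.])
print(g_norm)   # tensor([0.5000, 0.5000, 0.5000])
```

So `(loss / loss.numel())` is equivalent to plain `loss`, while `loss / normalization` is a genuinely different (and arguably intended) scaling.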