Closed: fdlci closed this issue 3 years ago
Hello,
We encountered this error recently too, and I pushed a fix for it: cloning the model inputs before sending them to the teacher. One of the latest transformers releases introduced this behavior; I will ask the transformers team whether it is intended or a bug in transformers. Thank you for your report. I hope you pulled the fix before reading this message; we have been pretty busy with conferences over the last few weeks. (I am closing this.)
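For reference, the idea behind the fix looks roughly like this (a minimal sketch with illustrative names, not the exact committed code):

```python
import torch

def teacher_forward(teacher, inputs):
    # Clone every tensor in the batch before the teacher forward pass, so an
    # in-place modification made on the teacher side cannot touch a tensor
    # that the student's autograd graph still needs for backward().
    cloned = {k: v.clone() if torch.is_tensor(v) else v for k, v in inputs.items()}
    with torch.no_grad():
        return teacher(**cloned)
```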
Hi!
I am trying to run the launch_qa_sparse_single.py file from the question-answering example in your nn_pruning library. I haven't changed anything in the original code, and I get this error:
```
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
***** Running training *****
  Num examples = 131754
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 164700
  0%|          | 0/164700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "question_answering/launch_qa_sparse_single.py", line 33, in <module>
    main()
  File "question_answering/launch_qa_sparse_single.py", line 23, in main
    qa.run()
  File "./question_answering/xp.py", line 324, in run
    self.train()
  File "./question_answering/xp.py", line 312, in train
    model_path=model_path
  File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/ines/NN_pruning/nn_pruning/nn_pruning/sparse_trainer.py", line 86, in training_step
    return super().training_step(*args, **kwargs)
  File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/transformers/trainer.py", line 1542, in training_step
    loss.backward()
  File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [16]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
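Following the hint at the end of the error, I suppose I could rerun with anomaly detection enabled so that PyTorch points at the forward-pass operation whose result was later modified in place, something like:

```python
import torch

# Debug only: this makes every backward() much slower, but the resulting
# traceback also shows the forward operation that created the tensor that
# was later modified in place.
torch.autograd.set_detect_anomaly(True)
```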
Apart from that hint, I found several solutions to this problem on the internet, but all of them tell me to change something in the architecture of the model. Unfortunately, here we are using a Trainer from the transformers library, so I don't really know how to fix this issue. Thank you for your help.